Lakehouse
   HOME

TheInfoList



OR:

A data lake is a system or repository of data stored in its natural/raw format, usually object
blob Blob may refer to: Science Computing * Binary blob, in open source software, a non-free object file loaded into the kernel * Binary large object (BLOB), in computer database systems * A storage mechanism in the cloud computing platform Mic ...
s or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting,
visualization Visualization or visualisation may refer to: *Visualization (graphics), the physical or imagining creation of images, diagrams, or animations to communicate a message * Data visualization, the graphic representation of data * Information visualiz ...
, advanced analytics and
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
. A data lake can include structured data from
relational database A relational database is a (most commonly digital) database based on the relational model of data, as proposed by E. F. Codd in 1970. A system used to maintain relational databases is a relational database management system (RDBMS). Many relatio ...
s (rows and columns), semi-structured data ( CSV, logs,
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. T ...
,
JSON JSON (JavaScript Object Notation, pronounced ; also ) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other ser ...
), unstructured data ( emails, documents, PDFs) and binary data (images,
audio Audio most commonly refers to sound, as it is transmitted in signal form. It may also refer to: Sound *Audio signal, an electrical representation of sound *Audio frequency, a frequency in the audio spectrum *Digital audio, representation of sound ...
, video). A data lake can be established "on premises" (within an organization's data centers) or "in the cloud" (using cloud services from vendors such as
Amazon Amazon most often refers to: * Amazons, a tribe of female warriors in Greek mythology * Amazon rainforest, a rainforest covering most of the Amazon basin * Amazon River, in South America * Amazon (company), an American multinational technology c ...
,
Microsoft Microsoft Corporation is an American multinational technology corporation producing computer software, consumer electronics, personal computers, and related services headquartered at the Microsoft Redmond campus located in Redmond, Washing ...
, or
Google Google LLC () is an American multinational technology company focusing on search engine technology, online advertising, cloud computing, computer software, quantum computing, e-commerce, artificial intelligence, and consumer electronics. ...
).


Background

James Dixon, then chief technology officer at
Pentaho Pentaho is business intelligence (BI) software that provides data integration, OLAP services, reporting, information dashboards, data mining and extract, transform, load (ETL) capabilities. Its headquarters are in Orlando, Florida. Pentaho was ...
, coined the term by 2011 to contrast it with data mart, which is a smaller repository of interesting attributes derived from raw data. In promoting data lakes, he argued that data marts have several inherent problems, such as
information silo An information silo, or a group of such silos, is an insular management system in which one information system or subsystem is incapable of reciprocal operation with others that are, or should be, related. Thus information is not adequately shared ...
ing.
PricewaterhouseCoopers PricewaterhouseCoopers is an international professional services brand of firms, operating as partnerships under the PwC brand. It is the second-largest professional services network in the world and is considered one of the Big Four accounting ...
(PwC) said that data lakes could "put an end to data silos". In their study on data lakes they noted that enterprises were "starting to extract and place data for analytics into a single, Hadoop-based repository." Hortonworks,
Google Google LLC () is an American multinational technology company focusing on search engine technology, online advertising, cloud computing, computer software, quantum computing, e-commerce, artificial intelligence, and consumer electronics. ...
,
Oracle An oracle is a person or agency considered to provide wise and insightful counsel or prophetic predictions, most notably including precognition of the future, inspired by deities. As such, it is a form of divination. Description The word '' ...
,
Microsoft Microsoft Corporation is an American multinational technology corporation producing computer software, consumer electronics, personal computers, and related services headquartered at the Microsoft Redmond campus located in Redmond, Washing ...
,
Zaloni Zaloni is a privately owned, software, and services company headquartered in Durham, North Carolina, United States with offices in Guwahati, Assam, India and Bangalore, Karnataka, India. provides DataOps software for big data scale-out architect ...
,
Teradata Teradata Corporation is an American software company that provides cloud database and analytics-related software, products, and services. The company was formed in 1979 in Brentwood, California, as a collaboration between researchers at Caltech a ...
, Impetus Technologies, Cloudera,
MongoDB MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Serve ...
, and Amazon Web Services all used the term by 2016.


Examples

Many companies use cloud storage services such as
Google Cloud Storage Google Cloud Storage is a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. The service combines the performance and scalability of Google's cloud with advanced security and sharing ...
and Amazon S3 or a distributed file system such as
Apache Hadoop Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage ...
distributed file system (HDFS). There is a gradual academic interest in the concept of data lakes. For example, Personal DataLake at
Cardiff University , latin_name = , image_name = Shield of the University of Cardiff.svg , image_size = 150px , caption = Coat of arms of Cardiff University , motto = cy, Gwirionedd, Undod a Chytgord , mottoeng = Truth, Unity and Concord , established = 1 ...
is a new type of data lake which aims at managing
big data Though used sometimes loosely partly because of a lack of formal definition, the interpretation that seems to best describe Big data is the one associated with large body of information that we could not comprehend when used only in smaller am ...
of individual users by providing a single point of collecting, organizing, and sharing personal data. An earlier data lake (Hadoop 1.0) had limited capabilities with its batch-oriented processing ( Map Reduce) and was the only processing paradigm associated with it. Interacting with the data lake meant one had to have expertise in Java with map reduce and higher-level tools like Apache Pig,
Apache Spark Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of Californi ...
and Apache Hive (which by themselves were originally batch-oriented).


Criticism

Poorly-managed data lakes have been facetiously called data swamps. In June 2015, David Needle characterized "so-called data lakes" as "one of the more controversial ways to manage
big data Though used sometimes loosely partly because of a lack of formal definition, the interpretation that seems to best describe Big data is the one associated with large body of information that we could not comprehend when used only in smaller am ...
".
PwC PricewaterhouseCoopers is an international professional services brand of firms, operating as partnerships under the PwC brand. It is the second-largest professional services network in the world and is considered one of the Big Four accounting ...
was also careful to note in their research that not all data lake initiatives are successful. They quote Sean Martin, CTO of
Cambridge Semantics Cambridge Semantics is a privately held company headquartered in Boston, Massachusetts with an office in San Diego, California. The company is an enterprise big data management and exploratory analytics software company. History Cambridge Semant ...
: They describe companies that build successful data lakes as gradually maturing their lake as they figure out which data and
metadata Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata, including: * Descriptive metadata – the descriptive ...
are important to the organization. Another criticism is that the term "data lake" is not useful because it is used in so many different ways. It may be used to refer to, for example: any tools or data management practices that are not
data warehouses In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. DWs are central repositories of integr ...
; a particular technology for implementation; a raw data reservoir; a hub for ETL offload; or a central hub for self-service analytics. While critiques of data lakes are warranted, in many cases they apply to other data projects as well. For example, the definition of “data warehouse” is also changeable, and not all data warehouse efforts have been successful. In response to various critiques, McKinsey noted that the data lake should be viewed as a service model for delivering business value within the enterprise, not a technology outcome.


See also

*
Azure Data Lake Azure Data Lake is a scalable data storage and analytics service. The service is hosted in Azure, Microsoft's public cloud. History Azure Data Lake service was released on November 16, 2016. It is based on COSMOS, which is used to store and ...


References

{{Reflist Data management Cloud storage