Data lake

[Image: Example of a database that can be used by a data lake (in this case structured data)]

A data lake is a system or repository of data stored in its natural/raw format,[1] usually object blobs or files. A data lake is usually a single store of data, including raw copies of source system data, sensor data, and social data,[2] as well as transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video).[3] A data lake can be established "on premises" (within an organization's data centers) or "in the cloud" (using cloud services from vendors such as Amazon, Microsoft, Oracle Cloud, or Google).
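
As an illustration, the following minimal Python sketch lands data of each of these kinds, unchanged, in a cloud object store. It uses the AWS SDK for Python (boto3); the bucket name, key prefixes, and file names are hypothetical placeholders rather than any prescribed layout:

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-data-lake"  # hypothetical bucket name

    # Structured data: a CSV export from a relational source system
    s3.upload_file("orders.csv", BUCKET, "raw/orders/2024/01/01/orders.csv")

    # Semi-structured data: application logs in JSON Lines form
    s3.upload_file("app.jsonl", BUCKET, "raw/logs/2024/01/01/app.jsonl")

    # Unstructured and binary data are stored as-is, in their raw format
    s3.upload_file("invoice_0001.pdf", BUCKET, "raw/documents/invoice_0001.pdf")
    s3.upload_file("site_photo.jpg", BUCKET, "raw/images/site_photo.jpg")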

Background

James Dixon, then chief technology officer at Pentaho, coined the term by 2011[4] to contrast it with data mart, which is a smaller repository of interesting attributes derived from raw data.[5] In promoting data lakes, he argued that data marts have several inherent problems, such as information siloing. PricewaterhouseCoopers (PwC) said that data lakes could "put an end to data silos".[6] In their study on data lakes they noted that enterprises were "starting to extract and place data for analytics into a single, Hadoop-based repository."

Examples

Many companies use cloud storage services such as Google Cloud Storage and Amazon S3, or a distributed file system such as the Apache Hadoop Distributed File System (HDFS).[7] There is growing academic interest in the concept of data lakes. For example, Personal DataLake at Cardiff University is a type of data lake that aims to manage the big data of individual users by providing a single point for collecting, organizing, and sharing personal data.[8]
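
For comparison, the same kind of raw zone can sit on HDFS rather than an object store. The short Python sketch below lists such a zone through pyarrow's HDFS binding; the namenode address and paths are hypothetical, and a reachable Hadoop cluster with libhdfs installed locally is assumed:

    from pyarrow import fs

    # Connect to a (hypothetical) Hadoop namenode; requires libhdfs locally
    hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

    # Walk the raw zone; directory prefixes play the role of object keys
    for info in hdfs.get_file_info(fs.FileSelector("/data-lake/raw",
                                                   recursive=True)):
        print(info.path, info.size)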

Early data lakes based on Hadoop 1.0 had limited capabilities, because batch-oriented processing (MapReduce) was the only processing paradigm the platform supported. Interacting with such a data lake required expertise in Java and MapReduce, or in higher-level tools such as Apache Pig, Apache Spark, and Apache Hive (which were themselves originally batch-oriented).
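
A brief PySpark sketch of the batch, schema-on-read style of query that these higher-level tools brought to Hadoop data lakes; the input path and field name are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lake-batch-query").getOrCreate()

    # The schema is inferred when the raw JSON files are read
    # ("schema on read"), not enforced when they were written.
    events = spark.read.json("s3a://example-data-lake/raw/logs/")

    # A typical batch aggregation, comparable to a Hive query over the
    # same files; early MapReduce required hand-written Java for this.
    events.groupBy("event_type").count().show()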

Criticism

Poorly managed data lakes have been facetiously called data swamps.[9]

In June 2015, David Needle characterized "so-called data lakes" as "one of the more controversial ways to manage big data".[10] PwC was also careful to note in their research that not all data lake initiatives are successful. They quote Sean Martin, CTO of Cambridge Semantics:

We see customers creating big data graveyards, dumping everything into Hadoop distributed file system (HDFS) and hoping to do something with it down the road. But then they just lose track of what’s there. The main challenge is not creating a data lake, but taking advantage of the opportunities it presents.[6]

They describe companies that build successful data lakes as gradually maturing their lake as they figure out which data and metadata are important to the organization.

Another criticism is that the term "data lake" is not useful because it is used in so many different ways.[11] It may be used to refer to, for example: any tools or data management practices that are not data warehouses; a particular technology for implementation; a raw data reservoir; a hub for ETL offload; or a central hub for self-service analytics.

While critiques of data lakes are warranted, in many cases they apply to other data projects as well.[12] For example, the definition of “data warehouse” is also changeable, and not all data warehouse efforts have been successful. In response to various critiques, McKinsey noted[13] that the data lake should be viewed as a service model for delivering business value within the enterprise, not a technology outcome.

Data lakehouses

Data lakehouses are a hybrid approach that can ingest a variety of raw data formats like a data lake, yet provide ACID transactions and enforce data quality like a data warehouse.[14][15] A data lakehouse architecture attempts to address several criticisms of data lakes by adding data warehouse capabilities such as transaction support, schema enforcement, governance, and support for diverse workloads. According to Oracle, data lakehouses combine the "flexible storage of unstructured data from a data lake and the management features and tools from data warehouses".[16]
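
A minimal sketch of this pattern in Python, using the open-source Delta Lake table format with Apache Spark (one of several lakehouse implementations); the storage paths are hypothetical, and the delta-spark package is assumed to be installed and on the Spark classpath:

    from pyspark.sql import SparkSession

    # The configuration keys below are Delta Lake's documented Spark
    # settings; the application name and paths are hypothetical.
    spark = (
        SparkSession.builder.appName("lakehouse-example")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Read raw JSON files from the lake's landing zone (schema on read)
    raw = spark.read.json("s3a://example-data-lake/raw/logs/")

    # Writing in the Delta format turns the append into an ACID
    # transaction with schema enforcement: an append whose columns do not
    # match the existing table is rejected rather than silently accepted.
    (raw.write.format("delta")
        .mode("append")
        .save("s3a://example-data-lake/curated/events"))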

References

  1. "The growing importance of big data quality". The Data Roundtable. 21 November 2016. Retrieved 1 June 2020.
  2. "What is a data lake?". aws.amazon.com. Retrieved 12 October 2020.
  3. Campbell, Chris. "Top Five Differences between DataWarehouses and Data Lakes". Blue-Granite.com. Archived from the original on 14 March 2016.
  4. Woods, Dan (21 July 2011). "Big data requires a big architecture". Forbes.
  5. Dixon, James (14 October 2010). "Pentaho, Hadoop, and Data Lakes". James Dixon’s Blog. James Dixon. Retrieved 7 November 2015. If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
  6. Stein, Brian; Morrison, Alan (2014). Data lakes and the promise of unsiloed data (PDF) (Report). Technology Forecast: Rethinking integration. PricewaterhouseCoopers.
  7. Tuulos, Ville (22 September 2015). "Petabyte-Scale Data Pipelines with Docker, Luigi and Elastic Spot Instances". NextRoll.
  8. Walker, Coral; Alrehamy, Hassan (2015). "Personal Data Lake with Data Gravity Pull". 2015 IEEE Fifth International Conference on Big Data and Cloud Computing. pp. 160–167. doi:10.1109/BDCloud.2015.62. ISBN 978-1-4673-7183-4. S2CID 18024161.
  9. Olavsrud, Thor (8 June 2017). "3 keys to keep your data lake from becoming a data swamp". CIO. Retrieved 4 January 2021.
  10. Needle, David (10 June 2015). "Hadoop Summit: Wrangling Big Data Requires Novel Tools, Techniques". Enterprise Apps. eWeek. Retrieved 1 November 2015. Walter Maguire, chief field technologist at HP's Big Data Business Unit, discussed one of the more controversial ways to manage big data, so-called data lakes.
  11. "Are Data Lakes Fake News?". Sonra. 8 August 2017. Retrieved 10 August 2017.
  12. Belov, Vladimir; Kosenkov, Alexander N.; Nikulchev, Evgeny (2021). "Experimental Characteristics Study of Data Storage Formats for Data Marts Development within Data Lakes". Applied Sciences. 11 (18): 8651. doi:10.3390/app11188651.
  13. "A smarter way to jump into data lakes". McKinsey. 1 August 2017.
  14. "What is a Data Lakehouse?". Databricks.
  15. "What is a Data Lakehouse?". Snowflake.
  16. "What is a Data Lakehouse?". Oracle.
