Data lake

A data lake is a system or repository of data stored in its natural/raw format, [1] usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). [2] A data lake can be established "on premises" (within an organization's data centers) or "in the cloud" (using cloud services from vendors such as Amazon, Google and Microsoft).
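
As a minimal illustration of the variety of data a lake can hold, the Python sketch below lands structured, semi-structured and binary source data in Amazon S3 using boto3. The bucket and key names are hypothetical; equivalent client libraries exist for Google Cloud Storage and Azure.

```python
# A minimal sketch, assuming a hypothetical "example-lake" S3 bucket, of how
# raw source data of different shapes lands in a cloud data lake unchanged.
import json

import boto3

s3 = boto3.client("s3")

# Structured data: a CSV export (rows and columns) from a relational source,
# stored as-is as a raw copy.
s3.upload_file("orders_2020-06-01.csv", "example-lake",
               "raw/erp/orders/2020-06-01.csv")

# Semi-structured data: an application event kept in its native JSON form.
event = {"user": 42, "action": "login", "ts": "2020-06-01T12:00:00Z"}
s3.put_object(Bucket="example-lake",
              Key="raw/app/events/2020-06-01.json",
              Body=json.dumps(event))

# Unstructured binary data: a scanned document stored as an opaque object blob.
s3.upload_file("invoice-0001.pdf", "example-lake",
               "raw/scans/invoice-0001.pdf")
```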

A data swamp is a deteriorated and unmanaged data lake that is either inaccessible to its intended users or is providing little value. [3]

Background

James Dixon, then chief technology officer at Pentaho, is credited with coining the term [4] to contrast it with the data mart, which is a smaller repository of interesting attributes derived from raw data. [5] In promoting data lakes, he argued that data marts have several inherent problems, such as information siloing. PricewaterhouseCoopers said that data lakes could "put an end to data silos." [6] In their study on data lakes, they noted that enterprises were "starting to extract and place data for analytics into a single, Hadoop-based repository." Hortonworks, Google, Oracle, Microsoft, Zaloni, Teradata, Impetus Technologies, Cloudera, and Amazon now all have data lake offerings. [7]

Examples

Many companies use cloud storage services such as Google Cloud Storage and Amazon S3, or a distributed file system such as Apache Hadoop. [8] There is growing academic interest in the concept of data lakes. For example, Personal DataLake at Cardiff University is a new type of data lake which aims at managing big data of individual users by providing a single point of collecting, organizing, and sharing personal data. [9] [10] Earlier data lakes built on Hadoop 1.0 had limited capabilities, because batch-oriented processing (MapReduce) was the only processing paradigm the platform supported. Interacting with such a data lake required expertise in Java and the MapReduce API, or in higher-level tools such as Apache Pig, Apache Spark and Apache Hive (which were themselves originally batch-oriented).
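
As an illustration of the newer, more interactive style, the sketch below uses PySpark to query raw JSON files directly where they sit in object storage. The bucket and paths are hypothetical; the same pattern applies to HDFS or Google Cloud Storage URIs.

```python
# A minimal PySpark sketch of querying raw files in place in a data lake.
# Bucket and path names are hypothetical; hdfs:// or gs:// URIs work the same.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-query").getOrCreate()

# Read the semi-structured source data exactly as it landed in the lake;
# Spark infers a schema from the JSON documents.
events = spark.read.json("s3a://example-lake/raw/app/events/")

# Filter and aggregate without first loading the data into a warehouse.
daily_logins = (events.filter(events.action == "login")
                      .groupBy("user")
                      .count())

# Persist the derived result in a curated zone of the same lake.
daily_logins.write.mode("overwrite").parquet(
    "s3a://example-lake/curated/daily_logins/")
```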

Criticism

In June 2015, David Needle characterized "so-called data lakes" as "one of the more controversial ways to manage big data". [11] PricewaterhouseCoopers was also careful to note in their research that not all data lake initiatives are successful. They quote Sean Martin, CTO of Cambridge Semantics,

We see customers creating big data graveyards, dumping everything into Hadoop distributed file system (HDFS) and hoping to do something with it down the road. But then they just lose track of what’s there.
The main challenge is not creating a data lake, but taking advantage of the opportunities it presents. [6]

They describe companies that build successful data lakes as gradually maturing their lake as they figure out which data and metadata are important to the organization. Another criticism is that the concept is fuzzy and arbitrary: it is applied to any tool or data management practice that does not fit into the traditional data warehouse architecture. The data lake has variously been referred to as a particular technology, labeled a raw-data reservoir or a hub for ETL offload, and defined as a central hub for self-service analytics. The concept has been overloaded with so many meanings that the usefulness of the term is in question. [12]

While critiques of data lakes are warranted, in many cases they are overly broad and could be applied to any technology endeavor in general and to data projects in particular. For example, the term “data warehouse” currently suffers from the same opaque and shifting definition as the data lake, and not all data warehouse efforts have been successful either. In response to these critiques, McKinsey noted [13] that the data lake should be viewed as a service model for delivering business value within the enterprise, not a technology outcome.

Related Research Articles

Data warehouse

In computing, a data warehouse, also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in a single place and are used to create analytical reports for workers throughout the enterprise.

IBM Db2

Db2 is a family of data management products, including database servers, developed by IBM. They initially supported the relational model, but were extended to support object-relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB/2, then DB2 until 2017 and finally changed to its present form.

Business intelligence software is a type of application software designed to retrieve, analyze, transform and report data for business intelligence. The applications generally read data that has been previously stored, often - though not necessarily - in a data warehouse or data mart.

Vertica

Vertica Systems is an analytic database management software company. Vertica was founded in 2005 by database researcher Michael Stonebraker and Andrew Palmer. Palmer was the founding CEO. Ralph Breslauer and Christopher P. Lynch served as later CEOs.

Pentaho is business intelligence (BI) software that provides data integration, OLAP services, reporting, information dashboards, data mining and extract, transform, load (ETL) capabilities. Its headquarters are in Orlando, Florida. Pentaho was acquired by Hitachi Data Systems in 2015 and in 2017 became part of Hitachi Vantara.

Cloudera, Inc. is a US-based software company that provides a software platform for data engineering, data warehousing, machine learning and analytics that runs in the cloud or on premises.

Apache Hive

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.
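
As a brief illustration of that abstraction, the hedged sketch below issues HiveQL through PySpark's Hive support instead of writing a MapReduce job by hand; the table, columns and storage location are hypothetical assumptions.

```python
# A minimal sketch of Hive's SQL abstraction, issued via PySpark's Hive
# support. Table, column and path names are hypothetical assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hiveql-example")
         .enableHiveSupport()
         .getOrCreate())

# Declare a table over files already sitting in the lake; no data is copied.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
        page  STRING,
        views BIGINT
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///lake/curated/page_views'
""")

# One HiveQL statement replaces what would otherwise be a hand-written
# job against the low-level MapReduce Java API.
spark.sql("""
    SELECT page, SUM(views) AS total
    FROM page_views
    GROUP BY page
    ORDER BY total DESC
    LIMIT 10
""").show()
```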

SnapLogic is a commercial software company that provides Integration Platform as a Service (iPaaS) tools for connecting cloud data sources, SaaS applications and on-premises business software applications. Headquartered in San Mateo, California, SnapLogic was founded in 2006. It is headed by Gaurav Dhillon, co-founder and former CEO of Informatica, and is venture-backed by Andreessen Horowitz, Ignition Partners, Floodgate Fund, Brian McClendon, and Naval Ravikant. As of 2017, the company had raised $136.3 million.

HPCC

HPCC, also known as DAS, is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data. The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL.

Apache Drill

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open-source version of Google's Dremel system, which is available as an infrastructure service called Google BigQuery. One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds. Drill is an Apache top-level project.

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.

Amazon Redshift is a data warehouse product which forms part of the larger cloud-computing platform Amazon Web Services. The name is a reference to shifting away from Oracle, whose corporate color is red and which is informally referred to as "Big Red." It is built on top of technology from the massively parallel processing (MPP) data warehouse company ParAccel to handle large-scale data sets and database migrations. Redshift differs from Amazon's other hosted database offering, Amazon RDS, in its ability to handle analytic workloads on large data sets using column-oriented storage.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

Cloud analytics is a marketing term for carrying out data analysis using cloud computing. It uses a range of analytical tools and techniques to help companies extract information from massive data sets and present it in a way that is easily categorised and readily available via a web browser.

Fluentd is a cross-platform open-source data collection software project originally developed at Treasure Data. It is written primarily in the Ruby programming language.

Presto is a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Kafka, and MongoDB, and even to combine data from multiple sources within a single query. Presto is community-driven open-source software released under the Apache License.
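
As a sketch of that federation, the example below submits one query spanning two catalogs through the prestodb Python client; the host, catalogs and table names are hypothetical assumptions.

```python
# A minimal sketch of a federated Presto query from Python, using the
# prestodb client. Host, catalog, schema and table names are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# Join a fact table stored in the Hadoop/S3 lake (hive catalog) with a
# dimension table living in MySQL: two data sources, one query.
cur.execute("""
    SELECT c.region, COUNT(*) AS orders
    FROM hive.sales.orders AS o
    JOIN mysql.crm.customers AS c ON o.customer_id = c.id
    GROUP BY c.region
""")
for region, orders in cur.fetchall():
    print(region, orders)
```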

BlueTalon, Inc. was a private enterprise software company that provided data-centric security, user access control, data masking, and auditing solutions for complex, hybrid data environments. BlueTalon was founded in 2013 and headquartered in Redwood City, California.

Azure Data Lake is a scalable data storage and analytics service. The service is hosted in Azure, Microsoft's public cloud.

References

  1. "The growing importance of big data quality". The Data Roundtable. Retrieved 1 June 2020.
  2. Campbell, Chris. "Top Five Differences between DataWarehouses and Data Lakes". Blue-Granite.com. Retrieved 19 May 2017.
  3. Olavsrud, Thor. "3 keys to keep your data lake from becoming a data swamp". CIO. Retrieved 5 July 2017.
  4. Woods, Dan (21 July 2011). "Big data requires a big architecture". Tech. Forbes.
  5. Dixon, James (14 October 2010). "Pentaho, Hadoop, and Data Lakes". James Dixon’s Blog. James. Retrieved 7 November 2015. If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
  6. Stein, Brian; Morrison, Alan (2014). Data lakes and the promise of unsiloed data (PDF) (Report). Technology Forecast: Rethinking integration. PricewaterhouseCoopers.
  7. Weaver, Lance (10 November 2016). "Why Companies are Jumping into Data Lakes". blog.equinox.com. Retrieved 19 May 2017.
  8. Tuulos, Ville (22 September 2015). "Petabyte-Scale Data Pipelines with Docker, Luigi and Elastic Spot Instances".
  9. Walker, Coral; Alrehamy, Hassan (2015). "Personal Data Lake with Data Gravity Pull". 2015 IEEE Fifth International Conference on Big Data and Cloud Computing. pp. 160–167. doi:10.1109/BDCloud.2015.62. ISBN 978-1-4673-7183-4.
  10. "Personal Data Lake With Data Gravity Pull". ResearchGate. https://www.researchgate.net/publication/283053696_Personal_Data_Lake_With_Data_Gravity_Pull
  11. Needle, David (10 June 2015). "Hadoop Summit: Wrangling Big Data Requires Novel Tools, Techniques". Enterprise Apps. eWeek. Retrieved 1 November 2015. Walter Maguire, chief field technologist at HP's Big Data Business Unit, discussed one of the more controversial ways to manage big data, so-called data lakes.
  12. "Are Data Lakes Fake News?". Sonra. 8 August 2017. Retrieved 10 August 2017.
  13. "A smarter way to jump into data lakes". McKinsey. 1 August 2017.