Lambda architecture

Flow of data through the processing and serving layers of a generic lambda architecture

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. This approach attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of MapReduce. [1]

Lambda architecture depends on a data model with an append-only, immutable data source that serves as a system of record. [2] :32 It is intended for ingesting and processing timestamped events that are appended to existing events rather than overwriting them. State is determined from the natural time-based ordering of the data.
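
The data model described above can be illustrated with a minimal sketch. The `Event` and `EventLog` names, the integer timestamps, and the "latest event per key wins" rule are assumptions made for illustration; they are not taken from any particular system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)  # frozen: an event cannot be mutated once created
class Event:
    timestamp: int  # assumed unit, e.g. seconds since some epoch
    key: str
    value: str


class EventLog:
    """Hypothetical append-only system of record."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        # New facts are only ever added; nothing is updated or deleted.
        self._events.append(event)

    def events(self) -> List[Event]:
        # Return a copy so callers cannot modify the log itself.
        return list(self._events)


def current_state(log: EventLog) -> Dict[str, str]:
    """Derive state by replaying events in timestamp order (latest per key wins)."""
    state: Dict[str, str] = {}
    for event in sorted(log.events(), key=lambda e: e.timestamp):
        state[event.key] = event.value
    return state


log = EventLog()
log.append(Event(1, "alice", "London"))
log.append(Event(3, "alice", "Paris"))   # appended, does not overwrite the first event
log.append(Event(2, "bob", "Berlin"))
print(current_state(log))
```

Because the earlier "London" event is still in the log, the state at any past point in time could in principle be rebuilt by replaying only the events up to that timestamp.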

Overview

Lambda architecture describes a system consisting of three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries. [3] :13 The processing layers ingest from an immutable master copy of the entire data set. This paradigm was first described by Nathan Marz in a blog post titled "How to beat the CAP theorem" in which he originally termed it the "batch/realtime architecture". [4]

Batch layer

The batch layer precomputes results using a distributed processing system that can handle very large quantities of data. The batch layer aims at perfect accuracy by being able to process all available data when generating views. This means it can fix any errors by recomputing based on the complete data set, then updating existing views. Output is typically stored in a read-only database, with updates completely replacing existing precomputed views. [3] :18
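
As a rough sketch of this recompute-and-replace cycle, the following assumes a page-view count as the precomputed view. `compute_batch_view`, `BatchViewStore`, and the record shape are hypothetical names; a real batch layer would run the computation on a distributed system rather than in one process.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

PageView = Tuple[int, str]  # (timestamp, url); a hypothetical record shape


def compute_batch_view(master_dataset: Iterable[PageView]) -> Dict[str, int]:
    """Recompute the whole view from scratch over all available data."""
    return dict(Counter(url for _, url in master_dataset))


class BatchViewStore:
    """Hypothetical store that is read-only to queries; only the batch job swaps views."""

    def __init__(self) -> None:
        self._view: Dict[str, int] = {}

    def replace(self, new_view: Dict[str, int]) -> None:
        # The new view completely replaces the old one; no in-place updates.
        self._view = dict(new_view)

    def get(self, url: str) -> int:
        return self._view.get(url, 0)


master_dataset = [(1, "/home"), (2, "/about"), (3, "/home")]
store = BatchViewStore()
store.replace(compute_batch_view(master_dataset))

# A later batch run: more data has been appended, so recompute and swap.
master_dataset.append((4, "/home"))
store.replace(compute_batch_view(master_dataset))
print(store.get("/home"))  # 3
```

Since each run starts from the complete data set, a bug in an earlier version of `compute_batch_view` would be corrected simply by the next run with fixed code.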

By 2014, Apache Hadoop was estimated to be a leading batch-processing system. [5] Later, relational databases such as Snowflake, Redshift, Synapse and BigQuery were also used in this role.

Speed layer

Diagram showing the flow of data through the processing and serving layers of lambda architecture. Example named components are shown.

The speed layer processes data streams in real time and without the requirements of fix-ups or completeness. This layer sacrifices throughput as it aims to minimize latency by providing real-time views into the most recent data. Essentially, the speed layer is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. This layer's views may not be as accurate or complete as the ones eventually produced by the batch layer, but they are available almost immediately after data is received, and can be replaced when the batch layer's views for the same data become available. [3] :203
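
One way to picture this is an incrementally updated view whose entries are discarded once the batch layer has covered the same time range. The sketch below is an assumption-laden illustration: `RealtimeView`, `expire_through`, and the idea of a "batch horizon" timestamp are hypothetical names, not part of any named stream processor.

```python
from collections import defaultdict
from typing import DefaultDict


class RealtimeView:
    """Hypothetical speed-layer view: page-view counts kept per timestamp."""

    def __init__(self) -> None:
        # _counts[timestamp][url] -> number of views seen at that timestamp
        self._counts: DefaultDict[int, DefaultDict[str, int]] = defaultdict(
            lambda: defaultdict(int)
        )

    def update(self, timestamp: int, url: str) -> None:
        # Incremental update: one event in, one counter bumped, no recomputation.
        self._counts[timestamp][url] += 1

    def expire_through(self, batch_horizon: int) -> None:
        # Drop everything the batch layer now covers (timestamps <= horizon).
        for ts in [t for t in self._counts if t <= batch_horizon]:
            del self._counts[ts]

    def get(self, url: str) -> int:
        return sum(bucket.get(url, 0) for bucket in self._counts.values())


view = RealtimeView()
view.update(101, "/home")
view.update(102, "/home")
view.update(103, "/about")
before = view.get("/home")  # both recent /home events are visible immediately

view.expire_through(101)    # batch views now include data up to timestamp 101
after = view.get("/home")
print(before, after)        # 2 1
```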

Stream-processing technologies typically used in this layer include Apache Kafka, Amazon Kinesis, Apache Storm, SQLstream, Apache Samza, Apache Spark, Azure Stream Analytics, and Apache Flink. Output is typically stored in fast NoSQL databases, [6] [7] or as a commit log. [8]

Serving layer

Diagram showing a lambda architecture with a Druid data store.

Output from the batch and speed layers is stored in the serving layer, which responds to ad-hoc queries by returning precomputed views or building views from the processed data.
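
A query against the serving layer can be sketched as a merge of the two views. The example assumes an additive metric (page-view counts) and assumes the realtime view holds only data not yet covered by the batch view; `query_page_views` and both dictionaries are hypothetical.

```python
from typing import Mapping


def query_page_views(
    url: str,
    batch_view: Mapping[str, int],
    realtime_view: Mapping[str, int],
) -> int:
    """Answer a query by combining the precomputed batch and realtime views."""
    # Counts are additive, so the merge is a simple sum. Other metrics
    # (for example distinct counts) would need different merge logic.
    return batch_view.get(url, 0) + realtime_view.get(url, 0)


batch_view = {"/home": 1000, "/about": 250}   # complete, but lags behind
realtime_view = {"/home": 12, "/pricing": 3}  # only data since the last batch run

print(query_page_views("/home", batch_view, realtime_view))     # 1012
print(query_page_views("/pricing", batch_view, realtime_view))  # 3
```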

Examples of technologies used in the serving layer include Apache Druid, Apache Pinot, ClickHouse and Tinybird, which provide a single platform to handle output from both layers. [9] Dedicated stores used in the serving layer include Apache Cassandra, Apache HBase, Azure Cosmos DB, MongoDB, VoltDB or Elasticsearch for speed-layer output, and ElephantDB, Apache Impala, SAP HANA or Apache Hive for batch-layer output. [2] :45 [6]

Optimizations

To optimize the data set and improve query efficiency, various rollup and aggregation techniques are executed on raw data, [9] :23 while estimation techniques are employed to further reduce computation costs. [10] And while expensive full recomputation is required for fault tolerance, incremental computation algorithms may be selectively added to increase efficiency, and techniques such as partial computation and resource-usage optimizations can effectively help lower latency. [3] :93,287,293
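
A rollup can be sketched as collapsing raw events into one row per time bucket and dimension. The hourly granularity, the record shape, and the `rollup_hourly` name below are assumptions for illustration; this is not a description of how any specific data store implements rollup.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

RawEvent = Tuple[int, str, int]  # (timestamp_seconds, url, bytes_sent); hypothetical
RollupKey = Tuple[int, str]      # (hour_start_seconds, url)


def rollup_hourly(events: Iterable[RawEvent]) -> Dict[RollupKey, Dict[str, int]]:
    """Aggregate raw events into one row per (hour, url) with a count and a byte sum."""
    rolled: Dict[RollupKey, Dict[str, int]] = defaultdict(
        lambda: {"count": 0, "bytes": 0}
    )
    for timestamp, url, bytes_sent in events:
        hour_start = timestamp - (timestamp % 3600)  # truncate to the hour
        row = rolled[(hour_start, url)]
        row["count"] += 1
        row["bytes"] += bytes_sent
    return dict(rolled)


raw = [
    (10, "/home", 500),
    (1800, "/home", 700),
    (3700, "/home", 200),
    (20, "/about", 100),
]
rolled = rollup_hourly(raw)
print(len(raw), "raw events ->", len(rolled), "rolled-up rows")
```

Queries at hourly or coarser granularity can then scan the smaller rolled-up set, at the cost of losing the individual events inside each bucket.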

Lambda architecture in use

Metamarkets, which provides analytics for companies in the programmatic advertising space, employs a version of the lambda architecture that uses Druid for storing and serving both the streamed and batch-processed data. [9] :42

For running analytics on its advertising data warehouse, Yahoo has taken a similar approach, also using Apache Storm, Apache Hadoop, and Druid. [11] :9,16

The Netflix Suro project has separate processing paths for data, but does not strictly follow lambda architecture since the paths may be intended to serve different purposes and not necessarily to provide the same type of views. [12] Nevertheless, the overall idea is to make selected real-time event data available to queries with very low latency, while the entire data set is also processed via a batch pipeline. The latter is intended for applications that are less sensitive to latency and require MapReduce-style processing.

Criticism and alternatives

Criticism of lambda architecture has focused on its inherent complexity and its limiting influence. The batch and streaming sides each require a different code base that must be maintained and kept in sync so that processed data produces the same result from both paths. Yet attempting to abstract the code bases into a single framework puts many of the specialized tools in the batch and real-time ecosystems out of reach. [13]

Kappa architecture

Jay Kreps introduced the kappa architecture to use a pure streaming approach with a single code base. [13] In a technical discussion over the merits of employing a pure streaming approach, it was noted that using a flexible streaming framework such as Apache Samza could provide some of the same benefits as batch processing without the latency. [14] Such a streaming framework could allow for collecting and processing arbitrarily large windows of data, accommodate blocking, and handle state.
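
The single-code-base idea can be sketched as one processing function used both for live events and for reprocessing. The replay-from-a-retained-log step shown here is an assumption about how reprocessing might be done in such a design; `process`, `replay`, and the in-memory list standing in for a durable log are all hypothetical.

```python
from typing import Callable, Dict, List

Event = str            # a page URL, for illustration
View = Dict[str, int]  # url -> count


def process(view: View, event: Event) -> None:
    """The single piece of processing logic, used for live and replayed events alike."""
    view[event] = view.get(event, 0) + 1


def replay(
    log: List[Event],
    processor: Callable[[View, Event], None],
    from_offset: int = 0,
) -> View:
    """Rebuild a view by streaming retained events through the same processor."""
    view: View = {}
    for event in log[from_offset:]:
        processor(view, event)
    return view


log: List[Event] = []  # stands in for a durable, replayable log
live_view: View = {}

for event in ["/home", "/about", "/home"]:
    log.append(event)          # retain the event for possible reprocessing
    process(live_view, event)  # update the live view as the event arrives

reprocessed_view = replay(log, process)  # e.g. after deploying changed logic
print(reprocessed_view == live_view)     # True
```

With only one function to maintain, there is no second implementation that has to be kept in sync, which is the maintenance burden the criticism above points at.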

References

  1. Schuster, Werner. "Nathan Marz on Storm, Immutability in the Lambda Architecture, Clojure". www.infoq.com. Interview with Nathan Marz, 6 April 2014.
  2. Bijnens, Nathan. "A real-time architecture using Hadoop and Storm". 11 December 2013.
  3. Marz, Nathan; Warren, James. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications, 2013.
  4. Marz, Nathan. "How to beat the CAP theorem". 13 October 2011.
  5. Kar, Saroj. "Hadoop Sector will Have Annual Growth of 58% for 2013-2020". Cloud Times, 28 May 2014. Archived 2014-08-26 at archive.today.
  6. Kinley, James. "The Lambda architecture: principles for architecting realtime Big Data systems". Archived 2014-09-04 at the Wayback Machine. Retrieved 26 August 2014.
  7. Ferrera Bertran, Pere. "Lambda Architecture: A state-of-the-art". Datasalt, 17 January 2014.
  8. Confluent. "Kafka and Events – Key/Value Pairs". Retrieved 6 October 2022.
  9. Yang, Fangjin; Merlino, Gian. "Real-time Analytics with Open Source Technologies". 30 July 2014.
  10. Ray, Nelson. "The Art of Approximating Distributions: Histograms and Quantiles at Scale". Metamarkets, 12 September 2013.
  11. Rao, Supreeth; Gupta, Sunil. "Interactive Analytics in Human Time". 17 June 2014.
  12. Bae, Jae Hyeon; Yuan, Danny; Tonse, Sudhir. "Announcing Suro: Backbone of Netflix's Data Pipeline". Netflix, 9 December 2013.
  13. Kreps, Jay. "Questioning the Lambda Architecture". oreilly.com. O'Reilly. Retrieved 3 October 2024.
  14. Hacker News. Retrieved 20 August 2014.

[1]

  1. "Lambda vs Kappa Architecture". www.interlinkjobs.com. Retrieved 2024-08-01.