Lambda Architecture
   HOME

TheInfoList



OR:

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both
batch Batch may refer to: Food and drink * Batch (alcohol), an alcoholic fruit beverage * Batch loaf, a type of bread popular in Ireland * A dialect term for a bread roll used in North Warwickshire, Nuneaton and Coventry, as well as on the Wirra ...
and stream-processing methods. This approach to architecture attempts to balance latency,
throughput Network throughput (or just throughput, when in context) refers to the rate of message delivery over a communication channel, such as Ethernet or packet radio, in a communication network. The data that these messages contain may be delivered ov ...
, and
fault-tolerance Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of one or more faults within some of its components. If its operating quality decreases at all, the decrease is proportional to the ...
by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of lambda architecture is correlated with the growth of
big data Though used sometimes loosely partly because of a lack of formal definition, the interpretation that seems to best describe Big data is the one associated with large body of information that we could not comprehend when used only in smaller am ...
, real-time analytics, and the drive to mitigate the latencies of
map-reduce MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a ''map'' procedure, which performs filtering ...
. Lambda architecture depends on a data model with an append-only, immutable data source that serves as a system of record.Bijnens, Nathan
"A real-time architecture using Hadoop and Storm"
11 December 2013.
It is intended for ingesting and processing timestamped events that are appended to existing events rather than overwriting them. State is determined from the natural time-based ordering of the data.


Overview

Lambda architecture describes a system consisting of three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries.Marz, Nathan; Warren, James. ''Big Data: Principles and best practices of scalable realtime data systems''. Manning Publications, 2013. The processing layers ingest from an immutable master copy of the entire data set. This paradigm was first described by Nathan Marz in a blog post titled "How to beat the
CAP theorem In theoretical computer science, the CAP theorem, also named Brewer's theorem after computer scientist Eric Brewer, states that any distributed data store can provide only two of the following three guarantees:Seth Gilbert and Nancy Lynch"Brewer' ...
" in which he originally termed it the "batch/realtime architecture".Marz, Nathan
"How to beat the CAP theorem"
13 October 2011.


Batch layer

The batch layer precomputes results using a distributed processing system that can handle very large quantities of data. The batch layer aims at perfect accuracy by being able to process ''all'' available data when generating views. This means it can fix any errors by recomputing based on the complete data set, then updating existing views. Output is typically stored in a read-only database, with updates completely replacing existing precomputed views. By 2014,
Apache Hadoop Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage ...
was estimated to be a leading batch-processing system. Later, other, relational databases like
Snowflake A snowflake is a single ice crystal that has achieved a sufficient size, and may have amalgamated with others, which falls through the Earth's atmosphere as snow.Knight, C.; Knight, N. (1973). Snow crystals. Scientific American, vol. 228, no. ...
, Redshift, Synapse and Big Query were also used in this role.


Speed layer

The speed layer processes data streams in real time and without the requirements of fix-ups or completeness. This layer sacrifices throughput as it aims to minimize latency by providing real-time views into the most recent data. Essentially, the speed layer is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. This layer's views may not be as accurate or complete as the ones eventually produced by the batch layer, but they are available almost immediately after data is received, and can be replaced when the batch layer's views for the same data become available. Stream-processing technologies typically used in this layer include
Apache Kafka Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency plat ...
, Amazon Kinesis,
Apache Storm Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and team at BackType, the project was open sourced after being acquired by Twitter. ...
,
SQLstream SQLstream is a distributed, SQL standards-compliant plus Java stream processing platform. SQLstream, Inc. is based in San Francisco, California and was launched in 2009 by Damian Black, Edan Kabatchnik and Julian Hyde, author of the open source Mo ...
,
Apache Samza Apache Samza is an open-source, near-realtime, asynchronous computational framework for stream processing developed by the Apache Software Foundation in Scala and Java. It has been developed in conjunction with Apache Kafka. Both were originally ...
,
Apache Spark Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of Californi ...
,
Azure Stream Analytics Microsoft Azure Stream Analytics is a serverless scalable complex event processing engine by Microsoft that enables users to develop and run real-time analytics on multiple streams of data from sources such as devices, sensors, web sites, social m ...
. Output is typically stored on fast NoSQL databases.Kinley, James
"The Lambda architecture: principles for architecting realtime Big Data systems"
retrieved 26 August 2014.
, or as a commit log. Confluen
"Kafka and Events – Key/Value Pairs"
retrieved 06 October 2022.


Serving layer

Output from the batch and speed layers are stored in the serving layer, which responds to ad-hoc queries by returning precomputed views or building views from the processed data. Examples of technologies used in the serving layer include
Druid A druid was a member of the high-ranking class in ancient Celtic cultures. Druids were religious leaders as well as legal authorities, adjudicators, lorekeepers, medical professionals and political advisors. Druids left no written accounts. Whi ...
, which provides a single cluster to handle output from both layers.Yang, Fangjin, and Merlino, Gian
"Real-time Analytics with Open Source Technologies"
30 July 2014.
Dedicated stores used in the serving layer include
Apache Cassandra Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandr ...
,
Apache HBase HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File Syst ...
, Azure Cosmos DB,
MongoDB MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Serve ...
,
VoltDB Volt Active Data (formerly VoltDB) is an in-memory database designed by Michael Stonebraker, Sam Madden, and Daniel Abadi. It is an ACID-compliant RDBMS that uses a shared-nothing architecture, and is derived from work done by Stonebraker on OLTP ...
or
Elasticsearch Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is dual-l ...
for speed-layer output, an
Elephant DB
Apache Impala Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development ...
,
SAP HANA SAP HANA (HochleistungsANalyseAnwendung or High-performance ANalytic Application) is an in-memory, column-oriented, relational database management system developed and marketed by SAP SE. Its primary function as the software running a databas ...
or
Apache Hive Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditi ...
for batch-layer output.


Optimizations

To optimize the data set and improve query efficiency, various rollup and aggregation techniques are executed on raw data, while estimation techniques are employed to further reduce computation costs. And while expensive full recomputation is required for fault tolerance, incremental computation algorithms may be selectively added to increase efficiency, and techniques such as ''partial computation'' and resource-usage optimizations can effectively help lower latency.


Lambda architecture in use

Metamarkets, which provides analytics for companies in the programmatic advertising space, employs a version of the lambda architecture that uses
Druid A druid was a member of the high-ranking class in ancient Celtic cultures. Druids were religious leaders as well as legal authorities, adjudicators, lorekeepers, medical professionals and political advisors. Druids left no written accounts. Whi ...
for storing and serving both the streamed and batch-processed data. For running analytics on its advertising data warehouse,
Yahoo Yahoo! (, styled yahoo''!'' in its logo) is an American web services provider. It is headquartered in Sunnyvale, California and operated by the namesake company Yahoo! Inc. (2017–present), Yahoo Inc., which is 90% owned by investment funds ma ...
has taken a similar approach, also using
Apache Storm Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and team at BackType, the project was open sourced after being acquired by Twitter. ...
,
Apache Hadoop Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage ...
, and
Druid A druid was a member of the high-ranking class in ancient Celtic cultures. Druids were religious leaders as well as legal authorities, adjudicators, lorekeepers, medical professionals and political advisors. Druids left no written accounts. Whi ...
.Rao, Supreeth; Gupta, Sunil
"Interactive Analytics in Human Time"
17 June 2014
The
Netflix Netflix, Inc. is an American subscription video on-demand over-the-top streaming service and production company based in Los Gatos, California. Founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California, it offers a fil ...
Suro project has separate processing paths for data, but does not strictly follow lambda architecture since the paths may be intended to serve different purposes and not necessarily to provide the same type of views.Bae, Jae Hyeon; Yuan, Danny; Tonse, Sudhir
"Announcing Suro: Backbone of Netflix's Data Pipeline"
''
Netflix Netflix, Inc. is an American subscription video on-demand over-the-top streaming service and production company based in Los Gatos, California. Founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California, it offers a fil ...
'', 9 December 2013
Nevertheless, the overall idea is to make selected real-time event data available to queries with very low latency, while the entire data set is also processed via a batch pipeline. The latter is intended for applications that are less sensitive to latency and require a map-reduce type of processing.


Criticism and alternatives

Criticism of lambda architecture has focused on its inherent complexity and its limiting influence. The batch and streaming sides each require a different code base that must be maintained and kept in sync so that processed data produces the same result from both paths. Yet attempting to abstract the code bases into a single framework puts many of the specialized tools in the batch and real-time ecosystems out of reach.


Kappa architecture

Jay Kreps introduced the kappa architecture to use a pure streaming approach with a single code base. In a technical discussion over the merits of employing a pure streaming approach, it was noted that using a flexible streaming framework such as
Apache Samza Apache Samza is an open-source, near-realtime, asynchronous computational framework for stream processing developed by the Apache Software Foundation in Scala and Java. It has been developed in conjunction with Apache Kafka. Both were originally ...
could provide some of the same benefits as batch processing without the latency. Such a streaming framework could allow for collecting and processing arbitrarily large windows of data, accommodate blocking, and handle state.


See also

*
Event stream processing In computer science, stream processing (also known as event stream processing, data stream processing, or distributed stream processing) is a programming paradigm which views data streams, or sequences of events in time, as the central input and ou ...
*
AWS Lambda Amazon Web Services, Inc. (AWS) is a subsidiary of Amazon that provides on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered pay-as-you-go basis. These cloud computing web services provide di ...
, a specific service from Amazon Web Services, in which only actual execution time is charged


References

{{Reflist


External links


Repository of Information on Lambda of Architecture
Data processing Big data Data management Free software projects Software architecture