Apache Storm

Last updated

Apache Storm
Developer(s) Backtype, Twitter
Stable release
2.5.0 / 4 August 2023;5 months ago (2023-08-04) [1]
Repository Storm Repository
Written in Clojure & Java
Operating system Cross-platform
Type Distributed stream processing
License Apache License 2.0
Website storm.apache.org

Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz [2] and team at BackType, [3] the project was open sourced after being acquired by Twitter. [4] It uses custom created "spouts" and "bolts" to define information sources and manipulations to allow batch, distributed processing of streaming data. The initial release was on 17 September 2011. [5]

Contents

A Storm application is designed as a "topology" in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end. [6]

Storm became an Apache Top-Level Project in September 2014 [7] and was previously in incubation since September 2013. [8] [9]

Development

Apache Storm is developed under the Apache License, making it available to most companies to use. [10] Git is used for version control and Atlassian JIRA for issue tracking, under the Apache Incubator program.

Major Releases [11]
VersionRelease Date
2.5.04 Aug 2023
2.4.025 March 2022
2.3.027 September 2021
2.2.030 June 2020
2.1.06 September 2019
1.2.318 July 2019
2.0.030 May 2019
1.1.48 January 2019
1.2.24 June 2018
1.1.3
1.0.73 May 2018
1.2.119 February 2018
1.2.015 February 2018
1.1.2
1.0.614 February 2018
1.0.515 September 2017
1.1.11 August 2017
1.0.428 July 2017
1.1.029 Mar 2017
1.0.314 February 2017
0.10.214 September 2016
0.9.77 September 2016
1.0.210 August 2016
1.0.16 May 2016
0.10.15 May 2016
1.0.012 April 2016
0.10.05 November 2015
0.9.6
0.9.54 June 2015
0.9.425 March 2015
0.9.325 November 2014
0.9.225 June 2014
0.9.110 February 2014
Historical (non-Apache) VersionRelease Date
0.9.08 December 2013
0.8.211 January 2013
0.8.16 September 2012
0.8.02 August 2012
0.7.028 February 2012
0.6.015 December 2011
0.5.019 September 2011

Apache Storm architecture

The Apache Storm cluster comprises following critical components:

Peer platforms

Storm is but one of dozens of stream processing engines, for a more complete list see Stream processing. Twitter announced Heron on June 2, 2015 [13] which is API compatible with Storm. There are other comparable streaming data engines such as Spark Streaming and Flink. [14]

See also

Related Research Articles

Cascading is a software abstraction layer for Apache Hadoop and Apache Flink. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language, hiding the underlying complexity of MapReduce jobs. It is open source and available under the Apache License. Commercial support is available from Driven, Inc.

<span class="mw-page-title-main">Ganglia (software)</span> Distributed monitoring tool software

Ganglia is a scalable, distributed monitoring tool for high-performance computing systems, clusters and networks. The software is used to view either live or recorded statistics covering metrics such as CPU load averages or network utilization for many nodes.

In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

An object-based spatial database is a spatial database that stores the location as objects. The object-based spatial model treats the world as surface littered with recognizable objects, which exist independent of their locations.

<span class="mw-page-title-main">Apache Solr</span> Open-source enterprise-search platform

Solr is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is widely used for enterprise search and analytics use cases and has an active development community and regular releases.

Dryad was a research project at Microsoft Research for a general purpose runtime for execution of data parallel applications. The research prototypes of the Dryad and DryadLINQ data-parallel processing frameworks are available in source form at GitHub.

Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily on linear algebra. In the past, many of the implementations use the Apache Hadoop platform, however today it is primarily focused on Apache Spark. Mahout also provides Java/Scala libraries for common math operations and primitive Java collections. Mahout is a work in progress; a number of algorithms have been implemented.

<span class="mw-page-title-main">Apache ZooKeeper</span> System for distributed coordination

Apache ZooKeeper is an open-source server for highly reliable distributed coordination of cloud applications. It is a project of the Apache Software Foundation.

<span class="mw-page-title-main">Apache Hama</span>

Apache Hama is a distributed computing framework based on bulk synchronous parallel computing techniques for massive scientific computations e.g., matrix, graph and network algorithms. Originally a sub-project of Hadoop, it became an Apache Software Foundation top level project in 2012. It was created by Edward J. Yoon, who named it, and Hama also means hippopotamus in Yoon's native Korean language (하마), following the trend of naming Apache projects after animals and zoology. Hama was inspired by Google's Pregel large-scale graph computing framework described in 2010. When executing graph algorithms, Hama showed a fifty-fold performance increase relative to Hadoop.

<span class="mw-page-title-main">Oracle NoSQL Database</span> Distributed database

Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems via Kafka Connect, and provides the Kafka Streams libraries for stream processing applications. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This "leads to larger network packets, larger sequential disk operations, contiguous memory blocks [...] which allows Kafka to turn a bursty stream of random message writes into linear writes."

Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data. The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect that the architecture of the system can shift to solve different types of data problems.

<span class="mw-page-title-main">Apache Spark</span> Open-source data analytics cluster computing framework

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

<span class="mw-page-title-main">Apache Samza</span> Open-source distributed stream processing

Apache Samza is an open-source, near-realtime, asynchronous computational framework for stream processing developed by the Apache Software Foundation in Scala and Java. It has been developed in conjunction with Apache Kafka. Both were originally developed by LinkedIn.

<span class="mw-page-title-main">Lambda architecture</span> Data-processing architecture

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.

<span class="mw-page-title-main">Apache Flink</span> Framework and distributed processing engine

Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively.

<span class="mw-page-title-main">SNAMP</span>

SNAMP is an open-source, cross-platform software platform for telemetry, tracing and elasticity management of distributed applications.

StormCrawler is an open-source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under Apache License and is written mostly in Java.

References

  1. "Apache Storm 2.5.0 Released" . Retrieved 4 August 2023.
  2. Marz, Nathan. "About Nathan Marz". Nathan Marz. Retrieved 28 March 2013.
  3. "BackType Website (defunct)". BackType. Retrieved 28 March 2013.
  4. "A Storm is coming: more details and plans for release". Engineering Blog. Twitter Inc. Retrieved 29 July 2015.
  5. "Storm Codebase". Github. Retrieved 8 February 2013.
  6. "Tutorial - Components of a Storm cluster". Documentation. Apache Storm. Retrieved 29 July 2015.
  7. "Apache Storm Graduates to a Top-Level Project".
  8. "Storm Project Incubation Status". Apache Software Foundation. Retrieved 29 October 2013.
  9. "Storm Proposal". Apache Software Foundation. Retrieved 29 October 2013.
  10. "Powered By Storm". Documentation. Apache Storm. Retrieved 29 July 2015.
  11. "Apache Storm". storm.apache.org. Retrieved 18 August 2017.
  12. "STREAM PROCESSING BIG DATA PROCESSING" (PDF).
  13. "Flying faster with Twitter Heron". Engineering Blog. Twitter Inc. Retrieved 3 June 2015.
  14. Chintapalli, Sanket; Dagit, Derek; Evans, Bobby; Farivar, Reza; Graves, Thomas; Holderbaugh, Mark; Liu, Zhuo; Nusbaum, Kyle; Patil, Kishorkumar; Peng, Boyang Jerry; Poulosky, Paul (May 2016). "Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming". 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE. pp. 1789–1792. doi:10.1109/IPDPSW.2016.138. ISBN   978-1-5090-3682-0. S2CID   2180634.