Developer(s) | Backtype, Twitter |
---|---|
Stable release | 2.5.0 / 4 August 2023 [1] |
Repository | Storm Repository |
Written in | Clojure & Java |
Operating system | Cross-platform |
Type | Distributed stream processing |
License | Apache License 2.0 |
Website | storm |
Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz [2] and team at BackType, [3] the project was open sourced after being acquired by Twitter. [4] It uses custom created "spouts" and "bolts" to define information sources and manipulations to allow batch, distributed processing of streaming data. The initial release was on 17 September 2011. [5]
A Storm application is designed as a "topology" in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end. [6]
Storm became an Apache Top-Level Project in September 2014 [7] and was previously in incubation since September 2013. [8] [9]
Apache Storm is developed under the Apache License, making it available to most companies to use. [10] Git is used for version control and Atlassian JIRA for issue tracking, under the Apache Incubator program.
Version | Release Date |
---|---|
2.5.0 | 4 Aug 2023 |
2.4.0 | 25 March 2022 |
2.3.0 | 27 September 2021 |
2.2.0 | 30 June 2020 |
2.1.0 | 6 September 2019 |
1.2.3 | 18 July 2019 |
2.0.0 | 30 May 2019 |
1.1.4 | 8 January 2019 |
1.2.2 | 4 June 2018 |
1.1.3 | |
1.0.7 | 3 May 2018 |
1.2.1 | 19 February 2018 |
1.2.0 | 15 February 2018 |
1.1.2 | |
1.0.6 | 14 February 2018 |
1.0.5 | 15 September 2017 |
1.1.1 | 1 August 2017 |
1.0.4 | 28 July 2017 |
1.1.0 | 29 Mar 2017 |
1.0.3 | 14 February 2017 |
0.10.2 | 14 September 2016 |
0.9.7 | 7 September 2016 |
1.0.2 | 10 August 2016 |
1.0.1 | 6 May 2016 |
0.10.1 | 5 May 2016 |
1.0.0 | 12 April 2016 |
0.10.0 | 5 November 2015 |
0.9.6 | |
0.9.5 | 4 June 2015 |
0.9.4 | 25 March 2015 |
0.9.3 | 25 November 2014 |
0.9.2 | 25 June 2014 |
0.9.1 | 10 February 2014 |
Historical (non-Apache) Version | Release Date |
0.9.0 | 8 December 2013 |
0.8.2 | 11 January 2013 |
0.8.1 | 6 September 2012 |
0.8.0 | 2 August 2012 |
0.7.0 | 28 February 2012 |
0.6.0 | 15 December 2011 |
0.5.0 | 19 September 2011 |
The Apache Storm cluster comprises following critical components:
Storm is but one of dozens of stream processing engines, for a more complete list see Stream processing. Twitter announced Heron on June 2, 2015 [13] which is API compatible with Storm. There are other comparable streaming data engines such as Spark Streaming and Flink. [14]
Cascading is a software abstraction layer for Apache Hadoop and Apache Flink. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language, hiding the underlying complexity of MapReduce jobs. It is open source and available under the Apache License. Commercial support is available from Driven, Inc.
Ganglia is a scalable, distributed monitoring tool for high-performance computing systems, clusters and networks. The software is used to view either live or recorded statistics covering metrics such as CPU load averages or network utilization for many nodes.
In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
An object-based spatial database is a spatial database that stores the location as objects. The object-based spatial model treats the world as surface littered with recognizable objects, which exist independent of their locations.
Solr is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is widely used for enterprise search and analytics use cases and has an active development community and regular releases.
Dryad was a research project at Microsoft Research for a general purpose runtime for execution of data parallel applications. The research prototypes of the Dryad and DryadLINQ data-parallel processing frameworks are available in source form at GitHub.
Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily on linear algebra. In the past, many of the implementations use the Apache Hadoop platform, however today it is primarily focused on Apache Spark. Mahout also provides Java/Scala libraries for common math operations and primitive Java collections. Mahout is a work in progress; a number of algorithms have been implemented.
Apache ZooKeeper is an open-source server for highly reliable distributed coordination of cloud applications. It is a project of the Apache Software Foundation.
Apache Hama is a distributed computing framework based on bulk synchronous parallel computing techniques for massive scientific computations e.g., matrix, graph and network algorithms. Originally a sub-project of Hadoop, it became an Apache Software Foundation top level project in 2012. It was created by Edward J. Yoon, who named it, and Hama also means hippopotamus in Yoon's native Korean language (하마), following the trend of naming Apache projects after animals and zoology. Hama was inspired by Google's Pregel large-scale graph computing framework described in 2010. When executing graph algorithms, Hama showed a fifty-fold performance increase relative to Hadoop.
Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.
Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems via Kafka Connect, and provides the Kafka Streams libraries for stream processing applications. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This "leads to larger network packets, larger sequential disk operations, contiguous memory blocks [...] which allows Kafka to turn a bursty stream of random message writes into linear writes."
Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data. The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect that the architecture of the system can shift to solve different types of data problems.
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
Apache Samza is an open-source, near-realtime, asynchronous computational framework for stream processing developed by the Apache Software Foundation in Scala and Java. It has been developed in conjunction with Apache Kafka. Both were originally developed by LinkedIn.
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.
Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively.
SNAMP is an open-source, cross-platform software platform for telemetry, tracing and elasticity management of distributed applications.
StormCrawler is an open-source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under Apache License and is written mostly in Java.