Developer(s) | Apache Software Foundation |
---|---|
Final release | |
Repository | Apex Repository |
Written in | Java and Scala |
Operating system | Cross-platform |
Type | Stream processing, Batch processing |
License | Apache License 2.0 |
Website | apex |
Apache Apex is a YARN-native platform that unifies stream and batch processing. It processes big data-in-motion in a way that is scalable, performant, fault-tolerant, stateful, secure, distributed, and easily operable. [3]
Apache Apex was named a top-level project by The Apache Software Foundation on April 25, 2016. [4] [5] As of September 2019, it is no longer actively developed. [2]
Apache Apex is licensed under the Apache License 2.0. [6] The project was driven by DataTorrent, a start-up company based in San Jose, California.
Apache Apex has two parts: Apex Core and Apex Malhar. Apex Core is the platform, or framework, for building distributed applications on Hadoop. The core platform is supplemented by Malhar, a library of connector and logic operators that enables rapid application development. Malhar's input and output operators provide templates for connecting to sources and sinks such as Alluxio, S3, HDFS, NFS, FTP, Kafka, ActiveMQ, RabbitMQ, JMS, Cassandra, MongoDB, Redis, HBase, CouchDB, generic JDBC, and other database connectors.
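The split between connector operators and logic operators can be illustrated with a minimal, dependency-free Java sketch. It does not use the Apex or Malhar API: the class `OperatorPipelineSketch` and the methods `readSource`, `transform`, and `run` are hypothetical names chosen for illustration, and an in-memory list stands in for a real source and sink such as Kafka or HDFS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Illustrative only: plain Java, NOT the Apache Apex or Malhar API.
public class OperatorPipelineSketch {

    // Hypothetical input operator: emits each record of an in-memory source downstream.
    public static void readSource(List<String> source, Consumer<String> downstream) {
        for (String record : source) {
            downstream.accept(record);
        }
    }

    // Hypothetical logic operator: applies a function to each tuple and forwards the result.
    public static <I, O> Consumer<I> transform(Function<I, O> logic, Consumer<O> downstream) {
        return tuple -> downstream.accept(logic.apply(tuple));
    }

    // Wires source -> logic -> sink and returns everything the sink collected.
    public static List<Integer> run(List<String> lines) {
        List<Integer> collected = new ArrayList<>();
        Consumer<Integer> sink = collected::add;              // stands in for an output operator
        Consumer<String> lengthOf = transform(String::length, sink);
        readSource(lines, lengthOf);
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("apex", "malhar", "core"))); // prints [4, 6, 4]
    }
}
```

In this sketch, swapping the source or the sink leaves the logic step untouched, which is the kind of reuse a library of ready-made connectors is meant to provide.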
DataTorrent had developed the platform since 2012 and then decided to open-source the core, which became Apache Apex. [7] It entered incubation in August 2015 and became an Apache Software Foundation top-level project within eight months. DataTorrent itself shut down in May 2018. [8]
Module | Version | Release date |
---|---|---|
Apex Core | 3.7.0 | 27 April 2018 |
Apex Malhar | 3.8.0 | 11 November 2017 |
Apex Core | 3.6.0 | 5 May 2017 |
Apex Malhar | 3.7.0 | 31 March 2017 |
Apex Core | 3.5.0 | 12 December 2016 |
Apex Malhar | 3.6.0 | 9 December 2016 |
Apex Malhar | 3.5.0 | 3 September 2016 |
Apex Core and Malhar | 3.4.0 | 5 May 2016 |
Apex Big Data World [9] is a conference about Apache Apex. The first conferences took place in 2017, in Pune, India and Mountain View, California, USA.
Apache Derby is a relational database management system (RDBMS) developed by the Apache Software Foundation that can be embedded in Java programs and used for online transaction processing. It has a 3.5 MB disk-space footprint.
Eclipse Jetty is a Java web server and Java Servlet container. While web servers are usually associated with serving documents to people, Jetty is now often used for machine to machine communications, usually within larger software frameworks. Jetty is developed as a free and open source project as part of the Eclipse Foundation. The web server is used in products such as Apache ActiveMQ, Alfresco, Scalatra, Apache Geronimo, Apache Maven, Apache Spark, Google App Engine, Eclipse, FUSE, iDempiere, Twitter's Streaming API and Zimbra. Jetty is also the server in open source projects such as Lift, Eucalyptus, OpenNMS, Red5, Hadoop and I2P. Jetty supports the latest Java Servlet API as well as protocols HTTP/2 and WebSocket.
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
Apache CouchDB is an open-source document-oriented NoSQL database, implemented in Erlang.
HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.
Apache ZooKeeper is an open-source server for highly reliable distributed coordination of cloud applications. It is a project of the Apache Software Foundation.
Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers at MapR, Drill is inspired by Google's Dremel system. Tomer Shiran is the founder of the Apache Drill project. It was designated an Apache Software Foundation top-level project in December 2014.
DataStax, Inc. is a company based in Santa Clara, California that provides real-time data services for AI. Its product Astra DB is a cloud database-as-a-service based on Apache Cassandra. DataStax also offers DataStax Enterprise (DSE), an on-premises database built on Apache Cassandra, and Astra Streaming, a messaging and event streaming cloud service based on Apache Pulsar. As of June 2022, the company had roughly 800 customers in over 50 countries.
Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and team at BackType, the project was open sourced after being acquired by Twitter. It uses custom created "spouts" and "bolts" to define information sources and manipulations to allow batch, distributed processing of streaming data. The initial release was on 17 September 2011.
Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems via Kafka Connect, and provides the Kafka Streams libraries for stream processing applications. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This "leads to larger network packets, larger sequential disk operations, contiguous memory blocks [...] which allows Kafka to turn a bursty stream of random message writes into linear writes."
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively.
Presto is a distributed query engine for big data using the SQL query language. Its architecture allows users to query data sources such as Hadoop, Cassandra, Kafka, AWS S3, Alluxio, MySQL, MongoDB and Teradata, and allows use of multiple data sources within a query. Presto is community-driven open-source software released under the Apache License.
Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio supporting extremely large datasets.
RocksDB is a high-performance embedded database for key-value data. It is a fork of Google's LevelDB optimized to exploit multi-core processors (CPUs) and to make efficient use of fast storage, such as solid-state drives (SSD), for input/output (I/O) bound workloads. It is based on a log-structured merge-tree data structure. It is written in C++ and provides official language bindings for C++, C, and Java. Many third-party language bindings exist. RocksDB is free and open-source software, originally released under a BSD 3-clause license. In July 2017, however, the project moved to a dual license of Apache 2.0 and GPLv2. This change helped its adoption in Apache Software Foundation projects after the previous BSD+Patents license had been blacklisted.
Apache SINGA is an Apache top-level project for developing an open source machine learning library. It provides a flexible architecture for scalable distributed training, is extensible to run over a wide range of hardware, and has a focus on health-care applications.
RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability. It is the third-generation distributed messaging middleware open-sourced by Alibaba in 2012. On November 21, 2016, Alibaba donated RocketMQ to the Apache Software Foundation. The following year, on February 20, the Apache Software Foundation announced Apache RocketMQ as a Top-Level Project.
TiDB is an open-source NewSQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. Designed to be MySQL compatible, it is developed and supported primarily by PingCAP and licensed under Apache 2.0. It is also available as a paid product. TiDB drew its initial design inspiration from Google's Spanner and F1 papers.