Developer(s) | Apache Software Foundation |
---|---|
Initial release | May 2011 |
Stable release | |
Repository | |
Written in | Java and Scala |
Operating system | Cross-platform |
Type |
|
License | Apache License 2.0 |
Website | flink |
Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. [3] [4] Flink executes arbitrary dataflow programs in a data-parallel and pipelined (hence task parallel) manner. [5] Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. [6] [7] Furthermore, Flink's runtime supports the execution of iterative algorithms natively. [8]
Flink provides a high-throughput, low-latency streaming engine [9] as well as support for event-time processing and state management. Flink applications are fault-tolerant in the event of machine failure and support exactly-once semantics. [10] Programs can be written in Java, Scala, [11] Python, [12] and SQL [13] and are automatically compiled and optimized [14] into dataflow programs that are executed in a cluster or cloud environment. [15]
Flink does not provide its own data-storage system, but provides data-source and sink connectors to systems such as Apache Doris, Amazon Kinesis, Apache Kafka, HDFS, Apache Cassandra, and ElasticSearch. [16]
Apache Flink is developed under the Apache License 2.0 [17] by the Apache Flink Community within the Apache Software Foundation. The project is driven by over 25 committers and over 340 contributors.
Apache Flink's dataflow programming model provides event-at-a-time processing on both finite and infinite datasets. At a basic level, Flink programs consist of streams and transformations. “Conceptually, a stream is a (potentially never-ending) flow of data records, and a transformation is an operation that takes one or more streams as input, and produces one or more output streams as a result.” [18]
Apache Flink includes two core APIs: a DataStream API for bounded or unbounded streams of data and a DataSet API for bounded data sets. Flink also offers a Table API, which is a SQL-like expression language for relational stream and batch processing that can be easily embedded in Flink's DataStream and DataSet APIs. The highest-level language supported by Flink is SQL, which is semantically similar to the Table API and represents programs as SQL query expressions.
Upon execution, Flink programs are mapped to streaming dataflows. [18] Every Flink dataflow starts with one or more sources (a data input, e.g., a message queue or a file system) and ends with one or more sinks (a data output, e.g., a message queue, file system, or database). An arbitrary number of transformations can be performed on the stream. These streams can be arranged as a directed, acyclic dataflow graph, allowing an application to branch and merge dataflows.
Flink offers ready-built source and sink connectors with Apache Kafka, Amazon Kinesis, [19] HDFS, Apache Cassandra, and more. [16]
Flink programs run as a distributed system within a cluster and can be deployed in a standalone mode as well as on YARN, Mesos, Docker-based setups along with other resource management frameworks. [20]
Apache Flink includes a lightweight fault tolerance mechanism based on distributed checkpoints. [10] A checkpoint is an automatic, asynchronous snapshot of the state of an application and the position in a source stream. In the case of a failure, a Flink program with checkpointing enabled will, upon recovery, resume processing from the last completed checkpoint, ensuring that Flink maintains exactly-once state semantics within an application. The checkpointing mechanism exposes hooks for application code to include external systems into the checkpointing mechanism as well (like opening and committing transactions with a database system).
Flink also includes a mechanism called savepoints, which are manually-triggered checkpoints. [21] A user can generate a savepoint, stop a running Flink program, then resume the program from the same application state and position in the stream. Savepoints enable updates to a Flink program or a Flink cluster without losing the application's state . As of Flink 1.2, savepoints also allow to restart an application with a different parallelism—allowing users to adapt to changing workloads.
Flink's DataStream API enables transformations (e.g. filters, aggregations, window functions) on bounded or unbounded streams of data. The DataStream API includes more than 20 different types of transformations and is available in Java and Scala. [22]
A simple example of a stateful stream processing program is an application that emits a word count from a continuous input stream and groups the data in 5-second windows:
importorg.apache.flink.streaming.api.scala._importorg.apache.flink.streaming.api.windowing.time.TimecaseclassWordCount(word:String,count:Int)objectWindowWordCount{defmain(args:Array[String]){valenv=StreamExecutionEnvironment.getExecutionEnvironmentvaltext=env.socketTextStream("localhost",9999)valcounts=text.flatMap{_.toLowerCase.split("\\W+")filter{_.nonEmpty}}.map{WordCount(_,1)}.keyBy("word").timeWindow(Time.seconds(5)).sum("count")counts.printenv.execute("Window Stream WordCount")}}
Apache Beam “provides an advanced unified programming model, allowing (a developer) to implement batch and streaming data processing jobs that can run on any execution engine.” [23] The Apache Flink-on-Beam runner is the most feature-rich according to a capability matrix maintained by the Beam community. [24]
data Artisans, in conjunction with the Apache Flink community, worked closely with the Beam community to develop a Flink runner. [25]
Flink's DataSet API enables transformations (e.g., filters, mapping, joining, grouping) on bounded datasets. The DataSet API includes more than 20 different types of transformations. [26] The API is available in Java, Scala and an experimental Python API. Flink's DataSet API is conceptually similar to the DataStream API.
Flink's Table API is a SQL-like expression language for relational stream and batch processing that can be embedded in Flink's Java and Scala DataSet and DataStream APIs. The Table API and SQL interface operate on a relational Table abstraction. Tables can be created from external data sources or from existing DataStreams and DataSets. The Table API supports relational operators such as selection, aggregation, and joins on Tables.
Tables can also be queried with regular SQL. The Table API and SQL offer equivalent functionality and can be mixed in the same program. When a Table is converted back into a DataSet or DataStream, the logical plan, which was defined by relational operators and SQL queries, is optimized using Apache Calcite and is transformed into a DataSet or DataStream program. [27]
Flink Forward is an annual conference about Apache Flink. The first edition of Flink Forward took place in 2015 in Berlin. The two-day conference had over 250 attendees from 16 countries. Sessions were organized in two tracks with over 30 technical presentations from Flink developers and one additional track with hands-on Flink training.
In 2016, 350 participants joined the conference and over 40 speakers presented technical talks in 3 parallel tracks. On the third day, attendees were invited to participate in hands-on training sessions.
In 2017, the event expands to San Francisco, as well. The conference day is dedicated to technical talks on how Flink is used in the enterprise, Flink system internals, ecosystem integrations with Flink, and the future of the platform. It features keynotes, talks from Flink users in industry and academia, and hands-on training sessions on Apache Flink.
In 2020, following the COVID-19 pandemic, Flink Forward's spring edition which was supposed to be hosted in San Francisco was canceled. Instead, the conference was hosted virtually, starting on April 22 and concluding on April 24, featuring live keynotes, Flink use cases, Apache Flink internals, and other topics on stream processing and real-time analytics. [28]
In 2010, the research project "Stratosphere: Information Management on the Cloud" [29] led by Volker Markl (funded by the German Research Foundation (DFG)) [30] was started as a collaboration of Technische Universität Berlin, Humboldt-Universität zu Berlin, and Hasso-Plattner-Institut Potsdam. Flink started from a fork of Stratosphere's distributed execution engine and it became an Apache Incubator project in March 2014. [31] In December 2014, Flink was accepted as an Apache top-level project. [32] [33] [34] [35]
Version | Original release date | Latest version | Release date | |
---|---|---|---|---|
0.9 | 2015-06-24 | 0.9.1 | 2015-09-01 | |
0.10 | 2015-11-16 | 0.10.2 | 2016-02-11 | |
1.0 | 2016-03-08 | 1.0.3 | 2016-05-11 | |
1.1 | 2016-08-08 | 1.1.5 | 2017-03-22 | |
1.2 | 2017-02-06 | 1.2.1 | 2017-04-26 | |
1.3 | 2017-06-01 | 1.3.3 | 2018-03-15 | |
1.4 | 2017-12-12 | 1.4.2 | 2018-03-08 | |
1.5 | 2018-05-25 | 1.5.6 | 2018-12-26 | |
1.6 | 2018-08-08 | 1.6.3 | 2018-12-22 | |
1.7 | 2018-11-30 | 1.7.2 | 2019-02-15 | |
1.8 | 2019-04-09 | 1.8.3 | 2019-12-11 | |
1.9 | 2019-08-22 | 1.9.2 | 2020-01-30 | |
1.10 | 2020-02-11 | 1.10.3 | 2021-01-29 | |
1.11 | 2020-07-06 | 1.11.6 | 2021-12-16 | |
1.12 | 2020-12-10 | 1.12.7 | 2021-12-16 | |
1.13 | 2021-05-03 | 1.13.6 | 2022-02-18 | |
1.14 | 2021-09-29 | 1.14.6 | 2022-09-28 | |
1.15 | 2022-05-05 | 1.15.4 | 2023-03-15 | |
1.16 | 2022-10-28 | 1.16.3 | 2023-11-29 | |
1.17 | 2023-03-23 | 1.17.2 | 2023-11-29 | |
1.18 | 2023-10-24 | 1.18.0 | 2023-10-24 | |
1.19 | 2024-03-18 | 1.19.0 | 2024-03-18 | |
Legend: Old version Older version, still maintained Latest version |
Release Dates
Apache Incubator Release Dates
Pre-Apache Stratosphere Release Dates
The 1.14.1, 1.13.4, 1.12.6, 1.11.5 releases, which were supposed to only contain a Log4j upgrade to 2.15.0, were skipped because CVE- 2021-45046 was discovered during the release publication. [36]
In computing, a visual programming language, also known as diagrammatic programming, graphical programming or block coding, is a programming language that lets users create programs by manipulating program elements graphically rather than by specifying them textually. A VPL allows programming with visual expressions, spatial arrangements of text and graphic symbols, used either as elements of syntax or secondary notation. For example, many VPLs are based on the idea of "boxes and arrows", where boxes or other screen objects are treated as entities, connected by arrows, lines or arcs which represent relations. VPLs are generally the basis of Low-code development platforms.
In computer programming, dataflow programming is a programming paradigm that models a program as a directed graph of the data flowing between operations, thus implementing dataflow principles and architecture. Dataflow programming languages share some features of functional languages, and were generally developed in order to bring some functional concepts to a language more suitable for numeric processing. Some authors use the term datastream instead of dataflow to avoid confusion with dataflow computing or dataflow architecture, based on an indeterministic machine paradigm. Dataflow programming was pioneered by Jack Dennis and his graduate students at MIT in the 1960s.
The Web Server Gateway Interface is a simple calling convention for web servers to forward requests to web applications or frameworks written in the Python programming language. The current version of WSGI, version 1.0.1, is specified in Python Enhancement Proposal (PEP) 3333.
In computing, a materialized view is a database object that contains the results of a query. For example, it may be a local copy of data located remotely, or may be a subset of the rows and/or columns of a table or join result, or may be a summary using an aggregate function.
Apache Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low latency operations for all clients. Cassandra was designed to implement a combination of Amazon's Dynamo distributed storage and replication techniques combined with Google's Bigtable data and storage engine model.
Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's Dremel system. Drill is an Apache top-level project. Tom Shiran is the founder of the Apache Drill Project. It was designated an Apache Software Foundation top-level project in December 2016.
Akka is a source-available toolkit and runtime simplifying the construction of concurrent and distributed applications on the JVM. Akka supports multiple programming models for concurrency, but it emphasizes actor-based concurrency, with inspiration drawn from Erlang.
Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and team at BackType, the project was open sourced after being acquired by Twitter. It uses custom created "spouts" and "bolts" to define information sources and manipulations to allow batch, distributed processing of streaming data. The initial release was on 17 September 2011.
Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems via Kafka Connect, and provides the Kafka Streams libraries for stream processing applications. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This "leads to larger network packets, larger sequential disk operations, contiguous memory blocks [...] which allows Kafka to turn a bursty stream of random message writes into linear writes."
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
Google Cloud Platform (GCP) is a suite of cloud computing services offered by Google that provides a series of modular cloud services including computing, data storage, data analytics, and machine learning, alongside a set of management tools. It runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, and Google Docs, according to Verma et al. Registration requires a credit card or bank account details.
Apache Samza is an open-source, near-realtime, asynchronous computational framework for stream processing developed by the Apache Software Foundation in Scala and Java. It has been developed in conjunction with Apache Kafka. Both were originally developed by LinkedIn.
Syncthing is a free and open source peer-to-peer file synchronization application available for Windows, macOS, Linux, Android, Solaris, Darwin, and BSD. It can sync files between devices on a local network, or between remote devices over the Internet. Data security and data safety are built into its design. Version 1.0 was released in January 2019 after five years in beta.
Apache Calcite is an open source framework for building databases and data management systems. It includes a SQL parser, an API for building expressions in relational algebra, and a query planning engine. As a framework, Calcite does not store its own data or metadata, but instead allows external data and metadata to be accessed by means of plug-ins.
Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio supporting extremely large datasets.
RocksDB is a high performance embedded database for key-value data. It is a fork of Google's LevelDB optimized to exploit multi-core processors (CPUs), and make efficient use of fast storage, such as solid-state drives (SSD), for input/output (I/O) bound workloads. It is based on a log-structured merge-tree data structure. It is written in C++ and provides official language bindings for C++, C, and Java. Many third-party language bindings exist. RocksDB is free and open-source software, released originally under a BSD 3-clause license. However, in July 2017 the project was migrated to a dual license of both Apache 2.0 and GPLv2 license. This change helped its adoption in Apache Software Foundation's projects after blacklist of the previous BSD+Patents license clause.
Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing. Beam Pipelines are defined using one of the provided SDKs and executed in one of the Beam’s supported runners including Apache Flink, Apache Samza, Apache Spark, and Google Cloud Dataflow.
Reynold Xin is a computer scientist and engineer specializing in big data, distributed systems, and cloud computing. He is a co-founder and Chief Architect of Databricks. He is best known for his work on Apache Spark, a leading open-source Big Data project. He was designer and lead developer of the GraphX, Project Tungsten, and Structured Streaming components and he co-designed DataFrames, all of which are part of the core Apache Spark distribution; he also served as the release manager for Spark's 2.0 release.
Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. Dataflow provides a fully managed service for executing Apache Beam pipelines, offering features like autoscaling, dynamic work rebalancing, and a managed execution environment.