Apache Airflow

Last updated
Apache Airflow
Original author(s) Maxime Beauchemin / Airbnb
Developer(s) Apache Software Foundation
Initial release June 3, 2015;9 years ago (2015-06-03)
Stable release 2.8.2 [1]   OOjs UI icon edit-ltr-progressive.svg (26 February 2024;5 months ago (26 February 2024)) [±]
Repository
Written in Python
Operating system Windows, macOS, Linux
Type Workflow management platform
License Apache License 2.0
Website airflow.apache.org

Apache Airflow is an open-source workflow management platform for data engineering pipelines. It started at Airbnb in October 2014 [2] as a solution to manage the company's increasingly complex workflows. Creating Airflow allowed Airbnb to programmatically author and schedule their workflows and monitor them via the built-in Airflow user interface. [3] [4] From the beginning, the project was made open source, becoming an Apache Incubator project in March 2016 and a top-level Apache Software Foundation project in January 2019.

Contents

Airflow is written in Python, and workflows are created via Python scripts. Airflow is designed under the principle of "configuration as code". While other "configuration as code" workflow platforms exist using markup languages like XML, using Python allows developers to import libraries and classes to help them create their workflows.

Overview

Airflow uses directed acyclic graphs (DAGs) to manage workflow orchestration. Tasks and dependencies are defined in Python and then Airflow manages the scheduling and execution. DAGs can be run either on a defined schedule (e.g. hourly or daily) or based on external event triggers (e.g. a file appearing in Hive [5] ). Previous DAG-based schedulers like Oozie and Azkaban tended to rely on multiple configuration files and file system trees to create a DAG, whereas in Airflow, DAGs can often be written in one Python file. [6]

Managed providers

Three notable providers offer ancillary services around the core open source project.

Related Research Articles

Data engineering refers to the building of systems to enable the collection and usage of data. This data is usually used to enable subsequent analysis and data science; which often involves machine learning. Making the data usable usually involves substantial compute and storage, as well as data processing.

WebSphere Application Server (WAS) is a software product that performs the role of a web application server. More specifically, it is a software framework and middleware that hosts Java-based web applications. It is the flagship product within IBM's WebSphere software suite. It was initially created by Donald F. Ferguson, who later became CTO of Software for Dell. The first version was launched in 1998. This project was an offshoot from IBM HTTP Server team starting with the Domino Go web server.

In computing, a solution stack or software stack is a set of software subsystems or components needed to create a complete platform such that no additional software is needed to support applications. Applications are said to "run on" or "run on top of" the resulting platform.


This is a comparison of notable free and open-source configuration management software, suitable for tasks like server configuration, orchestration and infrastructure as code typically performed by a system administrator.

Azure DevOps Server, formerly known as Team Foundation Server (TFS) and Visual Studio Team System (VSTS), is a Microsoft product that provides version control, reporting, requirements management, project management, automated builds, testing and release management capabilities. It covers the entire application lifecycle and enables DevOps capabilities. Azure DevOps can be used as a back-end to numerous integrated development environments (IDEs) but is tailored for Microsoft Visual Studio and Eclipse on all platforms.

<span class="mw-page-title-main">WaveMaker</span> Low-code programming platform

WaveMaker is a Java-based low-code development platform designed for building software applications and platforms. The company, WaveMaker Inc., is based in Mountain View, California. The platform is intended to assist enterprises in speeding up their application development and IT modernization initiatives through low-code capabilities. Additionally, for independent software vendors (ISVs), WaveMaker serves as a customizable low-code component that integrates into their products.

<span class="mw-page-title-main">AppScale</span> American cloud infrastructure software company

AppScale is a software company that offers cloud infrastructure software and services to enterprises, government agencies, contractors, and third-party service providers. The company commercially supports one software product, AppScale ATS, a managed hybrid cloud infrastructure software platform that emulates the core AWS APIs. In 2019, the company ended commercial support for its open-source serverless computing platform AppScale GTS, but AppScale GTS source code remains freely available to the open-source community.

<span class="mw-page-title-main">OpenStack</span> Cloud computing software

OpenStack is a free, open standard cloud computing platform. It is mostly deployed as infrastructure-as-a-service (IaaS) in both public and private clouds where virtual servers and other resources are made available to users. The software platform consists of interrelated components that control diverse, multi-vendor hardware pools of processing, storage, and networking resources throughout a data center. Users manage it either through a web-based dashboard, through command-line tools, or through RESTful web services.

<span class="mw-page-title-main">TACTIC (web framework)</span> Web-based, open source workflow platform and digital asset management system

TACTIC is a web-based, open source workflow platform and digital asset management system supported by Southpaw Technology in Toronto, ON. Designed to optimize busy production environments with high volumes of content traffic, TACTIC applies business or workflow logic to combined database and file system management. Using elements of digital asset management, production asset management and workflow management, TACTIC tracks the creation and development of digital assets through production pipelines. TACTIC is available under both commercial and open-source licenses, and also as a hosted cloud service through Amazon Web Services Marketplace.

<span class="mw-page-title-main">Apache OODT</span>

The Apache Object Oriented Data Technology (OODT) is an open source data management system framework that is managed by the Apache Software Foundation. OODT was originally developed at NASA Jet Propulsion Laboratory to support capturing, processing and sharing of data for NASA's scientific archives.

<span class="mw-page-title-main">Elasticsearch</span> Search engine

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is dual-licensed under the (source-available) Server Side Public License and the Elastic license, while other parts fall under the proprietary (source-available) Elastic License. Official clients are available in Java, .NET (C#), PHP, Python, Ruby and many other languages. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine.

<span class="mw-page-title-main">OpenShift</span> Cloud computing software

OpenShift is a family of containerization software products developed by Red Hat. Its flagship product is the OpenShift Container Platform — a hybrid cloud platform as a service built around Linux containers orchestrated and managed by Kubernetes on a foundation of Red Hat Enterprise Linux. The family's other products provide this platform through different environments: OKD serves as the community-driven upstream, Several deployment methods are available including self-managed, cloud native under ROSA, ARO and RHOIC on AWS, Azure, and IBM Cloud respectively, OpenShift Online as software as a service, and OpenShift Dedicated as a managed service.

AWS Elastic Beanstalk is an orchestration service offered by Amazon Web Services for deploying applications which orchestrates various AWS services, including EC2, S3, Simple Notification Service (SNS), CloudWatch, autoscaling, and Elastic Load Balancers. Elastic Beanstalk provides an additional layer of abstraction over the bare server and OS; users instead see a pre-built combination of OS and platform, such as "64bit Amazon Linux 2014.03 v1.1.0 running Ruby 2.0 (Puma)" or "64bit Debian jessie v2.0.7 running Python 3.4 ". Deployment requires a number of components to be defined: an 'application' as a logical container for the project, a 'version' which is a deployable build of the application executable, a 'configuration template' that contains configuration information for both the Beanstalk environment and for the product. Finally an 'environment' combines a 'version' with a 'configuration' and deploys them. Executables themselves are uploaded as archive files to S3 beforehand and the 'version' is just a pointer to this.

Google Cloud Platform (GCP) is a suite of cloud computing services offered by Google that provides a series of modular cloud services including computing, data storage, data analytics, and machine learning, alongside a set of management tools. It runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, and Google Docs, according to Verma et al. Registration requires a credit card or bank account details.

<span class="mw-page-title-main">Apache Mesos</span> Software to manage computer clusters

Apache Mesos is an open-source project to manage computer clusters. It was developed at the University of California, Berkeley.

Serverless computing is a cloud computing execution model in which the cloud provider allocates machine resources on demand, taking care of the servers on behalf of their customers. "Serverless" is a misnomer in the sense that servers are still used by cloud service providers to execute code for developers. However, developers of serverless applications are not concerned with capacity planning, configuration, management, maintenance, fault tolerance, or scaling of containers, virtual machines, or physical servers. When an app is not in use, there are no computing resources allocated to the app. Pricing is based on the actual amount of resources consumed by an application. It can be a form of utility computing.

References

  1. https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html#airflow-2-8-2-2024-02-26.{{cite web}}: Missing or empty |title= (help)
  2. "Apache Airflow". Apache Airflow. Archived from the original on August 12, 2019. Retrieved September 30, 2019.
  3. Beauchemin, Maxime (June 2, 2015). "Airflow: a workflow management platform". Medium. Archived from the original on August 13, 2019. Retrieved September 30, 2019.
  4. "Airflow". Archived from the original on July 6, 2019. Retrieved September 30, 2019.
  5. Trencseni, Marton (January 16, 2016). "Airflow review". BytePawn. Archived from the original on February 28, 2019. Retrieved October 1, 2019.
  6. "AirflowProposal". Apache Software Foundation. March 28, 2019. Retrieved October 1, 2019.
  7. Lipp, Cassie (July 13, 2018). "Astronomer is Now the Apache Airflow Company". americaninno. Retrieved September 18, 2019.
  8. "Google launches Cloud Composer, a new workflow automation tool for developers". TechCrunch. Retrieved 2019-09-18.
  9. "Introducing Amazon Managed Workflows for Apache Airflow (MWAA)". Amazon Web Services. 2020-11-24. Retrieved 2020-12-17.