Apache Airflow

Original author(s) Maxime Beauchemin / Airbnb
Developer(s) Apache Software Foundation
Initial release June 3, 2015
Stable release 2.8.2 (26 February 2024)
Written in Python
Operating system Windows, macOS, Linux
Type Workflow management platform
License Apache License 2.0
Website airflow.apache.org

Apache Airflow is an open-source workflow management platform for data engineering pipelines. It started at Airbnb in October 2014 [2] as a way to manage the company's increasingly complex workflows. Airflow allowed Airbnb to programmatically author and schedule its workflows and to monitor them through the built-in Airflow user interface. [3][4] The project was open source from the beginning; it became an Apache Incubator project in March 2016 and a top-level Apache Software Foundation project in January 2019.

Airflow is written in Python, and workflows are defined in Python scripts. It follows the principle of "configuration as code". Other "configuration as code" workflow platforms use markup languages such as XML; using Python instead lets developers import libraries and classes when building their workflows.
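
To illustrate what defining workflows in a programming language rather than in markup makes possible, here is a minimal sketch in plain Python. It does not use Airflow's actual API: the `build_workflow` helper and the table names are hypothetical. The point is only that ordinary language features such as loops and functions can generate a workflow definition that a static markup file would have to spell out entry by entry.

```python
# Hypothetical sketch of "configuration as code" in plain Python.
# This is NOT Airflow's API; all names are invented for illustration.

def build_workflow(tables):
    """Return a mapping of task name -> set of upstream task names."""
    workflow = {"start": set()}
    for table in tables:
        extract = f"extract_{table}"
        load = f"load_{table}"
        workflow[extract] = {"start"}  # every extract waits for "start"
        workflow[load] = {extract}     # each load waits for its own extract
    # A final report runs only after every load task has finished.
    workflow["report"] = {f"load_{table}" for table in tables}
    return workflow


workflow = build_workflow(["users", "orders", "payments"])
for task, upstream in workflow.items():
    print(task, "<-", sorted(upstream))
```

Adding a fourth table here means changing one list, whereas a hand-written configuration file would need new entries for each generated task.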

Overview

Airflow [5] uses directed acyclic graphs (DAGs) to manage workflow orchestration. Tasks and dependencies are defined in Python, and Airflow then manages their scheduling and execution. DAGs can run either on a defined schedule (e.g. hourly or daily) or in response to external event triggers (e.g. a file appearing in Hive [6]). Earlier DAG-based schedulers such as Oozie and Azkaban tended to rely on multiple configuration files and file system trees to define a DAG, whereas an Airflow DAG can often be written in a single Python file. [7]
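
The idea of running tasks in an order that respects a DAG can be sketched with Python's standard-library `graphlib` module (Python 3.9 or later). This is an illustration of the general technique only, not Airflow's scheduler or API; the task names and the `run_workflow` helper are assumptions made for the example.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks: each callable receives the results of finished tasks.
tasks = {
    "extract": lambda done: [3, 1, 2],
    "transform": lambda done: sorted(done["extract"]),
    "load": lambda done: len(done["transform"]),
}

# Dependencies: task name -> set of tasks that must finish first.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_workflow(tasks, dependencies):
    """Run every task once, in an order consistent with the graph."""
    # static_order() raises CycleError if the graph contains a cycle,
    # which is why the graph has to be acyclic.
    order = list(TopologicalSorter(dependencies).static_order())
    done = {}
    for name in order:
        done[name] = tasks[name](done)
    return order, done


order, done = run_workflow(tasks, dependencies)
print(order)
print(done["load"])
```

A real orchestrator adds scheduling, retries, and parallel execution of independent tasks on top of this ordering step; the sketch runs everything sequentially in a single process.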

Managed providers

Three notable providers offer ancillary services around the core open-source project. Astronomer has built a SaaS tool and a Kubernetes-deployable Airflow stack that assists with monitoring, alerting, DevOps, and cluster management. [8] Cloud Composer is a managed version of Airflow that runs on Google Cloud Platform (GCP) and integrates with other GCP services. [9] Since November 2020, Amazon Web Services has offered Amazon Managed Workflows for Apache Airflow (MWAA). [10]

References

  2. "Apache Airflow". Apache Airflow. Archived from the original on August 12, 2019. Retrieved September 30, 2019.
  3. Beauchemin, Maxime (June 2, 2015). "Airflow: a workflow management platform". Medium. Archived from the original on August 13, 2019. Retrieved September 30, 2019.
  4. "Airflow". Archived from the original on July 6, 2019. Retrieved September 30, 2019.
  5. "Apache Airflow".
  6. Trencseni, Marton (January 16, 2016). "Airflow review". BytePawn. Archived from the original on February 28, 2019. Retrieved October 1, 2019.
  7. "AirflowProposal". Apache Software Foundation. March 28, 2019. Retrieved October 1, 2019.
  8. Lipp, Cassie (July 13, 2018). "Astronomer is Now the Apache Airflow Company". americaninno. Retrieved September 18, 2019.
  9. "Google launches Cloud Composer, a new workflow automation tool for developers". TechCrunch. Retrieved September 18, 2019.
  10. "Introducing Amazon Managed Workflows for Apache Airflow (MWAA)". Amazon Web Services. November 24, 2020. Retrieved December 17, 2020.