Apache Airflow is an open-source workflow management platform for data engineering pipelines. It started at Airbnb in October 2014 as a solution to manage the company's increasingly complex workflows. Creating Airflow allowed Airbnb to programmatically author and schedule their workflows and monitor them via the built-in Airflow user interface. From the beginning, the project was made open source, becoming an Apache Incubator project in March 2016 and a top-level Apache Software Foundation project in January 2019.
Airflow is written in Python, and workflows are created via Python scripts. Airflow is designed under the principle of "configuration as code". While other "configuration as code" workflow platforms exist that use markup languages such as XML, using Python allows developers to import libraries and classes to help them create their workflows.
Overview
Airflow uses directed acyclic graphs (DAGs) to manage workflow orchestration. Tasks and dependencies are defined in Python, and then Airflow manages the scheduling and execution. DAGs can be run either on a defined schedule (e.g. hourly or daily) or based on external event triggers (e.g. a file appearing in Hive). Previous DAG-based schedulers like Oozie and Azkaban tended to rely on multiple configuration files and file system trees to create a DAG, whereas in Airflow, DAGs can often be written in one Python file.
Managed providers
Three notable providers offer ancillary services around the core open source project.
* Astronomer has built a SaaS tool and a Kubernetes-deployable Airflow stack that assist with monitoring, alerting, DevOps, and cluster management.
* Cloud Composer is a managed version of Airflow that runs on Google Cloud Platform (GCP) and integrates well with other GCP services.
* Amazon Web Services has offered Managed Workflows for Apache Airflow since November 2020.