Airflow

Airflow is an open source platform for building, scheduling, and monitoring workflows such as ELT pipelines.

edu uses Airflow to manage the flow of data from Ed-Fi API(s) and other sources into a data lake, and to run dbt transformations to build the data warehouse.

Background

Airflow started at Airbnb in 2014 as a tool for managing the company's data workflows. It is currently one of the most popular and most mature workflow orchestration tools; alternatives such as Dagster and Prefect are newer entrants in the space.

Structure

Airflow organizes workflows around the idea of a DAG (directed acyclic graph), in which tasks are linked together by dependency relationships. Airflow then handles running the tasks in dependency order.
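As a concrete illustration, here is a minimal sketch of a DAG definition with two dependent tasks. The DAG id, schedule, and commands are placeholders, and the import paths assume Airflow 2.x:

```python
# A minimal DAG sketch: two tasks linked by a dependency.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_elt",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # `schedule_interval` on pre-2.4 releases
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")

    # `>>` declares a dependency: `load` runs only after `extract` succeeds.
    extract >> load
```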

The two main components of Airflow are a scheduler, which runs DAGs and tasks according to the schedules you specify, and a web server, which provides a user interface for monitoring the status of DAGs and tasks.

Technology stack

DAGs and tasks in Airflow are written in Python. Airflow also provides an extensive library of operators and hooks for connecting to various data sources, invoking command-line programs, and more.
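For example, here is a hedged sketch of that convenience layer in use: a hook fetching rows from a configured database connection. It assumes the `apache-airflow-providers-postgres` package is installed and that a connection with the (hypothetical) id `warehouse` has been defined in Airflow:

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

def row_count(table: str) -> int:
    """Count rows in a warehouse table; typically called from a task."""
    hook = PostgresHook(postgres_conn_id="warehouse")  # hypothetical conn id
    # get_records() runs the query against the connection and returns the
    # result set as a list of tuples.
    return hook.get_records(f"SELECT COUNT(*) FROM {table}")[0][0]
```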

Under the hood, Airflow stores task and scheduling metadata in a database: either a local file (via SQLite) or a PostgreSQL backend. Because SQLite limits Airflow to the sequential executor, a PostgreSQL backend is required for concurrent task execution.
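A sketch of the relevant `airflow.cfg` settings for that setup, assuming Airflow 2.3+ (older releases keep `sql_alchemy_conn` under `[core]`); the connection URI is a placeholder:

```ini
[database]
# Point the metadata database at PostgreSQL instead of the default SQLite file.
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

[core]
# LocalExecutor runs tasks concurrently on a single machine.
executor = LocalExecutor
```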

Airflow may also be run on Kubernetes, with the possibility of running each task in a short-lived container. edu, however, runs Airflow on a single instance for ease of deployment.

Features

Best practices