
An engineering guide to understanding Airflow

Apache Airflow is a framework for orchestrating different types of tasks and distributing them across workers.

Structure

Airflow is designed for very large organizations with many data pipelines, so it splits its work across different processes. The main components are:

  • WebServer - Responsible for the web UI and visualization;
  • Scheduler - Responsible for scheduling and distributing tasks;
  • Metadata Database - Responsible for keeping all the information and states of tasks;
  • Broker - When using the Celery executor, a broker such as Redis handles communication between nodes;
  • Executors - Used to process the tasks.

Each of these services runs as a separate process.

Dags and Tasks

Airflow uses a Directed Acyclic Graph (DAG) to define a collection of tasks that you want to run. Each task is written in Python, for example:

    from airflow.operators.bash import BashOperator

    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
    )

Tasks are defined through operators. There are three types of operators (a short sketch follows the list):

  • Action operators - PythonOperator, BashOperator - They execute actions;
  • Transfer operators - Operators that transfer data from one source to another;
  • Sensor operators - Operators that wait for a condition to be met.
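
Below is a minimal sketch of what an operator of each kind can look like, assuming Airflow 2.x import paths; the DAG name, file path, and callables are illustrative only. Transfer operators ship in provider packages, so only a comment stands in for them:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator
    from airflow.sensors.filesystem import FileSensor

    with DAG('operator_examples', start_date=datetime(2021, 1, 1), catchup=False) as dag:
        # Action operator: runs a shell command.
        print_date = BashOperator(task_id='print_date', bash_command='date')

        # Action operator: calls a Python function.
        say_hello = PythonOperator(
            task_id='say_hello',
            python_callable=lambda: print('hello'),
        )

        # Sensor operator: waits until a file exists before downstream tasks run.
        wait_for_file = FileSensor(
            task_id='wait_for_file',
            filepath='/tmp/data_ready.csv',
            poke_interval=60,
        )

        # Transfer operators (e.g. database-to-database copies) live in
        # provider packages and are not shown here.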

To group them and create the DAG we use:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.utils.task_group import TaskGroup

    default_args = {'retries': 1}  # simplified; the tutorial defines more defaults

    with DAG(
        'tutorial',
        default_args=default_args,
        description='A simple tutorial DAG',
        schedule_interval=timedelta(days=1),
        start_date=datetime(2021, 1, 1),
        catchup=False,
        tags=['example'],
    ) as dag:
        # put your tasks here
        with TaskGroup('test_subgroup') as test_subgroup:
            # put your subgroup tasks here
            ...
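
As a usage sketch, here is one hedged way the placeholders above can be filled in, assuming Airflow 2.x; the task names and commands are made up:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.utils.task_group import TaskGroup

    with DAG(
        'tutorial_with_group',
        description='Hypothetical DAG with a TaskGroup',
        schedule_interval=timedelta(days=1),
        start_date=datetime(2021, 1, 1),
        catchup=False,
    ) as dag:
        start = BashOperator(task_id='start', bash_command='echo start')

        with TaskGroup('test_subgroup') as test_subgroup:
            step_a = BashOperator(task_id='step_a', bash_command='echo a')
            step_b = BashOperator(task_id='step_b', bash_command='echo b')
            step_a >> step_b  # order inside the group

        # the whole group behaves like a single task in dependency expressions
        start >> test_subgroup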

Inside the DAG, after all the tasks are coded, we need to give them an order. To do that you can use:

    t1 >> [t2, t3] >> t4

meaning that t1 is executed first, then t2 and t3 are executed in parallel, and finally t4 is executed.
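
The same ordering can also be expressed without the >> shorthand. This is a self-contained sketch assuming Airflow 2.x; the DAG name and commands are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG('ordering_example', start_date=datetime(2021, 1, 1), catchup=False) as dag:
        t1, t2, t3, t4 = [
            BashOperator(task_id=f't{i}', bash_command='date') for i in range(1, 5)
        ]

        # explicit method calls: t2 and t3 after t1, then t4 after both
        t1.set_downstream([t2, t3])
        t4.set_upstream([t2, t3])

        # airflow.models.baseoperator.chain(t1, [t2, t3], t4) expresses the same thing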

Code examples adapted from the Airflow Tutorial

Executors

In its default configuration, Airflow executes one task after another using the SequentialExecutor. You can change the configuration to the LocalExecutor to leverage multiple processes on the same machine, or to the CeleryExecutor, which can span several nodes/machines with as many processes as needed. Below is a list of the most common Airflow executors (a small sketch for checking this setting follows the list):

  • Sequential Executor - 1 task at a time, serialized;
  • Local Executor - many processes same machine/node;
  • Celery Executor - many machines/nodes with many processes;
  • Kubernetes Executor - runs on Kubernetes cluster.
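
As a small, hedged illustration of where this setting lives, the effective value can be read back from Airflow's configuration on a local installation:

    from airflow.configuration import conf

    # 'SequentialExecutor' unless changed in airflow.cfg ([core] executor = ...)
    # or via the AIRFLOW__CORE__EXECUTOR environment variable.
    print(conf.get('core', 'executor'))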

Concurrency configurations

Airflow configures concurrency globally; these configurations are located in the airflow.cfg file. The main ones are the following (a per-DAG sketch follows the list):

  • parallelism - The max number of task instances that can run at once across the whole installation;
  • dag_concurrency - The max number of task instances allowed to run concurrently per DAG;
  • max_active_runs_per_dag - The max number of active runs per DAG.
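
These global limits can also be tightened per DAG. This is a hedged sketch assuming Airflow 2.x; the parameter names have shifted across versions (e.g. `concurrency` became `max_active_tasks` in newer releases), so check the version you run:

    from datetime import datetime, timedelta

    from airflow import DAG

    with DAG(
        'limited_dag',
        start_date=datetime(2021, 1, 1),
        schedule_interval=timedelta(days=1),
        catchup=False,
        max_active_runs=1,  # cap the number of concurrent runs of this DAG
        concurrency=4,      # cap the number of concurrent task instances of this DAG
    ) as dag:
        ...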

Metadata database and communications

Airflow comes by default with a SQLite database, which can handle only one write at a time, so if you are going to use concurrency you must switch to a different database, e.g. PostgreSQL. The main configurations regarding these connections are (example URL formats follow the list):

  • sql_alchemy_conn - Connection string for the metadata database;
  • broker_url - Connection string for the broker (e.g. Redis, when using the Celery executor);
  • result_backend - Connection string for the Celery result backend.
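
For orientation, these are the URL formats those settings typically expect; the hosts, credentials, and database names below are placeholders:

    # Metadata database (SQLAlchemy URL), e.g. PostgreSQL via psycopg2:
    sql_alchemy_conn = "postgresql+psycopg2://airflow:airflow@localhost:5432/airflow"

    # Celery broker, e.g. Redis:
    broker_url = "redis://localhost:6379/0"

    # Celery result backend, stored in a database:
    result_backend = "db+postgresql://airflow:airflow@localhost:5432/airflow"

    # In airflow.cfg these live under the [core]/[database] and [celery] sections and
    # can also be overridden with AIRFLOW__<SECTION>__<KEY> environment variables.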

Links