Airflow 2.0 and Why We Are Excited at Databand

2020-12-30 16:25:08

Airflow 2.0 is here and we are so excited about the new release! There is a bevy of new features that have solved a lot of infrastructure pain points in Airflow and have introduced really solid foundations for new enhancements going forward.

New features in the Airflow 2.0 release include improvements in how you author DAGs, a faster scheduler, and a new stable REST API for accessing Airflow data. The graph view in the Airflow 2.0 UI also now auto-updates so you can watch tasks progress in real time, a great improvement in frontend usability.
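For example, the new stable REST API exposes pipeline metadata to any HTTP client. Here is a minimal sketch using `requests`, assuming a hypothetical local webserver with basic auth enabled (the URL, port, and credentials are placeholders, not a real deployment):

```python
import requests

def list_dags(base_url="http://localhost:8080/api/v1",
              auth=("admin", "admin")):
    """Return the dag_ids known to an Airflow 2.0 webserver
    via the stable REST API."""
    # GET /dags responds with {"dags": [...], "total_entries": N}
    response = requests.get(f"{base_url}/dags", auth=auth)
    response.raise_for_status()
    return [dag["dag_id"] for dag in response.json()["dags"]]
```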

Our team here at Databand is proud to have contributed to two major features in this release: Decorated Flows and scheduler performance. We have a couple of earlier blog posts that cover these features in detail, but wanted to provide a quicker overview alongside the actual 2.0 release.

Scheduler Performance Improvements

For heavy Airflow users, performance issues often appear when you need to run hundreds of workflows with dozens of tasks. For teams with more serious pipeline requirements, these performance limitations present a major impediment to scaling Airflow as the central orchestrator and scheduler.

To help Airflow handle more load on the scheduler, we focused our performance optimizations on the time when the Python interpreter was idle. We were especially interested in the time when the scheduler waits for a response from Airflow’s metadata database. We used event listeners on the SQLAlchemy engine to measure the number and timing of queries performed by functions and other system processes.
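As a rough illustration of that technique (a simplified sketch, not our exact benchmarking harness), SQLAlchemy's cursor-execute events can count and time every query an engine issues:

```python
import time

from sqlalchemy import create_engine, event, text

# In-memory SQLite engine standing in for Airflow's metadata database.
engine = create_engine("sqlite://")
stats = {"count": 0, "total_time": 0.0}

@event.listens_for(engine, "before_cursor_execute")
def before_cursor_execute(conn, cursor, statement, parameters,
                          context, executemany):
    # Stash a start timestamp on the connection before each query.
    conn.info["query_start"] = time.perf_counter()

@event.listens_for(engine, "after_cursor_execute")
def after_cursor_execute(conn, cursor, statement, parameters,
                         context, executemany):
    # Count the query and accumulate its wall-clock duration.
    stats["count"] += 1
    stats["total_time"] += time.perf_counter() - conn.info["query_start"]

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
    conn.execute(text("SELECT 2"))
```

After the `with` block, `stats` holds the number of queries the engine executed and the total time spent in them; pointing the same listeners at Airflow's metadata database engine surfaces redundant queries the same way.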

Our first scans showed major opportunities for improvement. For example, running the DagFileProcessor process on 200 DAGs with 10 tasks each registered over 1800 queries! It turned out a number of queries were being executed several times, and the root of the performance gains came from eliminating these redundant queries.

Using a DAG file with the following characteristics:

  • 200 DAG objects in one file
  • All DAGs have 10 tasks
  • Schedule interval is set to None for all DAGs
  • Tasks have no dependencies

Testing the DagFileProcessor.process_file method, we got the following results:

[Benchmark chart: DagFileProcessor.process_file timings before and after the optimizations]

Under these conditions we saw a 10x improvement in DAG parsing time, which translates into a significantly faster scheduler and better overall Airflow performance. Important note: this does not mean a 10x speedup in running tasks.

You can learn more about the process behind these performance improvements in our earlier blog post, written in collaboration with our friends and Airflow PMC members, Tomek Urbaszek and Kamil Breguła.

Decorated Flows (previously called Functional DAGs)

In our experience, one of the key areas to improve is how you actually compose DAGs in Airflow. One of Airflow’s main benefits is how flexible it is to define pipelines. But that flexibility also comes with overhead, introducing a lot of scaffolding around your code, especially in complex pipelines with long chains of tasks.

Specifically, Airflow does not have an easy way to pass messages between tasks/operators, so building large pipelines requires a lot of hardcoded definitions of how operators communicate. The current preferred approach is using XCom.

But XCom is not intuitive for every Python developer. Constructing an XCom hierarchy can be hard to reason about and error-prone, and, most problematic of all, it hurts code readability. Airflow 2.0 provides a new way of writing DAGs using a syntax that feels much closer to standard, functional Python, built on Python decorators. You can now write DAGs by simply annotating your Python functions with a decorator that marks each one as an independent task, and connect these tasks together easily through their parameters.

Writing DAGs with XCom

```python
import json

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator
from airflow.providers.http.operators.http import SimpleHttpOperator


def prepare_email(**kwargs):
    # Pull the raw response of the upstream task out of XCom by task_id.
    ti = kwargs['ti']
    raw_json = ti.xcom_pull(task_ids='get_ip')
    external_ip = json.loads(raw_json)['origin']
    # Push each value into XCom under an explicit key for downstream tasks.
    ti.xcom_push(key='subject', value=f'Server connected from {external_ip}')
    ti.xcom_push(
        key='body',
        value=f'Seems like today your server executing Airflow '
              f'is connected from the external IP {external_ip}',
    )


with DAG('send_server_ip', default_args=default_args, schedule_interval=None) as dag:
    get_ip = SimpleHttpOperator(task_id='get_ip', endpoint='get', method='GET')
    email_info = PythonOperator(
        task_id='prepare_email',
        python_callable=prepare_email,
    )
    send_email = EmailOperator(
        task_id='send_email',
        to='[email protected]',
        subject="{{ ti.xcom_pull(key='subject', task_ids='prepare_email') }}",
        html_content="{{ ti.xcom_pull(key='body', task_ids='prepare_email') }}",
    )
    get_ip >> email_info >> send_email
```

Writing DAGs with Decorators

```python
import json
from typing import Dict

from airflow import DAG
from airflow.decorators import task
from airflow.operators.email import EmailOperator
from airflow.providers.http.operators.http import SimpleHttpOperator


@task(multiple_outputs=True)
def prepare_email(raw_json: str) -> Dict[str, str]:
    # Each key of the returned dict becomes its own XCom automatically.
    external_ip = json.loads(raw_json)['origin']
    return {
        'subject': f'Server connected from {external_ip}',
        'body': f'Seems like today your server executing Airflow '
                f'is connected from the external IP {external_ip}',
    }


with DAG('send_server_ip', default_args=default_args, schedule_interval=None) as dag:
    get_ip = SimpleHttpOperator(task_id='get_ip', endpoint='get', method='GET')
    # Passing the operator's output wires up the dependency for us.
    email_info = prepare_email(get_ip.output)
    send_email = EmailOperator(
        task_id='send_email',
        to='[email protected]',
        subject=email_info['subject'],
        html_content=email_info['body'],
    )
```

Highlighting some important differences:

  • Writing DAG tasks as function calls, you can connect operator inputs and outputs just as you would in any Python script
  • The decorator approach provides a more concise, readable way of defining pipelines
  • It saves roughly 40% of the “surrounding code”, letting the user focus on writing business logic rather than orchestration code
  • It enables reusability, letting the user share tasks between different DAGs easily

In the simple comparison above the differences are modest, but as pipelines grow larger and gain more tasks, the readability and manageability improvements become more and more significant.

Decorated Flows provide a cleaner, more concise, and more intuitive way of building pipelines with Airflow. We’re excited to see them in Airflow 2.0, and you can learn more about how Decorated Flows work in our earlier blog post.

Overall we are thrilled about the new Airflow 2.0 release and are excited for more teams to try it. We will continue to contribute at Databand and are excited for the new capabilities on the roadmap! Happy pipelining!

To learn more about Databand and how we help data engineers monitor their data pipelines in one unified platform, sign up for a free trial or request a demo!
