Airflow 2.0 and Why We Are Excited at Databand
Airflow 2.0 is here, and we are so excited about the new release! It brings a bevy of new features that solve long-standing infrastructure pain points in Airflow and lay really solid foundations for enhancements going forward.
New features in the Airflow 2.0 release include improvements to how you author DAGs, better performance of Airflow’s scheduler, and a new REST API for accessing Airflow data. Not to mention the new auto-update feature in the UI’s graph view, which lets you watch tasks progress in real time, a great improvement in frontend usability.
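To give a quick taste of the new stable REST API, here is a minimal sketch that lists the DAGs Airflow knows about. It assumes a local webserver with the basic-auth API backend enabled and hypothetical credentials; adjust for your own deployment.

```python
import requests

# List registered DAGs via the Airflow 2.0 stable REST API.
# Assumes the webserver runs on localhost:8080 with
# [api] auth_backend = airflow.api.auth.backend.basic_auth
# and hypothetical "admin"/"admin" credentials.
response = requests.get(
    "http://localhost:8080/api/v1/dags",
    auth=("admin", "admin"),
)
response.raise_for_status()

for dag in response.json()["dags"]:
    print(dag["dag_id"], "paused:", dag["is_paused"])
```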
Our team here at Databand is proud to have contributed to two major features in this release – Decorated Flows and Scheduler Performance. We have a couple of earlier blog posts that cover these features in detail, but we wanted to provide a quicker overview now that 2.0 has actually been released.
Scheduler Performance Improvements
For heavy Airflow users, performance issues often appear when you need to run hundreds of workflows with dozens of tasks. For teams with more serious pipeline requirements, these performance limitations present a major impediment to scaling Airflow as the central orchestrator and scheduler.
To help Airflow handle more load on the scheduler, we focused our performance optimizations on the time when the Python interpreter was idle. We were especially interested in the time when the scheduler waits for a response from Airflow’s metadata database. We used event listeners on the SQLAlchemy engine to measure the number and timing of queries performed by functions and other system processes.
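As an illustration of that approach (not the exact instrumentation we used), here is a minimal sketch of counting and timing queries with SQLAlchemy event listeners:

```python
import time
from sqlalchemy import create_engine, event

# Placeholder engine; in practice this would be the engine backing
# Airflow's metadata database.
engine = create_engine("sqlite://")
queries = []

@event.listens_for(engine, "before_cursor_execute")
def _start_timer(conn, cursor, statement, parameters, context, executemany):
    # Remember when each statement started executing.
    conn.info.setdefault("query_start_time", []).append(time.monotonic())

@event.listens_for(engine, "after_cursor_execute")
def _record_query(conn, cursor, statement, parameters, context, executemany):
    # Record the statement and how long it took.
    elapsed = time.monotonic() - conn.info["query_start_time"].pop()
    queries.append((statement, elapsed))

# ... exercise the code under test here, then inspect the results:
# print(f"{len(queries)} queries, {sum(t for _, t in queries):.3f}s total")
```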
Our first scans showed major opportunities for improvement. For example, running the DagFileProcessor process on 200 DAGs with 10 tasks each registered over 1800 queries! It turned out that a number of queries were being executed several times, and the bulk of the performance gains came from eliminating these redundant queries.
Using a DAG file with the following characteristics (a sketch of such a file follows the list):
- 200 DAG objects in one file
- All DAGs have 10 tasks
- Schedule interval is set to ‘None’ for all DAGs
- Tasks have no dependencies
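For reference, a DAG file matching those characteristics can be generated with a few lines of Python. The sketch below (with hypothetical DAG and task names) is one way to do it:

```python
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago

# Generate 200 DAGs with 10 independent dummy tasks each and no schedule.
for i in range(200):
    dag = DAG(
        dag_id=f"perf_test_dag_{i}",
        schedule_interval=None,
        start_date=days_ago(1),
    )
    # Expose each DAG at module level so the DAG file processor picks it up.
    globals()[dag.dag_id] = dag
    for j in range(10):
        DummyOperator(task_id=f"task_{j}", dag=dag)
```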
Testing the DagFileProcessor.process_file method against this file, we measured roughly a 10x improvement in DAG parsing time, a significant speedup for the scheduler and for the overall performance of Airflow. Important note: this doesn’t translate into a 10x speedup in running tasks.
You can learn more about the process of making these performance improvements in our earlier blog post, written in collaboration with our friends and Airflow PMC members Tomek Urbaszek and Kamil Breguła – https://databand.ai/improving-performance-of-apache-airflow-scheduler/
Writing DAGs with Decorators

Airflow 2.0 introduces a new, more Pythonic way of authoring DAGs using decorators. To see the difference, here is a simple pipeline written in the classic style, where data is passed between tasks by explicitly pushing and pulling XComs:

```python
def prepare_email(**kwargs):
    ti = kwargs['ti']
    raw_json = ti.xcom_pull(task_ids='get_ip')
    external_ip = json.loads(raw_json)['origin']
    ti.xcom_push(key="subject", value=f'Server connected from {external_ip}')
    ti.xcom_push(key="body",
                 value=f'Seems like today your server executing Airflow is connected from the external IP {external_ip}')


with DAG('send_server_ip', default_args=default_args, schedule_interval=None) as dag:
    get_ip = SimpleHttpOperator(task_id='get_ip', endpoint='get', method='GET', xcom_push=True)

    email_info = PythonOperator(
        task_id="prepare_email",
        python_callable=prepare_email,
        xcom_push=True,
        provide_context=True,
    )

    send_email = EmailOperator(
        task_id='send_email',
        to='[email protected]',
        subject="{{ ti.xcom_pull(key='subject', task_ids='prepare_email') }}",
        html_content="{{ ti.xcom_pull(key='body', task_ids='prepare_email') }}",
    )

    get_ip >> email_info >> send_email
```

And here is the same pipeline written with the new task decorator, where prepare_email becomes a plain Python function that simply returns its outputs:

```python
@task(multiple_outputs=True)
def prepare_email(raw_json: str) -> dict:
    external_ip = json.loads(raw_json)['origin']
    return {
        'subject': f'Server connected from {external_ip}',
        'body': f'Seems like today your server executing Airflow is connected from the external IP {external_ip}',
    }


with DAG('send_server_ip', default_args=default_args, schedule_interval=None) as dag:
    get_ip = SimpleHttpOperator(task_id='get_ip', endpoint='get', method='GET', xcom_push=True)

    email_info = prepare_email(get_ip.output)

    send_email = EmailOperator(
        task_id='send_email',
        to='[email protected]',
        subject=email_info['subject'],
        html_content=email_info['body'],
    )
```
Highlighting some important differences:
- By writing our DAG tasks as function calls, we can connect operator inputs and outputs just as we would in any Python script
- The decorator approach provides a more concise, readable way of defining our pipelines
- We effectively saved writing about 40% of the “surrounding code” — allowing the user to focus on writing business logic rather than orchestration code
- Provides reusability, allowing the user to share tasks between different DAGs easily (see the sketch below)
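For example, here is a minimal sketch (with hypothetical DAG and task names) of the same decorated task being reused in two different DAGs:

```python
from airflow.decorators import task
from airflow.models import DAG
from airflow.utils.dates import days_ago

@task
def normalize(value: int) -> int:
    # Shared business logic, written once and independent of any particular DAG.
    return value * 2

# Calling the decorated function inside a DAG context registers it as a task of that DAG.
with DAG("pipeline_a", schedule_interval=None, start_date=days_ago(1)):
    normalize(1)

with DAG("pipeline_b", schedule_interval=None, start_date=days_ago(1)):
    normalize(2)
```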
In the simple comparison example above the differences are not so large in magnitude, but as a pipeline grows larger and accumulates more tasks, the readability and management improvements become more and more significant.
Decorated Flows provide a cleaner, more concise, and more intuitive way of building pipelines with Airflow. We’re excited to see them in Airflow 2.0. Learn more about how Decorated Flows work in our earlier blog post here – https://databand.ai/streamline-your-pipeline-code-with-functional-dags-in-airflow-2-0/
Overall we are thrilled about the new Airflow 2.0 release and are excited for more teams to try it. We will continue to contribute at Databand and are excited for the new capabilities on the roadmap! Happy pipelining!
To learn more about Databand and how we help data engineers monitor their data pipelines in one unified platform, sign up for a free trial or request a demo!