Everyday Data Engineering: Databand’s blog series – A guide on common data engineering activities, written for data engineers by data engineers.
Monitoring Airflow production pipelines can be really painful. To debug an issue or find the root cause of a failure, a data engineer needs to hop between the Apache Airflow UI, DAG logs, various monitoring tools, and Python code. To make this process smoother, we can use operational dashboards to get a bird’s-eye view of our system, clusters, and overall health.
In this guide, we have gathered some best practices to help you easily answer questions such as: Is our cluster alive? How many DAGs are in the DagBag? Which operators succeeded and which failed lately? How many tasks are running right now? How long did it take for the DAG to complete? And so on.
Data pipeline monitoring is vital for day-to-day data engineering operations. Data can easily break, and it must be monitored regularly in order to be trusted.
Architecture of Airflow Monitoring Solution
In this article we will focus on monitoring and visualizing Airflow cluster metrics. These types of metrics are great indicators of cluster and infrastructure health and should be constantly monitored.
Airflow exposes metrics such as DAG bag size, number of currently running tasks, and task duration continuously while the cluster is running. Airflow uses the StatsD format to expose these metrics, and we will use this in our solution (we’ll get into more details about this below). You can find a list of all the metrics exposed, along with descriptions, in the official Airflow documentation.
Let’s see how we can utilize various open-source tools like StatsD, Prometheus, and Grafana to build a comprehensive monitoring solution for Airflow. With a dashboard built on this trio, you can find out whether the scheduler is running, how many DAGs are currently in the DagBag, or whether there is a critical problem with the cluster’s health.
We’ll start with StatsD. StatsD is a widely used service for collecting and aggregating metrics from various sources. Airflow has built-in support for sending metrics into the StatsD server. Once configured, Airflow will then push metrics to the StatsD server and we will be able to visualize them.
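To make the StatsD protocol concrete, here is a minimal sketch of the plain-text wire format that a StatsD client like Airflow’s emits over UDP: each metric is a single `<name>:<value>|<type>` line, where the type is `g` (gauge), `c` (counter), or `ms` (timer). The `statsd_line` helper is a hypothetical illustration, not Airflow’s actual client code.

```python
# Hypothetical helper illustrating the StatsD wire format (not Airflow's real client).
def statsd_line(name: str, value, metric_type: str, prefix: str = "airflow") -> str:
    """Format one metric in the plain-text StatsD protocol: <name>:<value>|<type>."""
    return f"{prefix}.{name}:{value}|{metric_type}"

print(statsd_line("dagbag_size", 12, "g"))               # gauge:   airflow.dagbag_size:12|g
print(statsd_line("ti_successes", 1, "c"))               # counter: airflow.ti_successes:1|c
print(statsd_line("dag.demo.task.duration", 4.2, "ms"))  # timer:   airflow.dag.demo.task.duration:4.2|ms
```

In practice these lines are sent as UDP datagrams to the StatsD server’s port, which is why metric delivery is fire-and-forget and adds negligible overhead to the cluster.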
Prometheus is a popular solution for storing metrics and alerting. Because it is typically used to collect metrics from other sources, like RDBMSes and webservers, we will use Prometheus as the main storage for our metrics. Because Airflow doesn’t have an integration with Prometheus, we’ll use Prometheus StatsD Exporter to collect metrics and transform them into a Prometheus-readable format. StatsD Exporter acts as a regular StatsD server, and Airflow won’t notice any difference between them.
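Prometheus pulls metrics rather than receiving them, so it needs to be pointed at the exporter’s HTTP endpoint. As a sketch, assuming StatsD Exporter is running locally on its default metrics port 9102, the scrape job in `prometheus.yml` might look like this:

```yaml
scrape_configs:
  - job_name: "airflow-statsd-exporter"
    static_configs:
      - targets: ["localhost:9102"]   # statsd_exporter's default web/metrics port
```

The job name and target are placeholders; adjust them to wherever your exporter actually runs.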
Grafana is our preferred metrics visualization tool. It has native Prometheus support and we will use it to set up our Airflow Cluster Monitoring Dashboard.
The basic architecture of our monitoring solution will look like this:
The Airflow cluster reports metrics to StatsD Exporter, which performs transformations and aggregations and passes them to Prometheus. Grafana then queries Prometheus and displays everything in a gorgeous dashboard. In order for this to happen, we will need to set up all of those pieces.
First, we will configure Airflow, then StatsD Exporter and then Grafana.
Prometheus, StatsD Exporter and metrics mapping
This guide assumes you’re already familiar with Prometheus. If you aren’t yet, it has great documentation and is really easy to get started with, as Prometheus doesn’t require special configuration. StatsD Exporter will be used to receive metrics and provide them to Prometheus. Usually it doesn’t require much configuration, but because of the way Airflow sends metrics, we will need to re-map them.
By default, Airflow exposes a lot of metrics whose names are composed from DAG and task names. For convenience, these metrics should be properly mapped. By utilizing mapping, we can then build Grafana dashboards with per-instance and per-DAG views.
Let’s take the dag.<dag_id>.<task_id>.duration metric as an example. The raw metric name sent by Airflow will look like airflow_dag_sample_dag_dummy_task_duration. For every Airflow instance and every DAG you have, it will report a duration for each task, producing a combinatorial explosion of metric names. For simple DAGs, it’s not an issue. But as tasks add up, things get more complicated, and you don’t want to deal with that in your Grafana configuration.
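To see what the re-mapping does, here is a hypothetical Python re-implementation of the matching logic for this one metric. In StatsD Exporter’s glob syntax, each `*` in `*.dag.*.*.duration` matches exactly one dot-free segment, which we mimic with a regex capture group per segment:

```python
import re

# Each "*" in the exporter pattern "*.dag.*.*.duration" matches one dot-free segment.
PATTERN = re.compile(r"^([^.]+)\.dag\.([^.]+)\.([^.]+)\.duration$")

def map_duration_metric(raw: str):
    """Illustrative sketch of the exporter's mapping: raw name -> (new name, labels)."""
    m = PATTERN.match(raw)
    if not m:
        return None  # metric doesn't match this mapping rule
    airflow_id, dag_id, task_id = m.groups()
    return ("af_agg_dag_task_duration",
            {"airflow_id": airflow_id, "dag_id": dag_id, "task_id": task_id})

print(map_duration_metric("airflow.dag.sample_dag.dummy_task.duration"))
```

Running this yields the new metric name `af_agg_dag_task_duration` with `airflow_id="airflow"`, `dag_id="sample_dag"`, and `task_id="dummy_task"`, which is exactly the shape Prometheus will store.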
To solve this, StatsD Exporter provides a built-in relabeling configuration. There is great documentation and examples of these on the StatsD Exporter page.
Now let’s apply this to our DAG duration metric.
The relabel config will look like this:
```yaml
mappings:
  - match: "*.dag.*.*.duration"
    match_metric_type: observer
    name: "af_agg_dag_task_duration"
    labels:
      airflow_id: "$1"
      dag_id: "$2"
      task_id: "$3"
```
We are extracting three labels from this metric:
- Airflow instance ID (which should be different across the instances)
- DAG ID
- Task ID
Prometheus will then store these labels, and we’ll be able to configure dashboards with instance/DAG/task selectors and provide observability at different levels of detail.
We will repeat the re-labeling config for each metric exposed by Airflow. See the “Source Code” section at the end of the article for a complete example.
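As one more illustration, the same pattern applies to per-DAG run metrics such as dagrun.duration.success.<dag_id>; the mapped name below is an assumption following the same naming convention, not a required value:

```yaml
  - match: "*.dagrun.duration.success.*"
    match_metric_type: observer
    name: "af_agg_dagrun_duration_success"
    labels:
      airflow_id: "$1"
      dag_id: "$2"
```

Every Airflow metric that embeds a DAG or task name in its dotted path gets a rule of this shape.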
Configuring Airflow
A couple of options need to be added to airflow.cfg. Please note that Airflow will fail to start if the StatsD server is not available at start-up time! Make sure you have an up-and-running StatsD Exporter instance.
The very basic config section in airflow.cfg will look like this:
```ini
[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
```
For more details, please refer to the Airflow Metrics documentation.
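If you want to sanity-check that metrics are actually flowing before wiring up the full exporter, a tiny UDP listener on the configured statsd_port will show the raw lines. The sketch below demonstrates the idea on the loopback interface with a hand-sent datagram standing in for Airflow:

```python
import socket

def recv_statsd_line(sock: socket.socket, timeout: float = 2.0) -> str:
    """Receive one StatsD UDP datagram and decode it as text."""
    sock.settimeout(timeout)
    data, _addr = sock.recvfrom(4096)
    return data.decode()

# Bind a UDP "server" on an ephemeral loopback port (in real use: statsd_port 8125).
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
port = server.getsockname()[1]

# Pretend to be Airflow sending one gauge metric.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"airflow.dagbag_size:12|g", ("127.0.0.1", port))

line = recv_statsd_line(server)
print(line)  # airflow.dagbag_size:12|g

client.close()
server.close()
```

Pointing the same listener at port 8125 (with Airflow configured as above and the exporter stopped) would print the real metric stream, which is handy for debugging mapping rules.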
Configuring Grafana Dashboards
Now that we have all our metrics properly mapped, we can proceed to creating the dashboards. We will have two dashboards: one for the cluster overview and another for DAG metrics.
For the first dashboard we will have the Airflow instance selector:
Here you can see all the vital metrics: scheduler heartbeat, DagBag size, queued/running task counts, currently running DAGs aggregated by tasks, etc.:
For the second dashboard we will have the DAG selector:
You can see the DAG-related metrics: successful DAG run duration, failed DAG run duration, DAG run dependency check time, and DAG run schedule delay.
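The Grafana panels behind these views are just PromQL queries over the re-mapped metrics. As a sketch, assuming the exporter turns the observer-type duration metric into a Prometheus summary with `_sum`/`_count` series (its default behavior for timers), a per-DAG average task duration panel could use a query like:

```promql
# Mean task duration per DAG over the last 5 minutes
sum by (dag_id) (rate(af_agg_dag_task_duration_sum[5m]))
  / sum by (dag_id) (rate(af_agg_dag_task_duration_count[5m]))
```

The metric name here depends entirely on your own mapping config; with the mapping shown earlier it is `af_agg_dag_task_duration`, and the `dag_id` label drives the dashboard’s DAG selector.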
The complete source code, including the StatsD mapping config and two Grafana dashboards (one for cluster overview and another for DAG stats), can be found in our GitHub project: https://github.com/databand-ai/airflow-dashboards/.
Airflow provides a good out-of-the-box solution for monitoring DAGs and executions. In this tutorial we configured Airflow, StatsD Exporter, and Grafana to get nice, useful dashboards. Dashboards like these can save a lot of time when troubleshooting cluster health issues, like executors being down or DAG parsing getting stuck on a heavyweight DB query. For more robust and convenient monitoring, alerts should also be configured, but that is out of the scope of this article.
Stay tuned for more guides!
Happy engineering! 🙂
– This guide was written by Vova Rozhkov, Databand’s experienced Software Engineer. To learn more about Databand.ai and how our platform instantly helps data engineers trust their data and gain visibility within their data pipelines, request a demo or sign up for a free trial!