Apache Airflow Monitoring: Best Practices & Beyond

Apache Airflow allows data engineers to programmatically author, schedule, and orchestrate long-running tasks. Best of all, it’s open-source and constantly being improved by the community. But you already knew that.

The goal of this resource is to give you the best practices, strategies, know-how, and tools you’ll need to set up reliable monitoring and observability around Apache Airflow.

You can revisit this table of contents to jump to relevant sections whenever needed. Now, let’s get into it.

The basics of monitoring in Airflow

Monitoring plays a crucial role in data management: it ensures that your systems and processes are performing as expected. By monitoring your data, you can track pipeline performance and help ensure that data is delivered in a way that adheres to your governance policies. These capabilities are especially important for companies that use Airflow to orchestrate and schedule long-running tasks.

Airflow has some built-in monitoring capabilities that can help you do this.

Monitoring UI

Airflow’s native UI lets you visualize your DAG and task statuses. In addition, you can monitor a few native metrics from this UI, but there is a lot of room for improvement (we’ll get into that later). This can help you do some light monitoring and troubleshooting for your DAGs.

DAGs View

The DAGs view provides a list of the DAGs in your environment along with shortcuts to other built-in monitoring capabilities. Here, you can see each DAG’s name, its owner, the statuses of recently executed runs and tasks, and some quick actions.

[Screenshot: Airflow monitoring UI, DAGs view]

If you have multiple teams working in the same environment, you should add tags to your pipelines to make monitoring more manageable. You can add team tags to the DAG object in your DAG file. According to Airflow’s documentation, here’s an example of what that could look like:

dag = DAG(
    dag_id='example_dag_tag',
    schedule_interval='0 0 * * *',
    tags=['example'],
)

You can then filter by tag by typing it into the filter input field at the top of the DAGs view; the filter is saved as a cookie, so it persists between sessions.

Tree View

The Tree View lets you go a level lower into a specific DAG. You can view how all the tasks are ordered within the DAG and the status of each associated run. This lets you see how Run and Task states change over time. The Graph View is a little easier on the eyes, but we generally recommend the Tree View because you can see more than one Run at a time and more quickly identify problems.

[Screenshot: Airflow monitoring UI, Tree View]

While this view can be great, it becomes hard to manage when the DAG is complex and there are many different Runs, especially since the status colors and borders can be difficult to differentiate.

To make this a little easier, Airflow’s Webserver lets you customize the TaskInstance and DagRun state colors. To do so, create an airflow_local_settings.py file and place it on $PYTHONPATH or inside the $AIRFLOW_HOME/config folder (note: Airflow adds $AIRFLOW_HOME/config to $PYTHONPATH when Airflow is initialized). Once you’ve done that, Airflow’s documentation suggests adding the following contents to airflow_local_settings.py:

STATE_COLORS = {
    "queued": "darkgray",
    "running": "#01FF70",
    "success": "#2ECC40",
    "failed": "firebrick",
    "up_for_retry": "yellow",
    "up_for_reschedule": "turquoise",
    "upstream_failed": "orange",
    "skipped": "darkorchid",
    "scheduled": "tan",
}

You can customize the colors however your team prefers. After that, just restart the Webserver to see the changes.

Code View

Going a level deeper, you can also view the pipeline code from Airflow’s UI. While the code technically lives in source control, this view helps you spot errors in logic if you have enough context.

Logging

Most Runs within Airflow are scheduled automatically, without manual interaction, which makes a log of what happened during each run critically important. Luckily, Airflow has some great built-in logging capabilities that make it possible to track down the cause of issues.

All of the logging in Airflow is implemented through Python’s standard logging library. By default, Airflow writes log files from the Webserver, the Scheduler, and the Workers running tasks to the local file system. When a user accesses a log file through the web UI, that action triggers a GET request to retrieve its contents. For cloud deployments, Airflow has community-contributed handlers for logging to cloud storage such as AWS, Google Cloud, and Azure.
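Because task logs flow through the standard library, a task callable can write into the same log stream the UI surfaces, simply by using the "airflow.task" logger. Here’s a minimal sketch; the transform callable and its messages are hypothetical, meant to be wrapped in a PythonOperator:

```python
import logging

# Airflow routes task-level logs through Python's standard logging
# library under the "airflow.task" logger, so anything a task callable
# emits there ends up in the per-task log files the web UI fetches.
logger = logging.getLogger("airflow.task")

def transform(rows):
    """Hypothetical task callable that logs its own progress."""
    logger.info("Received %d rows", len(rows))
    cleaned = [row for row in rows if row is not None]
    dropped = len(rows) - len(cleaned)
    if dropped:
        logger.warning("Dropped %d empty rows", dropped)
    return cleaned
```

Logging at warning level for anomalies (like dropped rows) makes those lines easy to spot when you’re scanning a long task log for the cause of an issue.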

To access your logs from Airflow’s UI, click on the task you are interested in within the Tree View and click the “View Log” button.

Lineage

Data lineage is a relatively young feature in Airflow. That said, a lot of recent development has gone into improving lineage support and making it easier to use. This feature helps you track the origins of data, what happens to it, and where it moves over time. It gives you an audit trail, helps you measure your Airflow environment’s adherence to your data governance policies, and makes it easier to debug your data flows.

This feature is useful when you have multiple data tasks reading from and writing to storage. You define the input and output data sources for each task, and a graph is created in Apache Atlas depicting the relationships between the various data sources. That said, it can be clunky to deal with. You can reference Airflow’s documentation to explore how it works. If you don’t want the headaches, there are some third-party tools (including ours) that make integrating this capability easy.

Challenges for monitoring in Airflow

Apache Airflow is great at doing what it’s built to do: orchestration. But when it comes to monitoring, it can be hard to manage without some finessing.

The truth is: monitoring Airflow can be cumbersome at times. When things go wrong, you’re suddenly jumping between Airflow’s UI, operational dashboards, Python code, and pages of logs (we hope you have more than one monitor to manage all of this). That’s why “logging, monitoring, and alerting” tied for second as an area for improvement in Airflow’s 2020 user survey. What makes monitoring Airflow so difficult? Mainly three reasons that build on each other:

#1 – Airflow has no data awareness

Airflow is very familiar with your data pipelines. It knows all about your tasks, their statuses, and how long they take to run: it has awareness around execution. But it doesn’t know anything about the data moving through your DAGs.

There are plenty of issues that can happen to your data outside of what execution metadata would tell you. What if your data source doesn’t deliver any data for some reason? Airflow would show all green on the Webserver UI, but your data consumer would have stale data in their warehouse. What if data is delivered, but an entire column has missing values? Airflow says everything is all good, but your data consumers have incomplete data. What if data is complete, but an unexpected transformation occurs? This may not cause a task to fail, but inaccurate data will be delivered.
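One way to close that gap yourself is an explicit validation task that fails the run when a “successful” delivery is actually empty or incomplete. A minimal sketch, where check_delivery and its expected columns are hypothetical and meant to be wrapped in a PythonOperator:

```python
# Hypothetical validation step: raising an exception here makes Airflow
# mark the task failed, so an empty or all-null delivery turns the run
# red instead of silently showing green.
def check_delivery(rows, min_rows=1, required_columns=("id", "value")):
    if len(rows) < min_rows:
        raise ValueError(f"Expected at least {min_rows} rows, got {len(rows)}")
    for column in required_columns:
        nulls = sum(1 for row in rows if row.get(column) is None)
        if nulls == len(rows):
            raise ValueError(f"Column '{column}' is entirely null")
    return rows
```

Checks like these are exactly what tools such as Great Expectations or Deequ generalize; hand-rolling one per DAG quickly becomes its own maintenance burden.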

You may be able to set some alerts around Run & Task duration that notify you when something is up. Even so, you wouldn’t have the flexibility you need to cover all of your blind spots, and you would still need to spend time diagnosing the root cause. This brings us to our next point…
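For reference, a duration alert of the kind just mentioned uses Airflow’s SLA mechanism: a per-task sla plus a DAG-level sla_miss_callback. The callback below is a sketch; its message format and receiver are placeholders, not a prescribed implementation:

```python
# Sketch of a native duration alert. Airflow invokes sla_miss_callback
# with this five-argument signature when a task misses its SLA; the
# message format below is a placeholder.
def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    message = f"SLA missed in DAG {dag}: {task_list}"
    # In a real deployment, forward this to email, Slack, PagerDuty, etc.
    return message

# Wiring it up on a DAG definition (sketch):
# dag = DAG(
#     dag_id='example_dag_tag',
#     schedule_interval='0 0 * * *',
#     default_args={'sla': timedelta(minutes=30)},
#     sla_miss_callback=notify_sla_miss,
# )
```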

#2 – Airflow’s current monitoring & alerting capabilities are limited

As we stated earlier: Airflow is great for orchestrating tasks. That’s what it was built to do. Understandably, the community doesn’t put much focus on building out a full-fledged monitoring and observability solution inside Airflow; it falls a little too far outside the original scope of the project. But Airflow couldn’t ship completely bare, so it includes some simple monitoring capabilities around the pipelines and tasks themselves.

Airflow provides a nice high-level overview of your operational metadata like Run & Task states, lets you set up some simple alerting around that metadata, and lets you fetch logs. While that’s useful, it doesn’t give you the context mentioned in the first point. So you’ll need to build out operational dashboards to visualize metrics as a time series and see how your data changes over time. You’ll need to add data quality listeners (Deequ, Great Expectations, cluster policies, callbacks, etc.) to your DAGs to pull metadata about your datasets. Then, once you have metadata and trends to work with, you can create custom alerts. And that brings us to our final point…

#3 – It isn’t easy to integrate into your operational flow

You now have a lot of moving pieces just to monitor your Airflow environments: alerts going to email, operational metadata & logs in the Airflow UI, and metrics reporting in separate dashboards. This might work if your Airflow environments are limited in scope, but if you’re working with 100s of DAGs across multiple teams, it’s a problem. You won’t be able to view the health of your Airflow environments through a single pane of glass. Different teams will use different dashboards, and alerts that don’t route to your organization’s preferred receiver can go unnoticed.
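Task-level callbacks are the usual duct tape here: an on_failure_callback can at least funnel every failure into one message you then route to your preferred receiver. A minimal sketch, with the message format as an assumption rather than a fixed convention:

```python
# Sketch of an on_failure_callback. Airflow passes a context dict with
# the task instance, execution date, and the exception that was raised;
# formatting one message here lets you route every failure to a single
# preferred receiver instead of scattering alerts across email inboxes.
def alert_on_failure(context):
    message = (
        f"Task {context.get('task_instance_key_str')} failed "
        f"on {context.get('execution_date')}: {context.get('exception')}"
    )
    # Post `message` to your Slack/PagerDuty/Opsgenie webhook here.
    return message

# On an operator: PythonOperator(..., on_failure_callback=alert_on_failure)
```

Even with this in place, you’re still maintaining the glue yourself, which is the operational debt described below.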

This operational debt means that it won’t be easy for your engineers to catch issues early and prioritize fixes before you miss your data SLAs.

The benefits of an Airflow monitoring dashboard

The ideal Airflow monitoring dashboard would essentially do the opposite of the three points mentioned above.

You would have a dashboard where you could see performance metrics and trends for operational and dataset metadata. You would be able to set complex alerts based on these trends so you can get ahead of possible SLA misses. You’d be able to centralize your metadata, metric visualization, logs, and alerts in one place so your monitoring capabilities are extensible and efficient.

What we’re describing here isn’t necessarily a monitoring dashboard anymore. That’s a data observability platform.

Airflow monitoring vs. observability

“Data monitoring” lets you know the current state of your data pipeline or your data. It tells you whether the data is complete, accurate, and fresh. It tells you whether your pipelines have succeeded or failed. Data monitoring can show you if things are working or broken, but it doesn’t give you much context outside of that. You know what goes in. You know what comes out. But what happens in between? And why the discrepancy? That’s Airflow monitoring.

Data observability, on the other hand, is a blanket term for monitoring and improving the health of data within applications and systems like data pipelines. Data observability includes all of these activities:

  • Monitoring—a dashboard that provides an operational view of your pipeline or system
  • Alerting—both for expected events and anomalies
  • Tracking—ability to set and track specific events
  • Comparisons—monitoring over time, with alerts for anomalies
  • Analysis—automated issue detection that adapts to your pipeline and data health
  • Next best action—recommended actions to fix errors

By encompassing not just one activity (monitoring) but a whole basket of activities, a data observability platform is much more useful to engineers. Data observability doesn’t stop at describing the problem; it provides context and suggestions to help solve it. When you’re a data-intensive organization, having this level of visibility into your Airflow environments is critical.

What sets the Databand observability platform apart

Databand gives data engineering organizations a streamlined and extensible observability platform for their entire data stack, including their Apache Airflow environments.

When you connect Databand to your Airflow instance, Databand collects and tracks metadata from your DAGs and your data lake & data warehouse tables to give you true end-to-end data observability. By collecting all of your metadata on your pipelines and datasets, your team has the context they need to significantly cut down their time-to-detection and time-to-resolve OKRs. More importantly, they can fix issues before SLAs are missed and bad data infects downstream business processes.

Monitoring UI

Databand allows you to easily and clearly visualize your operational metadata in a time series. You can add filters to view metadata from specific projects, sources, or pipelines.

When there is an error, you can home in on the issue from the Run View. Databand gives you insights into the data running through your pipelines and what the root cause of your data issue is.

Lineage

In the dataset view, you can see what operations are reading and writing into your tables over time. This allows you to see how much data is being moved in and out of the table. If there’s an issue, you can see which pipelines it will affect downstream and their corresponding tables.

Logging

Once you know what the issue is, Databand fetches logs written from Airflow, S3, or wherever the issue occurred, so you can get additional context.

[Screenshot: Databand anomaly detection and log analysis]

Alerting

Setting up complex alerts in Databand is easy: choose a metric, then define a condition. These metrics offer you much more flexibility than native Airflow alerting, and the out-of-the-box metrics provided are usually enough to get you started. If you need to track additional metrics, you can create custom alerts.

Defining alert thresholds is hard when your data is constantly changing. That’s why we built robust anomaly detection functionality into Databand. This way, you can reduce the amount of time your engineers spend manually adjusting alerts based on recent performance trends.

Databand’s alerting engine is built on Prometheus + Alertmanager, which allows us to integrate with Slack, PagerDuty, Opsgenie, and any other receiver found in the Alertmanager documentation.

Conclusion

All in all, Apache Airflow is a great tool that even lets you monitor your DAGs at a small scale. But when you’re using Airflow to orchestrate hundreds or thousands of tasks, it leaves a lot to be desired.

In this guide, we covered some of the basics of Airflow monitoring, best practices for monitoring and observability, and some of the functionality you’ll need to make monitoring Airflow manageable at scale. We hope you found it useful.

Are you interested in seeing how Databand could help you monitor your Airflow environments? Schedule a product demo with the team!

Additional reading & resources

Looking for more? Here are links to some more Airflow & monitoring content!

Apache Airflow monitoring

Apache Airflow best practices