Tools & tips for ensuring the health of your Apache Airflow instances

Databand
2021-12-10 10:40:39

Adopting Airflow often means it becomes the center of the analytics stack: triggering ETLs, running SQL, and training ML models. The schedule defined for each DAG likely comes from downstream SLAs. For instance, if the product team keeps a close eye on product performance and responds to any deviations within six hours, the DAGs feeding its dashboards need to run at least that often.

If anything happens to the core Airflow instance, downstream dependencies suffer. Data isn’t updated and dashboards become stale. Downtime is especially problematic for reverse ETL jobs, which are more likely to touch the end user.

To build trust in data, the underlying system needs to be reliable. How do you define success and health regarding Airflow?

Operational vs. dataset monitoring

There are two components to keep in mind: the data actually running through the system, and the Airflow infrastructure itself. The data may be updating with no downtime, yet be fundamentally inaccurate due to a coding bug. Alternatively, the instance Airflow is running on could be out of memory, so Airflow itself is down.

All rivers flow to the ocean, just like all failures of a pipeline impact downstream data. Alerting as close to the point of failure as possible allows teams to respond more quickly to the issue at hand.

Imagine you receive an alert that the last data point in a table was 12 hours ago. Suppose this table is an aggregation of several upstream tables, which themselves come from several third-party sources, with both queries and ETLs running on Airflow.

It’s not hard to see multiple points of failure here. Just to name a few:

  • One of the ETLs could be silently failing, or there could genuinely be no upstream data, which is likely something someone should know about. For instance, what if someone accidentally turned off all paid marketing for a day?
  • A particular task could be taking much longer than usual, or be stuck in a running state because it never failed out properly.
  • The Airflow instance itself could be out of resources, and not executing any tasks.

A team could spend hours digging into why there’s no data every time there’s an issue. Instead, the same team could implement checks at every transition point once, and then see exactly where the data stops updating.

These checks can take three forms: observing key metrics for all infrastructure, building a data quality system to check expectations and business logic within the pipeline, and alerting on any of the above when something is out of the ordinary.
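As a rough illustration of checking every transition point, the sketch below compares the newest timestamp in each table against a maximum allowed age and reports the most upstream table that has gone stale. The table names, timestamps, and the `find_stale_tables` helper are all hypothetical; in practice the timestamps would come from a query against each table.

```python
from datetime import datetime, timedelta


def find_stale_tables(last_updated, max_age, now):
    """Return the tables whose newest data point is older than max_age.

    last_updated maps table name -> timestamp of its newest row, listed
    from most upstream to most downstream (dicts keep insertion order).
    """
    return [name for name, ts in last_updated.items() if now - ts > max_age]


# Hypothetical pipeline: raw source -> staging -> aggregated table.
now = datetime(2021, 12, 10, 12, 0)
last_updated = {
    "raw_ad_spend": datetime(2021, 12, 10, 11, 30),  # 30 minutes old
    "stg_ad_spend": datetime(2021, 12, 10, 0, 0),    # 12 hours old
    "agg_marketing": datetime(2021, 12, 10, 0, 0),   # 12 hours old
}

stale = find_stale_tables(last_updated, timedelta(hours=6), now)
# The first stale table is the closest one to the point of failure.
first_stale = stale[0] if stale else None
print(first_stale)
```

Here the raw table is fresh while staging and everything after it are stale, which points at the step that loads staging rather than at the third-party source.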

When to rely on observability, data quality, versus alerting

First, let me define what I mean by each of these terms.

Observability is an organization’s ability to easily understand the health of their data pipeline & the quality of the data being moved through those pipelines. Data observability is a blanket term that includes a lot of measures of health, but most notably includes pipeline infrastructure & pipeline metadata tracking.

Data quality involves making sure the data is truly accurate. There are plenty of data quality tools out there to help build extremely granular checks for data.

Alerting is, you guessed it, alerting on issues. Operating in a “silence is good” world is important for productivity and peace of mind. Sending a Slack alert or email when a critical issue arises allows the team to react quickly.

Observability is key to quickly understanding whether pipeline execution is in the realm of what’s expected, at every part of the pipeline. However, relying solely on observability for critical issues can result in missed information. After all, humans aren’t always “observing”; they need to sleep eventually.

In the case of data pipelines, observability, data quality, and alerting all contribute to measures of success best when used together.

While data quality tools address the data itself, they don’t address issues with the infrastructure that actually updates the data. Even if there’s no bug in the code, data won’t be fresh if the infrastructure it runs on isn’t healthy. If that’s Airflow, guard rails need to be in place to monitor it in parallel with the data itself.

Measuring the health of Airflow relies on understanding two key concepts: DAGs, which are entire processes made up of subcomponents, and tasks, which are the subcomponents themselves. The three key infrastructure pieces of Airflow are:

  • The worker, which actually does the heavy lifting of executing the tasks.
  • The scheduler, which controls which tasks are running, queued, and what’s up next.
  • The webserver, which runs the UI of Airflow. (Although losing it is not great for observability, the webserver can go down at any time and tasks will still run.)

Let’s talk about possible points of failure when it comes to Airflow. Just to name a few:

  • Airflow runs on a virtual machine that’s out of memory or other resources.
  • Instead of a single virtual machine, Airflow runs on infrastructure that auto-scales (great!) like AWS Elastic Container Service (ECS), but the auto-scaling threshold is set too low, or the service has already scaled to its maximum capacity.
  • The number of tasks allowed to run concurrently across the entire Airflow instance is too low, so tasks are stuck in a queued state.

All of these points of failure have one thing in common: they impact data freshness. What they don’t have in common is the response that would remediate the issue. Pinpointing the single point of failure in Airflow, just like when troubleshooting data freshness, shortens the time it takes to address the issue.

Now that we’ve explored Airflow’s moving pieces and the tools you’ll need to ensure their health, let’s dive into what to actually measure in more detail.

Apache Airflow measures of success

Alerting on every single nut and bolt of Airflow wouldn’t be an efficient use of time, and is also likely to cause alerting fatigue. Instead, I recommend focusing on ensuring three key pieces of Airflow are working, indicative of overall health.

Make sure your tasks are actually being scheduled

To meet the data SLAs of downstream processes and dashboards, the Airflow scheduler must stay healthy. Even with no bug in the code, resource issues in the infrastructure can prevent the scheduler from queuing tasks. Observing infrastructure health and alerting on resource issues early keeps those SLAs from being jeopardized.

Tools like Grafana, StatsD, and Prometheus can be integrated with Airflow; we wrote about it previously here.

As a dashboarding tool, Grafana integrates well with Airflow core concepts. At Databand, we’ve built out dashboards that you can use in your organization with metrics including the number of tasks failed, scheduled, and running.

Figure: screenshot of the Grafana operational dashboard

On the alerting front, a high number of tasks in a running or queued state indicates the scheduler isn’t changing task state properly. More specifically, aside from task failures, I recommend using Grafana to write alerts for the number of queued tasks exceeding X and the number of running tasks exceeding X. This threshold (X) will vary with the number of enabled DAGs. If you have 100 DAGs, alerting on more than 20 queued tasks at once is a good starting point. Of course, if all DAGs run on the same schedule, many tasks queue at the same moment, so the threshold may need to be higher to indicate a truly critical state.
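The threshold logic can be sketched in a few lines of Python. This is only an illustration of the rule of thumb above, not a Grafana alert definition: the function names are hypothetical, the queued limit simply scales the "20 queued tasks per 100 DAGs" starting point, and the running limit of 32 is an arbitrary placeholder.

```python
def queued_limit_for(enabled_dags, queued_per_100_dags=20):
    """Scale the starting point of 20 queued tasks per 100 enabled DAGs."""
    return max(1, enabled_dags * queued_per_100_dags // 100)


def backlog_alerts(queued, running, queued_limit, running_limit):
    """Return one message for each task count that exceeds its limit."""
    alerts = []
    if queued > queued_limit:
        alerts.append(f"queued tasks: {queued} > {queued_limit}")
    if running > running_limit:
        alerts.append(f"running tasks: {running} > {running_limit}")
    return alerts


# Hypothetical snapshot of task counts for an instance with 100 enabled DAGs.
limit = queued_limit_for(enabled_dags=100)
alerts = backlog_alerts(queued=35, running=12, queued_limit=limit, running_limit=32)
print(alerts)
```

If every DAG shares one schedule, raising `queued_per_100_dags` is the knob that keeps the alert from firing on each scheduled burst.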

Identify the lineage of success

If data quality errors occur, identifying the root cause and impact quickly is a reflection of a data team’s efficiency and organization. Tracking upstream lineage thoroughly will allow for efficient alerting on data input and output outliers.

Thorough data quality testing enables quick understanding. With detailed pipeline tests and lineage between Airflow tasks combined, an issue can be traced to both root cause upstream as well as downstream effects.

In addition to tests, using the appropriate trigger rules in Airflow is key for data quality.

Use the `none_failed` trigger rule for downstream tasks, explained further by Marc Lamberti here. If a task fails, no downstream dependencies should run. The `none_failed` trigger rule ensures this by running a task only when every upstream task has either succeeded or been skipped.

If implementing error alerting as a task in Airflow, use the `one_failed` trigger rule to ensure a failure alert is triggered in all cases.
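To make the difference between these rules concrete, here is a simplified model of how they evaluate the final states of upstream tasks. It is a sketch for reasoning only, not Airflow's scheduler code: the `should_run` function is hypothetical, and it ignores details such as `one_failed` firing before all upstream tasks have finished. In a real DAG the rule is set through the `trigger_rule` argument on an operator.

```python
# Upstream states that count as a failure for the rules modeled here.
FAILED_STATES = {"failed", "upstream_failed"}


def should_run(trigger_rule, upstream_states):
    """Decide whether a task runs, given the final states of its upstream tasks."""
    states = list(upstream_states)
    if trigger_rule == "all_success":
        # Airflow's default rule: every upstream task succeeded.
        return all(state == "success" for state in states)
    if trigger_rule == "none_failed":
        # Every upstream task succeeded or was skipped.
        return all(state not in FAILED_STATES for state in states)
    if trigger_rule == "one_failed":
        # At least one upstream task failed: suits an alerting task.
        return any(state in FAILED_STATES for state in states)
    raise ValueError(f"rule not modeled in this sketch: {trigger_rule}")


# A skipped branch does not block a none_failed task, but a failure does.
print(should_run("none_failed", ["success", "skipped"]))
print(should_run("none_failed", ["success", "failed"]))
# The alerting task runs only when something upstream failed.
print(should_run("one_failed", ["success", "failed"]))
```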

Minimize time between an issue occurring and you knowing about it

Alerts should be set up for critical issues that actually require immediate action. No one can keep up with too many Slack messages. Usually, that results in just muting a channel. Avoid that at all costs.

Sending an alert each time a pipeline is taking 10% longer than usual might not be indicative of an immediate issue, especially if the pipeline succeeds a minute later.

Critical issues, like an Airflow scheduler not queuing tasks, should result in alerts. Non-critical issues, like a pipeline taking 10% longer than usual, should not result in alerts. However, finding that balance is an ongoing process. If a pipeline is taking 40% longer than usual, is that considered critical? Sometimes it can be. Alerting logic should constantly be modified and improved to maximize the percent of alerts that actually need immediate action.
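One way to encode that balance is to compare each run against a baseline and alert only past a tunable threshold. The sketch below assumes a median of recent run durations as the baseline and a 40% slowdown as the critical cutoff; both choices, along with the function names and sample durations, are illustrative assumptions to be adjusted over time.

```python
from statistics import median


def slowdown(history_seconds, current_seconds):
    """Fractional slowdown of the current run versus the median past run."""
    baseline = median(history_seconds)
    return (current_seconds - baseline) / baseline


def should_alert(history_seconds, current_seconds, critical_slowdown=0.4):
    """Alert only when the run is slower than the critical threshold."""
    return slowdown(history_seconds, current_seconds) >= critical_slowdown


# Hypothetical durations of recent successful runs, in seconds (median 600).
history = [600, 620, 580, 610, 590]

print(should_alert(history, 660))  # 10% slower: stays silent
print(should_alert(history, 900))  # 50% slower: alerts
```

Tracking how many fired alerts actually needed immediate action is a reasonable signal for when to move `critical_slowdown` up or down.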

Airflow monitoring: having all of your bases covered

In this article, we covered:

  • Operational vs. dataset monitoring
  • The different types of monitoring you’ll need: Data observability, data quality, and alerting
  • Airflow’s common points of failure
  • What you should track to measure the success of your Airflow DAGs

The approaches outlined will help you get started monitoring Airflow and notifying your team of any failures. Airflow is complex, and alerts can certainly get more granular as your organization grows.