May 5, 2020 By Databand 6 min read


Comprehensive monitoring in Airflow is hard to achieve, and advanced alerting in Airflow is harder still. While Airflow has a great UI for monitoring high-level schedules and task statuses, there’s plenty of room for improvement when it comes to pipeline visibility. In fact, in the 2020 Airflow user survey, nearly half of respondents reported that logging, monitoring and alerting are top challenges or areas for improvement. As power users and contributors ourselves, we’re intimately familiar with these challenges.

This is a big problem because Airflow is a standard in the data engineering toolkit. It has become so important that huge organizations like Adobe, Walmart and Twitter actively contribute to Airflow’s development. These organizations have invested record amounts in their data infrastructure and the creation of data products, which means the stakes are higher than ever for data engineers. They have the difficult task of discovering and fixing data quality problems before those problems reach their consumers.

A data engineer’s primary function is to create trust in their data. As a data engineer, your goal should be building a data architecture that produces data so accurately and efficiently that your data consumers forget your Slack handle and email address. The only way you can do that is by seeing issues in your Airflow pipelines before they affect data delivery. You need to solve problems proactively so that your data SLAs aren’t at risk. In this article, we’ll show you how to implement advanced alerting in Airflow using IBM Databand.

Monitoring for Airflow and 20+ other tools on Databand.ai

Databand.ai is a unified data observability platform that helps you identify and prioritize the pipeline issues that impact your data product the most.

The Databand Platform tracks custom pipeline metadata and provides alerting, trend analysis and debugging tools so that you can compare pipeline performance and quickly discover bottlenecks.

Databand.ai taps into metadata like runtime statuses, logging info and data metrics, and uses that information to build alerts on leading indicators of data health issues. You can proactively maintain the integrity of your data infrastructure and the quality of your data product with Databand: it sends alerts to email, Slack, PagerDuty and other channels, so your DataOps team can get ahead of any pipeline issues.

Early warning signals on long-running processes

Airflow is often used for orchestrating long-running processes that take hours, days or longer to complete.

Let’s take this example pipeline (also called a “DAG”) of a 3-step Python process:
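As a rough sketch, a DAG like this might look as follows. The task names unit_imputation, dedup_records and create_report match the three steps referenced later in this article; the DAG id, schedule and function bodies are illustrative placeholders, and the import path follows Airflow 1.x:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path


    def unit_imputation():
        # Placeholder: replace missing values (NaNs) in the raw data set.
        pass


    def dedup_records():
        # Placeholder: drop duplicate records from the imputed data.
        pass


    def create_report():
        # Placeholder: aggregate the cleaned data into the final report.
        pass


    with DAG(
        dag_id="market_data_report",  # hypothetical name
        start_date=datetime(2020, 5, 1),
        schedule_interval="@daily",
        default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        impute = PythonOperator(task_id="unit_imputation", python_callable=unit_imputation)
        dedup = PythonOperator(task_id="dedup_records", python_callable=dedup_records)
        report = PythonOperator(task_id="create_report", python_callable=create_report)

        impute >> dedup >> report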

What makes a simple pipeline like this a long-running process? A lot of factors could be the culprit, including abnormally large volumes of “big” data, complex task logic, or database connection pooling that leaves tasks waiting in a queue.

Poor data quality and late delivery in critical pipelines spell trouble for your data product. End users such as data scientists, analysts and business users depend on pipelines finishing within a certain window of time.

Being able to receive an early warning signal of pipeline failures or delays is a huge advantage, especially in situations where a late or broken pipeline costs the organization money.

As a real-world example, let’s say our simple pipeline above was built by a stock trading company for analyzing market data. If the job starts at midday using prior day trading data and takes 4 hours to complete, it would need to finish around 4 pm to leave time for analysis and next-day strategy planning. If the job is running late, you want that notification as early in the process as possible.

Whereas an alert later in the process could mean losing an entire day of work, getting an alert at 1 pm stating that tasks are delayed lets you start on a resolution right away. That early notice gives your team the ability to identify the cause of the delay, fix the problem and rerun the job before delivery is due.

Databand.ai can give you that early heads-up so you can intervene and fix the problem fast, without losing time and money on wasted processing. With zero changes to your project’s existing code, you can use Databand to create robust alerting on leading indicators of failures and delays.

Step 1: creating leading indicators of problems

While you can use Databand’s production alerting system to notify on the duration of the overall pipeline, this alert is a lagging indicator of a problem. We want leading indicators, or early warning signs.

A better approach is to use Databand.ai to alert on runtime properties of individual tasks in the pipeline. Databand.ai monitors tasks as they run and sends alerts if there are failures (or other status changes), if a task has an unusual number of retries, or if a task’s duration exceeds a threshold. You can also receive an alert if a task does not start at the expected time. Setting these alerts on task runtime properties gives you insight into when data is at risk of late delivery much earlier than traditional monitoring methods would.

Step 2: diving into the data

So far, you’ve learned how to set run- and task-level alerts. With Databand helping you track those metrics, you’ll have a much stronger detection system for pipeline issues.

What are you missing?

To properly support your data consumers, it’s not enough to know that pipelines will complete on time. You also need to know that the data you are delivering is up to quality standards. Can the end user actually work with it?

Luckily, you can use the Databand logging API to report metrics about your data sets.

Databand can automate this process, but for this example, you’ll define which custom metrics are reported whenever your pipeline runs.

When Airflow runs the pipeline code, Databand’s logging API reports metrics to Databand’s tracking system, where you can alert on things like schema changes, data completeness rules or other integrity checks (see the sketch after the list below).

In our example pipeline, the logging API reports the following metrics:

  • avg score: a custom KPI reported from the create_report task
  • number of columns: schema info about the data set being processed
  • removed duplicates: reported from the dedup_records task
  • replaced NaNs: this metric is reported from the first task in the pipeline, unit_imputation
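As a rough sketch of how metrics like these could be reported with the logging API, assuming the dbnd Python SDK and pandas DataFrames (the fill strategy and the score column are illustrative placeholders, not part of the original example):

    import pandas as pd
    from dbnd import log_metric  # Databand's Python logging API


    def unit_imputation(df: pd.DataFrame) -> pd.DataFrame:
        # Report how many missing values were filled in, then impute them.
        nan_count = int(df.isna().sum().sum())
        log_metric("replaced NaNs", nan_count)
        return df.fillna(0)  # placeholder fill strategy


    def dedup_records(df: pd.DataFrame) -> pd.DataFrame:
        # Report how many duplicate rows were removed.
        deduped = df.drop_duplicates()
        log_metric("removed duplicates", len(df) - len(deduped))
        return deduped


    def create_report(df: pd.DataFrame) -> pd.DataFrame:
        # Report schema info and a custom business KPI.
        log_metric("number of columns", df.shape[1])
        log_metric("avg score", float(df["score"].mean()))  # "score" is a placeholder column
        return df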

Databand.ai can alert on any metrics reported to the system, and you are free to create as many as needed.

Step 3: using metrics as progress checks

Now, it’s time to add data visibility into your alerting process.
With Databand.ai’s advanced alerting and anomaly detection, you can use the same metrics tracking to gain insight into internal task progress. This is particularly useful when you have complex pipelines with lots of tasks and subprocesses, or a few tasks with complex logic inside them.

With conditional alerts, you can create intra-task status checks by tracking whether a metric fails to report within a certain timeframe while a task is running.

In our example, the first task, unit_imputation, reports the number of replaced NaNs in the data set. Looking at historical run trends, you can expect the overall task to complete within 1 hour. Based on where the metrics log sits in the code, the metric is usually reported about 30 minutes after the task starts. You can use that expected behavior to create a conditional alert that gives you great insight into what’s happening inside your process.

Alert Logic:

If duration of unit_imputation task is greater than 45 minutes AND replaced NaNs metric is missing, THEN trigger alert

First, let’s walk through the logic behind this alert.
The NaNs metric should be reported about halfway into the task’s runtime, and the task typically takes 1 hour to complete. The alert adds some buffer to that: you want a notification if the task runs for 45 minutes without reporting the metric. Missing the metric for this long serves as an early warning that the pipeline is hung up on something that could lead to further delays downstream. Since the alert is set on the first task of the pipeline, you have enough time to investigate the issue and restart the process after a fix.
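The alert itself is configured in Databand, but the underlying rule boils down to a simple check. Here is a hypothetical sketch of that logic, purely for illustration (this is not Databand’s configuration API):

    from datetime import datetime, timedelta


    def should_alert(task_started_at: datetime, reported_metrics: dict, now: datetime) -> bool:
        # Trigger when unit_imputation has run 45+ minutes without reporting "replaced NaNs".
        running_for = now - task_started_at
        metric_missing = "replaced NaNs" not in reported_metrics
        return running_for > timedelta(minutes=45) and metric_missing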

And what was required to set this up? Only a single use of the logging API.

Databand’s alerting framework configures the rest in minutes.

Airflow alerting for staying proactive, not reactive

After following the provided logic, your alerting coverage includes:

  • Overall run durations
  • Task durations
  • Data quality metrics
  • Internal task progress

You’re now in a much better position to anticipate failures, delays and problems in data quality before they affect your data product.

See how Databand brings all your Airflow observability activities together to simplify and centralize your monitoring. If you’re ready to take a deeper look, book a demo today.
