As a data engineer, you need to build trust in your data. That means staying aware of issues in pipeline timeliness and data quality so you can proactively solve problems and keep your data consumers informed.
Databand is a DataOps system for monitoring pipelines. The product tracks pipeline metadata and provides alerting, trend analysis, and debugging tools. Databand can tap into metadata like runtime statuses, logging info, and data metrics, and use that info to build advanced alerts that help you stay on top of your data infrastructure. Databand alerts can be sent to email, Slack, PagerDuty, and other systems used by Ops teams.
In this post, we’ll describe how we can use Databand for stronger logging, monitoring, and alerting on Apache Airflow — one of our favorite orchestration tools.
What is Airflow?
Airflow is an open-source tool for orchestrating code-based workflows and data processing pipelines. Engineers use Airflow to run batch processes — SQL jobs in Snowflake, big data jobs in tools like Spark, straightforward Python scripts, and more. Airflow has become a standard in the data engineering toolkit, used by early-stage startups and Fortune 50 enterprises alike.
Airflow has a great UI for monitoring high-level schedule and task statuses, but there’s a lot of room for users to improve their pipeline visibility. In the 2019 Airflow user survey, nearly half of respondents reported that logging, monitoring, and alerting on Airflow are top challenges or areas for improvement. As Airflow power users and contributors ourselves here at Databand, we’re intimately familiar with these challenges.
Early Warning Signals on Long Running Processes
Airflow is often used for orchestrating long running processes that take hours, days, or longer to complete.
Let’s take this example pipeline (called a “DAG” in Airflow lingo) of a three-step Python process:
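The original post shows the DAG as a screenshot. As a minimal sketch, the three steps might look like the following plain Python functions (the task names match the metrics discussed later in this post; the function bodies are placeholder logic, not the real pipeline):

```python
# Sketch of the three pipeline steps as plain Python functions.
# In Airflow, each would typically be wrapped in a PythonOperator task,
# chained as unit_imputation >> dedup_records >> create_report.

def unit_imputation(records):
    """Replace missing (None) scores with a default value."""
    return [
        {**r, "score": r["score"] if r["score"] is not None else 0.0}
        for r in records
    ]

def dedup_records(records):
    """Drop records with duplicate ids, keeping the first occurrence."""
    seen, result = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            result.append(r)
    return result

def create_report(records):
    """Compute a simple KPI over the cleaned data."""
    scores = [r["score"] for r in records]
    return {"avg_score": sum(scores) / len(scores)}

data = [
    {"id": 1, "score": 0.9},
    {"id": 1, "score": 0.9},   # duplicate record
    {"id": 2, "score": None},  # missing value
]
report = create_report(dedup_records(unit_imputation(data)))
print(report)  # → {'avg_score': 0.45}
```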
What makes even a simple pipeline like this a long running process? It could be many factors, including big data volumes, complex task logic, or database connection pools causing tasks to wait in a queue.
In most data teams, pipelines like this are expected to finish within a certain window of time so data consumers (data scientists, analysts, business users) have fresh, accurate data.
Having an early warning signal of pipeline failures or delays is a huge value-add. With an early heads up, you can proactively intervene, fix the problem, and avoid hours of wasted processing.
As a real world example, let’s say our simple pipeline above was built by a stock trading company for analyzing market data. If the job starts at midday using prior day trading data and takes 4 hours to complete, it would need to finish around 4pm to leave time for analysis and next-day strategy planning. If the job is running late, you want that notification as early in the process as possible. Getting an alert at 1pm that tasks are delayed will give more time to intervene, fix the problem, and rerun the job, whereas an alert later in the process could mean losing an entire day of work.
With zero changes to our project code, we can use Databand to create robust alerting that gives us early indicators of failures and delays.
First off, we can use Databand’s production alerting system to notify on the duration of the overall pipeline. If the pipeline tends to complete in 4 hours, we’ll get an alert when a run exceeds that window. While a good place to start, the problem with this alert is that it’s a lagging indicator of a problem. We want leading indicators: early warning signs.
Step 1 — Creating Leading Indicators of Problems
A better approach is to use Databand to alert on runtime properties of individual tasks in the pipeline. Databand will monitor tasks as they run and fire notifications if there are failures (or some other status change), if a task has an unusual number of retries, or if task duration exceeds some threshold. We can also alert if a task does not start at the expected time. Setting these alerts on task runtime properties will give us much better early warning signals of when the pipeline is behaving badly.
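Databand configures these alerts through its product rather than code. Purely to illustrate the kinds of conditions involved, here is a hypothetical evaluation of task runtime properties (the `TaskRun` structure and thresholds are made up for this sketch and are not Databand’s actual API or data model):

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    # Hypothetical runtime properties of a single Airflow task;
    # not Databand's actual data model.
    name: str
    state: str             # e.g. "success", "failed", "running"
    retries: int
    duration_minutes: float

def evaluate_alerts(task, max_retries=2, max_duration_minutes=60):
    """Return the list of alert conditions this task run triggers."""
    alerts = []
    if task.state == "failed":
        alerts.append(f"{task.name}: task failed")
    if task.retries > max_retries:
        alerts.append(f"{task.name}: unusual number of retries ({task.retries})")
    if task.duration_minutes > max_duration_minutes:
        alerts.append(f"{task.name}: duration exceeded {max_duration_minutes} min")
    return alerts

run = TaskRun(name="dedup_records", state="failed", retries=3, duration_minutes=75)
for alert in evaluate_alerts(run):
    print(alert)
```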
Step 2 — Diving into the Data
So far we’ve covered run and task level alerts. Between the two, we’ll have a much stronger detection system for pipeline issues.
What are we missing? To properly support our data consumers, it’s not enough to know that pipelines will complete on time. We also need to know that the data we are delivering meets quality standards, so that consumers can actually work with it.
We can use the Databand logging API to report on metrics about our data sets. Databand provides automated approaches to this, but the most explicit example would be defining custom metrics that are reported whenever our pipeline runs. When Airflow runs the pipeline code, Databand’s logging API will report metrics to Databand’s tracking system, where we can alert on things like schema changes, data completeness rules, or other integrity checks.
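A sketch of what reporting a custom metric from a task might look like. The `log_metric` function here is a local stand-in so the snippet is self-contained; in a real pipeline it would come from Databand’s open-source `dbnd` library rather than being defined inline:

```python
# Stand-in for Databand's logging API, so this sketch runs on its own.
# In a real pipeline you would use the dbnd library instead, e.g.:
#   from dbnd import log_metric
REPORTED_METRICS = {}

def log_metric(key, value):
    # Databand would send this to its tracking system for alerting.
    REPORTED_METRICS[key] = value

def unit_imputation(records):
    """Replace missing values and report how many were replaced."""
    missing = [r for r in records if r["score"] is None]
    log_metric("replaced NaNs", len(missing))
    return [
        {**r, "score": r["score"] if r["score"] is not None else 0.0}
        for r in records
    ]

cleaned = unit_imputation([{"id": 1, "score": None}, {"id": 2, "score": 0.8}])
print(REPORTED_METRICS)  # → {'replaced NaNs': 1}
```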
In the above screenshot, the logging API is reporting the following metrics from the pipeline:
- avg score: a custom KPI reported from the create_report task
- number of columns: schema info about the data set being processed
- removed duplicates: reported from the dedup_records task
- replaced NaNs: reported from the first task in the pipeline, unit_imputation
Databand can alert on any metrics reported to the system, and users are free to create as many as needed.
Step 3 — Using Metrics as Progress Checks
So we can now add data visibility into our alerting process. But we can go even further than that! Using Databand’s advanced alerting features, we can use the same metrics tracking mechanism for gaining insight into internal task progress. This is particularly useful when you have complex pipelines with lots of tasks and subprocesses, or a few tasks with complex logic inside them.
With conditional alerts, we can create intra-task status checks by tracking if a metric fails to report within a running task in a certain period of time.
In our example, the first task, unit_imputation, reports the number of replaced NaNs in our data set. Looking at trends of historical runs, we expect the overall task to complete within 1 hour of time. Based on where we place the metrics log in our code, we expect the metric to be reported about 30 mins after task start. We can use the expected behavior to create a conditional alert that gives us a great signal about what’s happening inside our process.
IF the duration of the unit_imputation task exceeds 45 minutes AND the replaced NaNs metric is missing, THEN trigger an alert
To unpack the logic behind this alert: the NaNs metric should be reported about halfway into the task’s runtime, which typically takes 1 hour in total. Our alert adds some buffer to that, saying we want a notification if the task runs for 45 minutes without sending the metric. A missing metric over this duration serves as an early warning signal that the pipeline is hung up on something, potentially leading to further delays downstream. Since the alert is set on the first task of the pipeline, we have enough time to intervene, check out the issue, and restart the process after a fix.
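The conditional check itself reduces to simple logic. A hypothetical sketch (the function name and thresholds are illustrative, not Databand’s alerting API):

```python
def should_alert(task_duration_minutes, reported_metrics,
                 threshold_minutes=45, expected_metric="replaced NaNs"):
    """Alert if the task has run past the threshold without
    reporting the expected metric."""
    return (task_duration_minutes > threshold_minutes
            and expected_metric not in reported_metrics)

# 50 minutes in, no metric reported yet: the early warning fires.
print(should_alert(50, {}))                     # → True
# Metric arrived on schedule: no alert, even past the threshold.
print(should_alert(50, {"replaced NaNs": 12}))  # → False
```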
And what was required to set this up? Only a single use of the logging API. The rest takes a couple of minutes to configure with Databand’s alerting framework.
Where Are We Now?
At this point, our alerting coverage includes:
- Overall run durations
- Task durations
- Data quality metrics
- Internal task progress
We’re now in a much better position to anticipate failures, delays, and problems in data quality. Databand makes the process easy: the first alerts are up and running with zero changes to our pipelines, and deeper tracking is available through the open-source library and APIs.
Try it Out!
If you’d like early access, a demo, or to try out the system for yourself, send a note to our team at [email protected] or visit us at www.databand.ai.
For GCP Users
If you are using Apache Airflow on GCP, Google Composer, or Google Dataflow, Databand will soon release a GCP Marketplace offering with a free trial to make testing the product on your infrastructure a breeze. Contact the team for early access.