Apache Airflow allows data engineers to author, schedule, and orchestrate long-running tasks programmatically. Best of all, it’s open-source and constantly improved by the community. But you already knew that.
The goal of this resource is to give you the best practices, strategies, know-how, and tools you’ll need to set up reliable monitoring and observability around Apache Airflow.
Now, let’s get into it.
Monitoring plays a crucial role in data management. You can track your pipeline performance to make sure data is delivered in a way that adheres to your governance policies. Monitoring capabilities are especially important for companies that use Airflow to orchestrate and schedule long-running tasks.
Airflow has some built-in monitoring capabilities that can help you do this.
Airflow’s native UI lets you visualize your DAG and task statuses. In addition, you can monitor a few native metrics from this UI, but there is a lot of room for improvement (we’ll get into that later). This can help you do some light monitoring and troubleshooting for your DAGs.
The DAGs view provides you with a list of DAGs in your environment and a set of shortcuts to other built-in monitoring capabilities. Here, you can see each DAG’s name and owner, the statuses of recently executed runs and tasks, and some quick actions.
If you have multiple teams working in this environment, you should add tags to your pipelines to make monitoring more manageable. You can add team tags to the DAG object in your DAG file. According to Airflow’s documentation, here’s an example of what that could look like:
from airflow import DAG

dag = DAG(
    dag_id='example_dag_tag',
    schedule_interval='0 0 * * *',
    tags=['example'],  # add one tag per team or project to filter by it in the UI
)
You can then filter the DAGs view by these tags using the filter input field at the top of the page; the filter is saved as a cookie, so it persists across sessions.
The Tree View lets you go a level lower into a specific DAG. You can view how all the tasks are ordered within the DAG and the status of each associated run, which lets you see how Run and Task states change over time. The Graph View is a little easier on the eyes, but we generally recommend the Tree View because you can see more than one Run at a time and more quickly identify problems.
While this view can be great, it becomes hard to manage when the DAG is complex and there are many different Runs, especially given that the status colors and borders can be difficult to differentiate.
To make this a little easier, Airflow’s Webserver does allow you to customize the TaskInstance and DagRun state colors. To do so, create an airflow_local_settings.py file and place it on $PYTHONPATH or inside the $AIRFLOW_HOME/config folder (note: Airflow adds $AIRFLOW_HOME/config to $PYTHONPATH when Airflow is initialized). Once you’ve done that, Airflow’s documentation suggests you add the following contents to the airflow_local_settings.py file:
STATE_COLORS = {
    "queued": "darkgray",
    "running": "#01FF70",
    "success": "#2ECC40",
    "failed": "firebrick",
    "up_for_retry": "yellow",
    "up_for_reschedule": "turquoise",
    "upstream_failed": "orange",
    "skipped": "darkorchid",
    "scheduled": "tan",
}
You can customize the colors however your team prefers. After that, just restart the Webserver to see the changes.
Going a level deeper, you can also view the DAG’s source code from Airflow’s UI. While the code for your pipeline technically lives in source control, this view helps you find errors in logic if you have enough context.
Most Runs within Airflow are scheduled automatically without manual interaction. This makes having a log of what happened during the run super important. Luckily, Airflow has some great built-in logging capabilities. This makes it possible to find the cause of issues in development environments.
All of the logging in Airflow is implemented through Python’s standard logging library. By default, Airflow writes log files from the Webserver, the Scheduler, and the Workers running tasks to the local filesystem, which means that when a user wants to access a log file through the web UI, that action triggers a GET request to retrieve its contents. For cloud deployments, Airflow has community-contributed handlers for logging to cloud storage such as AWS, Google Cloud, and Azure.
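For example, if you want task logs shipped to cloud storage instead of the local filesystem, you can turn on remote logging in airflow.cfg. Here’s a minimal sketch for S3 on Airflow 2.x (these settings live under the [logging] section); the bucket path and connection ID are placeholders, and the Amazon provider package needs to be installed for the S3 log handler:

[logging]
remote_logging = True
remote_base_log_folder = s3://my-airflow-logs
remote_log_conn_id = aws_default

Equivalent handlers exist for Google Cloud Storage and Azure Blob Storage, each with their own base log folder and connection settings.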
To access your logs from Airflow’s UI, click on the task you are interested in within the Tree View and click the “View Log” button.
Data lineage is a relatively young feature as far as Airflow goes. That said, a lot of development has recently gone into improving lineage support and making it much easier to use. This feature helps you track the origins of your data, what happens to it, and where it moves over time. That gives you an audit trail, helps you measure adherence to your data governance policies, and makes debugging your data flows easier.
This feature is useful when you have multiple data tasks reading from and writing to storage. You define the input and output data sources for each task, and a graph is created in Apache Atlas depicting the relationships between the various data sources. That said, it can be clunky to deal with. You can reference Airflow’s documentation to explore how it works; a small sketch of what it looks like follows below. If you don’t want all the headaches, there are some third-party tools (including ours) that make integrating this capability easy.
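To make this concrete, here is a minimal sketch of declaring lineage on a task with Airflow’s inlets and outlets parameters. The DAG ID, file URLs, and bash command are made up for illustration, and exactly how the resulting graph reaches a backend like Apache Atlas depends on your Airflow version and lineage backend configuration:

from datetime import datetime

from airflow import DAG
from airflow.lineage.entities import File
from airflow.operators.bash import BashOperator

# Hypothetical input and output datasets, used only for illustration
raw_orders = File(url="s3://example-bucket/raw/orders.csv")
clean_orders = File(url="s3://example-bucket/clean/orders.csv")

with DAG(
    dag_id="example_lineage",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
) as dag:
    transform = BashOperator(
        task_id="transform_orders",
        bash_command="echo 'transform step here'",
        inlets=[raw_orders],     # data this task reads
        outlets=[clean_orders],  # data this task writes
    )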
Apache Airflow is great at doing what it’s built to do: orchestration. But when it comes to monitoring, it can be hard to manage without some finessing.
The truth is that monitoring Airflow can be cumbersome at times. When things go wrong, you’re suddenly jumping between Airflow’s UI, operational dashboards, Python code, and pages of logs (we hope you have more than one monitor to manage all of this). That’s why “logging, monitoring, and alerting” was tied for second as an area for improvement in Airflow’s 2020 user survey. What makes monitoring Airflow so difficult? Mainly three reasons that build on each other.
Airflow is familiar with your data pipelines. It knows all about your tasks, their statuses, and how long they take to run. It has awareness around execution. But it doesn’t know anything about the data moving through your DAGs.
There are plenty of issues that can happen to your data outside of what execution metadata would tell you. What if your data source doesn’t deliver any data for some reason? Airflow would show all green on the Webserver UI, but your data consumer would have stale data in their warehouse. What if data is delivered, but an entire column has missing values? Airflow says everything is good, but your data consumers have incomplete data. What if data is complete, but an unexpected transformation occurs? This may not cause a task to fail, but inaccurate data will be delivered.
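One way to partially close this gap yourself is to add an explicit data quality task to the DAG that fails the run when the data looks wrong, so a green run actually means something. Here is a minimal sketch; the dataset path, column name, and thresholds are hypothetical, and in practice you might use a framework like Great Expectations or Deequ instead of hand-rolled checks:

from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator


def alert_on_failure(context):
    # Route this to Slack, PagerDuty, etc. in a real deployment
    print(f"Task failed: {context['task_instance'].task_id}")


def check_orders_quality():
    # Hypothetical dataset path; reading from S3 assumes s3fs/pyarrow are installed
    df = pd.read_parquet("s3://example-bucket/clean/orders.parquet")
    if df.empty:
        raise ValueError("No rows delivered for this run")
    null_ratio = df["customer_id"].isna().mean()
    if null_ratio > 0.01:
        raise ValueError(f"customer_id is {null_ratio:.1%} null")


with DAG(
    dag_id="example_quality_checks",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
) as dag:
    quality_check = PythonOperator(
        task_id="check_orders_quality",
        python_callable=check_orders_quality,
        on_failure_callback=alert_on_failure,
    )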
You may be able to set some alerts around Run & Task duration to get notified that something is up; a rough sketch of what that can look like follows below. That said, you wouldn’t have the flexibility you need to cover all of your blind spots, and you would still need to spend time diagnosing a root cause. This brings us to our next point.
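For the duration angle specifically, Airflow gives you a few native knobs: task-level SLAs, execution timeouts, and DAG run timeouts. Here is a minimal sketch; the DAG ID, thresholds, and notification logic are placeholders:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Replace this with your own notification logic (email, Slack, etc.)
    print(f"SLA missed by tasks: {task_list}")


with DAG(
    dag_id="example_duration_alerts",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    sla_miss_callback=notify_sla_miss,     # called when a task misses its SLA
    dagrun_timeout=timedelta(minutes=50),  # fail the whole run if it drags on
) as dag:
    load = BashOperator(
        task_id="load_warehouse",
        bash_command="echo 'load step here'",
        sla=timedelta(minutes=30),                # expected completion relative to the schedule
        execution_timeout=timedelta(minutes=45),  # hard cap on a single task attempt
    )

Even with these in place, you’re still alerting on symptoms of execution, not on the data itself.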
As we stated earlier: Airflow is great for orchestrating tasks. That’s what it was built to do. Understandably, the community doesn’t focus much on building out a full-fledged monitoring and observability solution inside Airflow; it falls a little too far outside the project’s original scope. But the project couldn’t ship completely bare, so there are some simple monitoring capabilities around the pipelines and tasks themselves.
Airflow provides you with a high-level overview of your operational metadata like Run & Task states. You can set up some simple alerting around that metadata and fetch logs. While that’s great, it doesn’t give you the context mentioned in the first point. So, you’ll need to build out operational dashboards that visualize metrics over a time series to see how your data changes over time (a sketch of exporting Airflow’s metrics for this follows below). You’ll need to add data quality listeners (Deequ, Great Expectations, Cluster Policies, Callbacks, etc.) to your DAGs to pull metadata about your datasets. Then, once you have metadata and trends to work with, you can create custom alerts. And that brings us to our final point.
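For the dashboarding piece, Airflow can emit its internal metrics over StatsD so you can collect them into a time-series system of your choice. Here is a minimal airflow.cfg sketch for Airflow 2.x (these keys sit under the [metrics] section); the host, port, and prefix are placeholders, and the statsd extra (pip install 'apache-airflow[statsd]') needs to be installed:

[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow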
You now have many moving pieces just to monitor your Airflow environments. You have alerts going to email, operational metadata & logs in Airflow UI, and your metrics reporting in separate dashboards. This process might work for you if your Airflow environments are limited in scope, but it’s a problem if you’re working with 100s of DAGs across multiple teams. You won’t be able to view the health of your Airflow environments through a single pane of glass. Different teams will use different dashboards, and alerts that don’t route to your organization’s preferred receiver can go unnoticed.
This operational debt means that it won’t be easy for your engineers to catch issues early and prioritize fixes before you miss your data SLAs.
The ideal Airflow monitoring dashboard would essentially address the three gaps described above.
You would have a dashboard to see performance metrics and trends for operational and dataset metadata. You would be able to set complex alerts based on these trends so you can get ahead of possible SLA misses. You’d be able to centralize your metadata, metric visualization, logs, and alerts in one place, so your monitoring capabilities are extensible and efficient.
What we’re describing here isn’t necessarily a monitoring dashboard anymore. That’s a data observability platform.
“Data monitoring” lets you know the current state of your data pipeline or your data. It tells you whether the data is complete, accurate, and fresh. It tells you whether your pipelines have succeeded or failed. Data monitoring can show you if things are working or broken, but it doesn’t give you much outside context. You know what goes in. You know what comes out. But what happens in between? And why the discrepancy? That’s Airflow monitoring.
Data observability, on the other hand, is a blanket term for monitoring and improving the health of data within applications and systems like data pipelines, and it covers far more than monitoring alone.
By encompassing not just one activity—monitoring—but rather a basket of activities, a data observability platform is much more useful to engineers. Data observability doesn’t stop at describing the problem. It provides context and suggestions to help solve it. Having this level of visibility into your Airflow environments is critical for your data-intensive organization.
Databand gives data engineering organizations a streamlined and extensible data observability platform for their entire data platform, including their Apache Airflow environments.
When you connect Databand to your Airflow instance, Databand collects and tracks metadata from your DAGs and your data lake & data warehouse tables to give you true end-to-end data observability. By collecting all of your metadata on your pipelines and datasets, your team has the context they need to significantly cut down on their time-to-detection and time-to-resolution OKRs. More importantly, they can fix issues before SLAs are missed and bad data infects downstream business processes.
Databand allows you to easily visualize your operational metadata in a time series. You can add filters to view metadata from specific projects, sources, or pipelines.
When there is an error, you can home in on the issue from the Run View. Databand gives you insight into the data running through your pipelines and the root cause of your data issue.
In the dataset view, you can see what operations are reading and writing into your tables over time. This allows you to see how much data is being moved in and out of the table. If there’s an issue, you can see which pipelines it will affect downstream and their corresponding tables.
Once you know what the issue is, Databand fetches the logs written to Airflow, S3, or wherever the issue occurred, so you can get additional context on the problem.
Setting up complex alerts in Databand is easy. All you have to do is choose a metric and then define the alert condition. These metrics offer you much more flexibility than native Airflow alerting, and the out-of-the-box metrics provided are usually enough to get you started. If you need to track additional metrics, you can create custom alerts.
Defining alert thresholds is hard when your data is constantly changing. That’s why we built robust anomaly detection functionality into Databand. This way, you can reduce the amount of time your engineers spend manually adjusting alerts based on recent performance trends.
Databand’s alerting engine is built on Prometheus + AlertManager, which allows us to integrate with Slack, PagerDuty, Opsgenie, and any other receiver found in the AlertManager documentation.
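For context, an AlertManager receiver is just a block of YAML in its configuration. Here is a rough sketch of routing alerts to a Slack channel; the webhook URL and channel name are placeholders, and this illustrates AlertManager generally rather than Databand’s internal configuration:

route:
  receiver: data-eng-slack

receivers:
  - name: data-eng-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
        channel: '#airflow-alerts'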
Looking for more? Here are links to some more Airflow & monitoring content!
Apache Airflow monitoring
Apache Airflow best practices
All in all, Apache Airflow is a great tool that even lets you monitor your DAGs on a small scale. But when you’re using Airflow to orchestrate hundreds or thousands of tasks, its built-in monitoring leaves a lot to be desired.
In this guide, we covered some of the basics of Airflow monitoring, best practices for monitoring and observability, and some of the functionality you’ll need to make monitoring Airflow manageable at scale. We hope you found it useful.
Are you interested in seeing how Databand could help you monitor your Airflow environments? Schedule a product demo with the team!