1. There’s no true way to monitor data quality
Airflow is a workhorse with blinders. It doesn’t do anything to course-correct if things go wrong with the data, only with the pipeline. Virtually every user has had Airflow report that a job completed, then checked the data and found that a column was missing and the results were all wrong, or that no data actually passed through the system.
This is especially true once the data organization matures and you go from 10 directed acyclic graphs (DAGs) to thousands. In that situation, you are likely using those DAGs to ingest data from external data sources and APIs, which makes controlling data quality in Airflow even more difficult. You can’t “clean” the source dataset or implement your governance policies there.
While you can create Slack alerts and check each run manually, to make Airflow a useful piece of your data engineering organization and hit your SLAs, you want to automate quality checks. To do that, you need visibility into not just whether a job ran, but whether it ran correctly and, if it didn’t, why and where the error originated. Otherwise, you’ll be living through Groundhog Day.
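As a rough illustration of what an automated check might look like, here is a minimal sketch in plain Python. It covers only the two failure modes mentioned above: a missing column and a batch with no data. The names `check_batch` and `DataQualityError` are hypothetical, and the batch is assumed to be a list of dictionaries. A real pipeline would likely check far more.

```python
class DataQualityError(Exception):
    """Raised when a batch fails a quality check (hypothetical name)."""


def check_batch(rows, required_columns, min_rows=1):
    """Fail loudly if a batch is too small or lacks an expected column.

    `rows` is assumed to be an iterable of dicts, one per record.
    Returns the number of rows when every check passes.
    """
    rows = list(rows)
    problems = []

    # "No data actually passed through the system."
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(rows)}")

    # "A column was missing."
    for col in required_columns:
        missing = sum(1 for row in rows if col not in row)
        if missing:
            problems.append(f"column {col!r} missing from {missing} row(s)")

    if problems:
        raise DataQualityError("; ".join(problems))
    return len(rows)
```

If a function like this is called from inside a task, the uncaught exception should cause the task to be marked as failed, so a bad batch surfaces as a failed run rather than a silent success. It still says nothing about why the data was bad or where upstream the problem began.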
This is not a simple challenge and, if we’re being candid, it’s why IBM® Databand® was created. Most product observability tools such as Datadog and New Relic were not built to analyze pipelines and can’t isolate where issues originated, group co-occurring issues to suggest a root cause, or suggest fixes.
However, the need for observability is not yet fully understood, even within the Airflow community. Today, only 32% of community survey respondents say they’ve implemented data quality measurement, though the fact that the survey’s drafters are asking is an indication of improvement. They did not ask this question in the 2019 or 2020 surveys.
How does one go about monitoring data quality in Airflow? In truth, Airflow gets you halfway there. As its maintainers point out, “When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.”
Airflow offers that formal representation of code. What you need is an observability tool built specifically to monitor data pipelines. Tools built to monitor products are a halfway measure, yet they are usually part of the journey because teams already hold licenses for them.
We find there are several phases engineering organizations go through on their journey to full observability maturity:
- Pre-awareness: Not monitoring data quality (68% of the Airflow community).
- Duct tape and baling wire: Borrowing product observability tools and making them work, though they may not be ideal.
- Purpose-built solution: Adopting full-pipeline observability tools like Databand to automate alerts, isolate root causes, and fix issues faster. Apply machine learning to expected data parameters, get Slack alerts that indicate missing data or schema changes in Airflow Scheduler, trace issue lineage back, and back-test through historical data.
2. Airflow onboarding is not intuitive
Learning Airflow requires a time investment. Numerous articles and Stack Overflow threads document the travails of developers who get stuck on basic questions, like, “Why did the job I scheduled not begin?” (A common answer: The Airflow Scheduler begins scheduling at the end of the scheduled time period, not the beginning. More on that later.)
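The scheduling behavior in that common answer can be worked through with plain date arithmetic. This is only a sketch of the rule as stated above, not Airflow’s own code, and `first_trigger_time` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def first_trigger_time(period_start, interval):
    """Return when the run covering a period is triggered.

    Per the rule described above, the run for the period that starts at
    `period_start` is scheduled once that period has ended.
    """
    return period_start + interval


# A daily job whose first period starts on 1 January 2021 ...
start = datetime(2021, 1, 1)
# ... is not triggered until the period closes, one day later.
print(first_trigger_time(start, timedelta(days=1)))  # 2021-01-02 00:00:00
```

So a developer who sets a start of 1 January on a daily schedule and expects a run that same day will see nothing happen until 2 January.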
Furthermore, to become competent with Airflow, you will need to learn Celery Executor and either RabbitMQ or Redis, and there is no way around this.
This friction is sufficient that some organizations like the CMS software company Bluecore decided it was easier to essentially code their own Airflow interface. That way, each new developer they hired or assigned wouldn’t have to learn all the new operators, and instead, could rely on the Kubernetes ones they were already familiar with.
These learning hurdles are enough of a recurring problem for the community that “onboarding issues” warranted its own question on Airflow’s 2021 community survey (pictured below).
Among users’ top grievances were “a lack of best practices on developing DAGs” and “no easy option to launch.” This latter issue has been partially addressed in Airflow Version 2.0 (which was released after the survey), but the quick-start setup runs on an SQLite database, where no parallelization is possible and everything happens sequentially.
As Airflow’s Quick Start guide points out, “this is very limiting” and “you should outgrow very quickly.”