6 inconvenient truths about Apache Airflow (and what to do about them)

Databand
2021-12-03 16:54:36

Data teams that work with complex ingestion processes love Apache Airflow.

You can define your workflows in Python, the system has wide-ranging extensibility, and it offers a healthy breadth of plugins. Eighty-six percent of its users say they’re happy and plan to continue using it over other workflow engines. An equal number say they recommend the product. 

But, like all software, and especially the open source kind, Airflow is plagued by a battery of gaps and shortcomings you’ll need to compensate for. For developers just getting acquainted with it, that means the starting is slow and the going is tough. In this article, we discuss those issues and a few possible workarounds.

6 issues with using Airflow

1. There’s no true way to monitor data quality

Airflow is a workhorse with blinders. It course-corrects only when something goes wrong with the pipeline, not when something goes wrong with the data. Virtually every user has seen Airflow report that a job completed, then checked the data and found that a column was missing and the output was wrong, or that no data passed through the system at all.

This is especially true once the data organization matures and you go from 10 directed acyclic graphs (DAGs) to thousands. At that point, you are likely using those DAGs to ingest data from external sources and APIs, which makes controlling data quality in Airflow even more difficult. You can’t “clean” the source dataset or implement your governance policies there.

You can create Slack alerts and check each run manually, but to make Airflow a useful piece of your data engineering organization and hit your SLAs, you want to automate quality checks. To do that, you need visibility into not just whether a job ran, but whether it ran correctly and, if it didn’t, why and where the error originated. Otherwise, you’ll be living through Groundhog Day. (GroundDAG day?)
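As a rough illustration of what an automated check might look like, here is a minimal sketch in plain Python. It covers only the two symptoms described above (no data arriving, or a column going missing). The names `check_batch` and `DataQualityError` are hypothetical, not part of Airflow or Databand, and the sketch assumes each batch is available as an iterable of dictionaries.

```python
from typing import Iterable, Mapping, Set


class DataQualityError(Exception):
    """Raised when a batch fails a quality check (hypothetical name)."""


def check_batch(
    rows: Iterable[Mapping[str, object]],
    required_columns: Set[str],
    min_rows: int = 1,
) -> int:
    """Validate a batch and return its row count.

    Fails loudly if a row lacks a required column or if fewer than
    `min_rows` rows arrived.
    """
    count = 0
    for index, row in enumerate(rows):
        # Compare the expected columns against the keys actually present.
        missing = required_columns - set(row)
        if missing:
            raise DataQualityError(
                f"row {index} is missing columns: {sorted(missing)}"
            )
        count += 1
    if count < min_rows:
        raise DataQualityError(
            f"expected at least {min_rows} rows, got {count}"
        )
    return count


if __name__ == "__main__":
    batch = [{"id": 1, "amount": 9.5}, {"id": 2, "amount": 3.0}]
    print(check_batch(batch, {"id", "amount"}))
```

If a function like this is called from a task’s Python callable, the raised exception fails the task, so the run no longer reports success when the data is wrong. It does not tell you why the data was wrong or where the problem originated.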

This is not a simple challenge and, if we’re being candid, it’s why our founders built Databand. Most product observability tools, such as Datadog and New Relic, were not built to analyze pipelines and can’t isolate where issues originated, group co-occurring issues to suggest a root cause, or suggest fixes.

However, the need for observability is not yet fully understood, even within the Airflow community. Today, only 32% say they’ve implemented data quality measurement, though the fact that the survey’s drafters are asking at all is a sign of improvement. They did not ask this question in the 2019 or 2020 surveys.

How does one go about monitoring data quality in Airflow? In truth, Airflow gets you halfway there. As its maintainers point out, “When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.”

Airflow offers that formal representation of code. What you need is an observability tool built specifically to monitor data pipelines. Tools built to monitor products are a halfway measure, though they are usually part of the journey because teams already hold licenses for them.

We find there are several phases engineering organizations go through on their journey to full observability maturity:

  • Pre-awareness: Not monitoring data quality (68% of the Airflow community).
  • Duct tape and baling wire: Borrowing product observability tools and making them work, though they may not be ideal.
  • Purpose-built solution: Adopting full-pipeline observability tools like Databand to automate alerts, isolate root causes, and fix issues faster. With such a tool, you can apply machine learning to expected data parameters, get Slack alerts that indicate missing data or schema changes in Airflow Scheduler, trace an issue’s lineage back to its source, and back-test against historical data.

2. Airflow onboarding is not intuitive

Learning Airflow requires a time investment. Numerous articles and Stack Overflow threads document the travails of developers who get stuck on basic questions, like, “Why did the job I scheduled not begin?” (A common answer: The Airflow Scheduler begins scheduling at the end of the scheduled time period, not the beginning. More on that later.) 

Furthermore, to run Airflow at scale with the Celery Executor, a common production setup, you will also need to learn Celery and either RabbitMQ or Redis as its message broker.

This friction is significant enough that some organizations, like the CMS software company Bluecore, decided it was easier to essentially code their own Airflow interface. That way, each new developer they hired or assigned wouldn’t have to learn all the new operators and could instead rely on the Kubernetes ones they were already familiar with.

These learning hurdles are enough of a recurring problem for the community that “onboarding issues” warranted its own question on Airflow’s 2021 community survey (pictured below). 

Among users’ top grievances were “a lack of best practices on developing DAGs” and “no easy option to launch.” The latter has been partially addressed in Airflow 2.0 (released after the survey), but the quick-start setup runs on an SQLite database, where no parallelization is possible and everything happens sequentially. As Airflow’s Quick Start guide points out, “this is very limiting” and “you should outgrow [this] very quickly.”

table showing results from Airflow’s 2020 community survey

3. The Airflow Scheduler interval is not intuitive

Airflow’s primary use case is for scheduling periodic batches, not frequent runs, as even its own documentation attests: “Workflows are expected to be mostly static or slowly changing.” This means there are few capabilities for those who need to sample or push data on an ad hoc and ongoing basis, and this makes it less than ideal for some ETL and data science use cases. 

There’s more. As we alluded to earlier, the Airflow Scheduler triggers a schedule_interval job at the end of its schedule interval, not the beginning, which means you’ll be doing more pocket math than you might like and will occasionally find yourself surprised.
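The pocket math can be sketched in a few lines of plain Python. This is an illustration of the end-of-interval behavior described above, assuming a fixed-length interval such as one day. The function names are hypothetical and are not Airflow APIs, and cron-style schedules are not covered.

```python
from datetime import datetime, timedelta
from typing import Tuple


def first_trigger_time(start_date: datetime, interval: timedelta) -> datetime:
    """Return when the first run fires.

    The first run covers [start_date, start_date + interval) and is
    triggered only once that interval has ended.
    """
    return start_date + interval


def interval_for_trigger(
    trigger_time: datetime, interval: timedelta
) -> Tuple[datetime, datetime]:
    """Return the (start, end) of the interval a run fired at
    `trigger_time` is responsible for."""
    return trigger_time - interval, trigger_time


if __name__ == "__main__":
    start = datetime(2021, 12, 1)
    daily = timedelta(days=1)
    # A daily job with a start date of December 1 first fires at
    # midnight on December 2 and processes December 1's data.
    print(first_trigger_time(start, daily))  # 2021-12-02 00:00:00
    print(interval_for_trigger(datetime(2021, 12, 2), daily))
```

In other words, a job you expected to begin at its start date begins one full interval later, which is the usual explanation for a scheduled job that appears not to start.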

image of Airflow documentation

And to properly run those scheduled jobs, you’ll need to learn the Airflow-specific nuances between operators and tasks, how DAGs work, default arguments, the Airflow metadata database, the home directory for deploying DAGs, and the list goes on.

image with list of common Airflow definitions

The fix? You might consider joining the 6% of Airflow users who develop their own graphical user interface and rename the operators in terms that make more sense to them. 

4. No versioning in Airflow Scheduler

You’ll find many traditional software development and DevOps practices missing from Airflow, and a big one of those is the ability to maintain versions of your pipelines. There’s no easy way to document all that you’ve built and, if needed, revert to a prior version. If, for example, you delete a Task from your DAG and redeploy it, you’ll lose the associated metadata on the Task Instance. 

This makes Airflow somewhat fragile, and unless you’ve written a script to capture this yourself, it makes debugging issues much more difficult. It also isn’t possible to backtest candidate fixes against historical data to validate them.
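One way such a capture script might work is to record a fingerprint of each DAG definition at deploy time, alongside the task IDs it contained. The sketch below uses only the Python standard library. The function names and the record layout are hypothetical, and where the record is stored (a table, a file, a commit message) is left open.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import List


def fingerprint_source(source: str) -> str:
    """Return a SHA-256 hex digest of a DAG file's source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def version_record(dag_id: str, source: str, task_ids: List[str]) -> str:
    """Build a JSON record describing one deployed version of a DAG.

    Keeping the task IDs in the record preserves the fact that a task
    existed even after it is deleted from the DAG.
    """
    record = {
        "dag_id": dag_id,
        "fingerprint": fingerprint_source(source),
        "task_ids": sorted(task_ids),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    dag_source = "# contents of a DAG file would go here\n"
    print(version_record("example_dag", dag_source, ["load", "extract"]))
```

Comparing fingerprints across deploys shows when a definition changed. Keeping the DAG files themselves in version control is what makes reverting possible.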

Again, Airflow does provide the formal code representation. Your challenge is applying other software development and DevOps tools to fill in the missing functionality.

5. Windows users can’t use it locally

Not much else to say here. Unless you use specific Docker Compose files, which aren’t part of the main repository, it’s not possible.

6. Debugging is time-consuming

Airflow Scheduler not working? Better refill your coffee. You may have some time-consuming debugging ahead of you.

That’s because, in our opinion, Airflow doesn’t sufficiently distinguish between operators that orchestrate and operators that execute. Many operators do both. And while that may have helped with the initial coding of the platform, it’s a serious flaw that makes pipelines very difficult to debug. If something goes wrong, your developers will have to examine their DataFlow parameters first, then the operator itself, every single time.
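One mitigation teams sometimes apply in their own code is to keep execution logic in plain functions and leave only a thin wrapper for orchestration. The sketch below illustrates that separation under our own assumptions. It is not how Airflow’s built-in operators are structured, and both function names are hypothetical.

```python
from typing import Dict, List


def normalize_amounts(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Execution logic: a pure function with no scheduler dependencies,
    so it can be run and debugged on its own."""
    return [{**row, "amount": round(float(row["amount"]), 2)} for row in rows]


def run_normalize_task(rows: List[Dict[str, object]], **context) -> int:
    """Orchestration wrapper: the only piece a scheduled task would call.

    It accepts and ignores any scheduler-supplied context, delegates the
    work, and returns the number of rows processed.
    """
    result = normalize_amounts(rows)
    return len(result)


if __name__ == "__main__":
    sample = [{"id": 1, "amount": "3.14159"}]
    print(normalize_amounts(sample))
    print(run_normalize_task(sample))
```

With this split, a failure can first be reproduced by calling the plain function directly, which narrows down whether the problem lies in the logic or in how the task was orchestrated.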

For this reason, tools like Databand can be a big help. Databand excels in helping you understand the health of your infrastructure at every level: global Airflow, DAG, task, and user-facing. Instead of spending data engineering time on learning highly specific features, Databand allows data engineers to really focus on solving problems for the business.

Apache Airflow—a stellar option despite flaws

Like any open source contributor who takes time to propose new changes, we hope this article is construed as the love note that it is. We here at Databand are active contributors to the Airflow community and eager to see it grow beyond its existing limitations and to better serve more ETL and data science use cases. 

As we said before, 86% of users plan to stick with it over other workflow engines. Another 86% say they’d highly recommend it. We’re happy to say we belong to both groups: it’s a great tool. And for those of you just getting acquainted with Airflow, know that if you go in with the aforementioned issues in mind, Airflow Scheduler can be well worth the effort.