
Top 6 Airflow Features To Look Out For

Databand
2020-09-25 15:05:29

Airflow’s defining feature is the flexibility to define and execute all of your workflows as code. For an engineer, that much room for configuration is extremely powerful for making Airflow fit your needs, but learning (and implementing) every available Airflow feature is a time-intensive investment.

If you are using Airflow today, it helps to understand how configurable it is and which tools you have at your disposal. In this post, we outline the utilities we have found most useful, to help you make sense of the features Airflow has to offer.

A lot of engineers select Airflow because of its great web UI, but its core orchestration capabilities are likewise powerful, and tapping into more of those Airflow features will help you produce a more optimized infrastructure and higher engineering productivity.

Benefits of code as abstraction layer

Using only code for your data flows improves transparency and reproducibility of failures. When your workflows are automated with only code, your ELT failures are much easier to troubleshoot because no part of the broken process is trapped in a developer’s head.

As Maxime Beauchemin, the creator of Airflow, writes in The Rise of the Data Engineer:

“Code allows for arbitrary levels of abstractions, allows for all logical operation in a familiar way, integrates well with source control, is easy to version and to collaborate on”

 

Challenges of code as abstraction layer

Running your Data Infrastructure as code through Airflow, in conjunction with a suite of Cloud Services, is a double-edged sword. Companies often need to build out Minimum Viable Data Products quickly, such as integrating a new data source, and every hour spent learning and configuring tooling is an hour not spent on business-domain problems. The features below help engineers lean on what Airflow already provides so they can stay focused on those problems.

1. Official Airflow Docker Image


For a long time, Airflow developers had to automate their own environments, use third-party Docker images, or go without any automation in their deployed work. That changed in the 1.10 release series: Airflow 1.10.10, released in April 2020, shipped an official production Docker image. This means that you can run your business’s Airflow code without having to document and automate the process of installing Airflow on a server. Instead, you can pull the official Docker image with this command:

docker pull apache/airflow

And as long as the remaining dependencies are in place, such as a supported metadata database like PostgreSQL or MySQL, you can deploy your Airflow DAGs with minimal headache.

2. Hooks

Hooks are mostly used to connect to databases or external platforms in a safe way. You can store your connection details in the Airflow metadata database, as normal, and Hooks form an abstraction layer between that database and your pipelines.

The main reason to build these abstraction layers, or interfaces, to your platforms is security. Keeping authentication details out of your pipeline code is a security best practice, especially as your Data Pipelines reach into more and more systems.

The other major reason to leverage hooks is an increasingly complex business architecture that spans many services. If you aren’t opposed to looking through some code, Airflow provides many open source hooks for popular tools such as Slack, GCP services, AWS services, Databricks, MongoDB, and Postgres.

An example of a custom hook that takes care of a highly repeatable task would be a Slack Hook connection, sending custom Data Pipeline warning alerts to a Slack channel. You can be proactive about Pipeline execution and latency by building alerts on hooks which connect data that is most important to the business. You can only start to collect data going forward, so the sooner you can start generating a source of record for performance, the better.
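To make the abstraction concrete, here is a minimal sketch of the hook pattern in plain Python. It does not import Airflow: `CONNECTION_STORE`, `get_connection`, and `SlackAlertHook` are hypothetical stand-ins for the metadata database, the connection lookup, and a hook class, and the host and token are made up. The point it illustrates is that pipeline code passes only a connection ID and never touches the secret itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """Stand-in for a connection record stored in the metadata database."""
    conn_id: str
    host: str
    password: str


# Hypothetical in-memory store; in Airflow this lives in the metadata database.
CONNECTION_STORE = {
    "slack_alerts": Connection("slack_alerts", "hooks.example.com", "not-a-real-token"),
}


def get_connection(conn_id: str) -> Connection:
    """Look up credentials by ID so callers never handle them directly."""
    return CONNECTION_STORE[conn_id]


class SlackAlertHook:
    """Illustrative hook: resolves credentials internally and exposes one action."""

    def __init__(self, conn_id: str):
        self.conn_id = conn_id

    def build_alert(self, dag_id: str, message: str) -> dict:
        conn = get_connection(self.conn_id)
        # A real hook would send this request; the sketch only builds it.
        return {
            "url": f"https://{conn.host}/alerts",
            "headers": {"Authorization": f"Bearer {conn.password}"},
            "json": {"text": f"[{dag_id}] {message}"},
        }


# Pipeline code references the connection by name only.
alert = SlackAlertHook("slack_alerts").build_alert(
    "daily_load", "Task ran 3x slower than usual"
)
print(alert["json"]["text"])
```

In a real deployment you would more likely subclass Airflow’s `BaseHook` and use its `get_connection` method, or start from one of the existing open source hooks, rather than write the lookup yourself.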

3. Custom XCom Backends

XCom stands for cross-communication between Airflow tasks. When a task finishes and returns a value, that value can be pushed as an XCom, which makes it available to any other task that wants to pull it for its own execution. This is useful when you specifically need inter-task communication and want to avoid a global setting like Variables and Connections.

You can build your own XCom backend to change how your system serializes and stores task results. This kind of customization is quite technical and adds yet another setting to keep track of in your deployment. However, you can start to collect task metadata from these XCom Backends, which builds observability into your infrastructure.
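As a rough sketch of the idea, the class below records metadata about every value that passes through serialization. In Airflow, a custom backend subclasses `BaseXCom` from `airflow.models.xcom` and is selected with the `xcom_backend` option in the `[core]` section of the configuration. Here the class is kept standalone so it runs without Airflow installed; `MetadataXComBackend` and `METADATA_LOG` are hypothetical names, and the real method signatures differ from this simplified version and vary by Airflow version.

```python
import json
from typing import Any

# Hypothetical sink for task-result metadata; a real setup might write to a
# metrics service or a database table instead.
METADATA_LOG: list = []


class MetadataXComBackend:
    """Stand-in for a custom XCom backend.

    In Airflow this class would subclass BaseXCom; it is kept standalone here
    so the sketch runs without Airflow installed.
    """

    @staticmethod
    def serialize_value(value: Any) -> bytes:
        payload = json.dumps(value).encode("utf-8")
        # Record what passed through, without storing the value itself.
        METADATA_LOG.append({"type": type(value).__name__, "bytes": len(payload)})
        return payload

    @staticmethod
    def deserialize_value(payload: bytes) -> Any:
        return json.loads(payload.decode("utf-8"))


# Simulate one task pushing a result and another task pulling it.
stored = MetadataXComBackend.serialize_value({"rows_loaded": 1200})
pulled = MetadataXComBackend.deserialize_value(stored)
print(pulled, METADATA_LOG)
```

The design choice worth noting is that the metadata is captured as a side effect of serialization, so individual DAGs do not need to change to be observed.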

If you want deeper visibility into pipeline health and data quality, check out Databand. This custom metadata gathering is one of Databand’s coolest features. Databand uses one of several custom APIs to collect metadata from Airflow tasks at the moment of execution. Building out this kind of metadata gathering yourself is really time-intensive. Of course it’s an investment that pays off, but the benefit of Databand is that this observability is built in and available instantly.

4. Variables and Connections

Airflow needs to be able to connect to external data entities: databases, APIs, servers, and so on. Variables and Connections are Airflow features that let you define these connections once, without hard coding them into your workflows every time you need to reach an outside entity.

Variables and connections are a common way in which Airflow communicates with the outside world.

Variables and Connections are a bit like global parameters for your Airflow code base. This becomes especially important given that Airflow uses code as its abstraction layer of choice. By using Variables and Connections, you can more easily remove Data Silos and Intellectual Property (IP) Silos, because sensitive data stays protected while still being available for Engineering use.
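One way to see the "global parameter" idea in practice: Airflow can read a connection from an environment variable named `AIRFLOW_CONN_<CONN_ID>` whose value is a URI. The sketch below parses such a URI with the standard library only, to show which pieces a workflow no longer has to hard code. The connection ID `warehouse_db`, the host, and the credentials are all made up, and `read_connection` is a hypothetical helper, not an Airflow function.

```python
import os
from urllib.parse import urlsplit

# Made-up connection, set here only so the sketch is self-contained; normally
# this would come from the deployment environment, not from code.
os.environ["AIRFLOW_CONN_WAREHOUSE_DB"] = (
    "postgres://etl_user:example-password@db.example.internal:5432/warehouse"
)


def read_connection(conn_id: str) -> dict:
    """Resolve a connection ID to its parts using the environment variable."""
    uri = os.environ[f"AIRFLOW_CONN_{conn_id.upper()}"]
    parts = urlsplit(uri)
    return {
        "conn_type": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "login": parts.username,
        "schema": parts.path.lstrip("/"),
    }


conn = read_connection("warehouse_db")
print(conn["host"], conn["port"], conn["schema"])
```

Workflow code then refers to `warehouse_db` by name, and the actual host and credentials can change per environment without touching a DAG.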

Data Silos

Connecting Data sources to a central Data Warehouse destination gives Analysts transparency across all relevant data.

IP Silos

Transparent and Reproducible workflows reduce human bottlenecks because Data Teams have complete access to the code that broke and the logs it produced. You don’t have to wait on one person because they’re the only one who knows the correct button to push. All stack traces, logs, and error codes are available to see.

5. Packaged DAGs

Packaged DAGs are a way to compress and bundle your Airflow project files into a single zip file. You can use the following steps to zip your Airflow files into a Packaged DAG in a way that still lets them be integrated into your Data Infrastructure’s code base.

 

virtualenv zip_dag
source zip_dag/bin/activate
mkdir zip_dag_contents
cd zip_dag_contents
pip install --install-option="--install-lib=$PWD" my_useful_package
cp ~/my_dag.py .
zip -r zip_dag.zip *

There are several reasons why you might use Packaged DAGs. They are a good fit if you have many Airflow users in a business where development environments can be kept consistent throughout. With Packaged DAGs, versioning and collaboration are much easier, especially when DAGs become complex and span multiple files. You don’t have to send or track project files individually, and you can promote a healthy file structure with feature-based organization.
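The same packaging can be sketched with Python’s standard `zipfile` module, which makes the archive layout explicit: Airflow expects the DAG file to sit at the root of the zip. The file names mirror the shell steps above, the file contents are placeholders, and `package_dag` is a hypothetical helper written for this example.

```python
import tempfile
import zipfile
from pathlib import Path


def package_dag(dag_file: Path, extra_files: list, zip_path: Path) -> list:
    """Write the DAG file at the archive root and helpers under their own folder."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(dag_file, arcname=dag_file.name)
        for extra in extra_files:
            archive.write(extra, arcname=f"{extra.parent.name}/{extra.name}")
    # Re-open the archive to report what it actually contains.
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())


# Build a throwaway project so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    dag_file = root / "my_dag.py"
    dag_file.write_text("# DAG definition would go here\n")
    helpers = root / "my_useful_package"
    helpers.mkdir()
    helper_file = helpers / "__init__.py"
    helper_file.write_text("")
    contents = package_dag(dag_file, [helper_file], root / "zip_dag.zip")

print(contents)
```

Listing the archive contents after writing is a cheap check that the DAG really landed at the root before the zip is shipped to a DAGs folder.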

 

6. Macros

Macros are particularly useful when running ELT jobs that span large amounts of time. Backfills are a good example: they take a parameterized date for every day of data that needs to be moved or transformed. Using an execution date macro in your Data Pipeline allows you to break up your pipelines’ memory and storage workloads into daily partitions. With an execution date macro, you can also more easily pull data from a REST API, because you control the frequency and amount of data you request in each run.
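To illustrate the daily-partition idea, the sketch below expands a templated storage key into one key per day of a backfill. `{{ ds }}` is Airflow’s template variable for the execution date in `YYYY-MM-DD` form. The bucket name, the key layout, and the `backfill_keys` helper are hypothetical, and the sketch uses simple string substitution rather than Airflow’s own Jinja rendering.

```python
from datetime import date, timedelta

# Hypothetical templated key, written the way it might appear in a DAG.
KEY_TEMPLATE = "s3://example-bucket/events/dt={{ ds }}/part.json"


def backfill_keys(start: date, end: date) -> list:
    """Return one key per day in [start, end], substituting ds as YYYY-MM-DD."""
    keys = []
    current = start
    while current <= end:
        keys.append(KEY_TEMPLATE.replace("{{ ds }}", current.isoformat()))
        current += timedelta(days=1)
    return keys


keys = backfill_keys(date(2020, 9, 1), date(2020, 9, 3))
for key in keys:
    print(key)
```

Each run only touches its own day’s key, which is what keeps memory and storage per run bounded no matter how long the backfill window is.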

Macros are used with Jinja templating, which allows you to inject parameters into the strings that form your architecture: organized S3 keys, backfilled data sources, or even real-time metadata about DAG or task instances. Macros such as {{ prev_execution_date_success }} allow you to gather DAG metadata from previous runs. Databand also has three custom APIs, one of which handles DAG metadata. This metadata lets you measure the health of your Data Architecture, both in real time and as trends over time.

Conclusion

With a tool as extensible as Airflow, there also comes the challenge of knowing which features are available. More importantly, there is also the challenge of knowing which Airflow features meet your business needs.

There are features for every level of Data Infrastructure need, from just starting out to advanced observability. If your infrastructure is just starting out and you want to think about the most important global parameters, then macros, variables, and connections are the Airflow features to look at first. If your business is looking to create more observability into your infrastructure, then custom XCom backends are probably a great Airflow feature to look into.

Databand excels in helping you understand the health of your infrastructure at every level: global Airflow, DAG, task, and user-facing. Instead of spending data engineering time on learning highly specific features, Databand allows Data Engineers to focus on business-specific problems.

To learn more about Databand’s Monitor platform and Open Source Library, and how they can help make your data engineering job that much more efficient, Request a Demo here.
