
Everyday Data Engineering: Tips on Developing Production Pipelines in Airflow

Databand
2021-01-29 15:12:50

Everyday Data Engineering: Databand’s blog series – A guide on common data engineering activities, written for data engineers by data engineers. We’ve gathered our team together to write educational data engineering posts with our best practices for your day-to-day use!

There are a lot of guides out there for getting started with Apache Airflow: how to build data pipelines, how to schedule processes, how to integrate with various data systems like Snowflake and Spark. But when we started our journey with Airflow (a long time ago 🙂), it took us time to find the best way to manage the lifecycle of our Airflow deployments – how to iterate on our DAGs and quickly deploy changes to production.

In this article, I’m going to discuss how we at Databand.ai manage our pipeline development lifecycle, including how we deploy iterations across multiple Airflow environments – development, staging, and production. This post will talk more about development culture and less about internal technical details (we need to save something interesting for the next post!).

I hope that some of these practices help you in your everyday data engineering!

The problem we are trying to solve

Handling multiple Airflow environments is difficult, and becomes even more difficult when you try to set up a local environment in order to develop new DAGs.

Let’s assume I want to develop a new piece of Python code that enriches my data during an ingestion process. This Python code is part of a broader pipeline with different kinds of tasks. It is called from a PythonOperator, which is part of a DAG that also includes a JVM Spark job called from the SparkSubmitOperator, and some analytical queries run by SnowflakeOperators on our Snowflake database.

With a DAG like this, which involves several different systems in its execution, a lot of questions come to mind:

  • How do we make sure our Python code is running as expected during the development process?
  • How do we validate that it works in a real environment before we push it to production?
  • How do we quickly discover and remedy issues?
  • How do we make sure that what worked locally will also work in production?

Let’s try to answer those questions.

Folder structure

This is the folder structure we are using in our pipelines repository:

.

Let’s start with Airflow. We have our DAGs folder, containing all the logic of how to schedule and orchestrate our data pipelines. This folder contains only DAGs and Operators. You will not find the source code of the Python callables here.

You can also see the docker-compose and helm folders. We found docker-compose to be a great tool for deploying a local development environment, and helm to be a good fit for our staging and production environments (I’ll come back to that later).

Next are the three business logic folders: one for Python (with a setup.py file, so we can import it as a module), one for tech stuff (such as SQL queries for Snowflake or another database), and one for JVM business logic (Java or Scala files, plus the Gradle configuration).

In these folders we keep our business logic – the code that actually does the ingesting and data enrichment. Structuring the folders in this hierarchy makes it easy to run those tasks separately from Airflow if we want to test them locally, or to run them from a different orchestrator such as Azkaban or Luigi, or even from a cron job. It is important for us to treat the Python business logic as a package so that it is easy to import from anywhere.
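To make the idea of Airflow-independent business logic concrete, here is a minimal sketch. The module path, the function name `enrich_records`, and the fields it adds are all hypothetical and chosen only for illustration; a callable handed to Airflow could be a thin wrapper around a function like this one that also handles reading and writing the data.

```python
# python_business_logic/enrich.py (hypothetical module path)
from typing import Dict, Iterable, List


def enrich_records(records: Iterable[Dict], source: str = "ingest") -> List[Dict]:
    """Return a copy of each record with illustrative enrichment fields added."""
    enriched = []
    for record in records:
        row = dict(record)  # copy so the caller's data is not mutated
        row["source"] = source
        row["name_length"] = len(str(row.get("name", "")))
        enriched.append(row)
    return enriched


if __name__ == "__main__":
    # No Airflow import anywhere: this runs from a shell, a cron job,
    # a unit test, or any other orchestrator.
    print(enrich_records([{"name": "alice"}]))
```

Because nothing in the function depends on Airflow, it can be unit-tested in milliseconds before it is ever wired into a DAG.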

In our Airflow DAG file, our code will look something like this:

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    from python_business_logic import enrich_data

    with DAG(
        dag_id="ingest_data_example",
        schedule_interval="0 1 * * *",  # Daily at 01:00
        default_args=DEFAULT_ARGS,  # assumed to be defined elsewhere in this file
    ) as ingest_data_example:
        # Named enrich_data_task so the imported callable is not shadowed
        enrich_data_task = PythonOperator(
            task_id="enrich_data",
            python_callable=enrich_data,
        )

Pipeline Development Cycle

Our development cycle has four steps, which we found to be both the most effective and the most efficient:

  1. Local development
  2. Run in docker-compose environment
  3. Run in staging environment
  4. Release to production

This four-step process lets us identify problems before they reach production, fix them quickly, and keep everything contained.

Of course, we don’t want to add too much overhead to the testing phase. If our environments are set up correctly, testing in each of them should not take more than a few minutes.

Local development: The obvious step of developing our Python code locally and making sure it runs in our local Airflow environment.

Docker-compose env: After finishing developing our pipelines locally and making sure everything works, we spin up this environment using the same docker image we use in our staging and production environments. This step helps us to easily locate and fix problems in our pipeline that we didn’t encounter during the development process, with no need to wait for a k8s deployment or CI pipelines to run.

It is really easy to debug our DAG in a semi-production system with docker-compose, and to fix issues quickly and test again.

Staging environment: Our staging environment is almost the same as our production one. It runs on a Kubernetes cluster deployed with Helm 3. The difference is that we create a staging environment for each git branch. Once development is finished and everything works as expected in our docker-compose environment, the developer pushes their code to the GitLab repository and opens a Merge Request. Our CI/CD then builds an image with the new code and deploys it to a cluster in a dedicated namespace. We use git-sync for our DAG folder in this environment, so we can quickly fix issues we find here without waiting for a CD job to deploy our image again. When the code is merged, this environment is deleted.
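A per-branch staging environment needs a namespace name derived from the branch, and Kubernetes requires namespace names to be valid DNS labels: lowercase alphanumerics and hyphens, at most 63 characters, starting and ending with an alphanumeric character. The sketch below shows one way a CI job could do that mapping. The function name and the `airflow-staging` prefix are assumptions made for illustration, not a description of any particular CI setup.

```python
import re

MAX_LABEL_LENGTH = 63  # Kubernetes namespace names must fit in a DNS label


def namespace_for_branch(branch: str, prefix: str = "airflow-staging") -> str:
    """Derive a Kubernetes-safe namespace name from a git branch name."""
    # Lowercase, then collapse every run of disallowed characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    name = f"{prefix}-{slug}" if slug else prefix
    # Truncate to the label limit and make sure the name does not end with a hyphen.
    return name[:MAX_LABEL_LENGTH].rstrip("-")
```

For example, the branch `feature/Enrich_Data-v2` maps to `airflow-staging-feature-enrich-data-v2`. Note that truncation means two very long branch names could map to the same namespace; appending a short hash of the branch name would be one way to avoid that.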

Production environment: This is the step where we publish our DAGs to production. The developer merges their changes into master, and the CI/CD process picks up the new changes and deploys them to production. We have reached a point where it is very rare for a problem to show up in our production pipelines that we could not have caught in an earlier step.

Wrapping things up with some concrete tips:

  • Structure your folder in a way that your business logic is isolated and accessible whether you want to run it from Airflow or from another orchestration system.
  • Use docker-compose for development, and Kubernetes with Helm for staging and production deployment. Using a semi-production environment with docker-compose makes it so much easier to locate and fix issues with your pipelines and keep the development cycle short and effective.
  • As part of your CI/CD, make sure you create a staging environment for each Merge Request. In this way you’ll be able to test your DAGs before they go to production in an almost equivalent environment.
  • In your staging environment, make sure your DAG folder is git-synced with the deployment branch. This is how you quickly fix issues in your staging environment without waiting for a CI/CD pipeline to finish.
  • Set up your environments correctly to ensure this process is efficient and not time-consuming. Use the docker-compose and Helm best practices and share the configuration between them.
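As an illustration of the last tip, one way to share configuration is to keep Airflow settings in a single place and render them for each tool. The sketch below is an assumption about how that could look: the `SHARED_SETTINGS` dictionary, both helper functions, and the example path are hypothetical. It relies only on Airflow's documented `AIRFLOW__{SECTION}__{KEY}` environment variable convention and on the name/value shape Kubernetes uses for container environment variables; whether a given Helm chart accepts such a list in its values is chart-specific.

```python
from typing import Dict, List

# Hypothetical shared settings; sections and keys follow Airflow's config layout.
SHARED_SETTINGS: Dict[str, Dict[str, str]] = {
    "core": {
        "load_examples": "False",
        "dags_folder": "/opt/airflow/dags",  # illustrative path
    },
}


def to_airflow_env(settings: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Flatten settings into AIRFLOW__{SECTION}__{KEY} environment variables."""
    return {
        f"AIRFLOW__{section.upper()}__{key.upper()}": value
        for section, options in settings.items()
        for key, value in options.items()
    }


def to_k8s_env_list(env: Dict[str, str]) -> List[Dict[str, str]]:
    """Render the variables as the name/value list used for container env."""
    return [{"name": name, "value": env[name]} for name in sorted(env)]
```

The mapping returned by `to_airflow_env` has the shape of a docker-compose `environment:` block, and `to_k8s_env_list` turns the same variables into the list form used on the Kubernetes side, so both environments are generated from one source of truth.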

Your production environment is a temple. Following the above steps will help you to make sure that all code that enters this environment was carefully tested. 

In a future post, we’ll cover more technical details of how our process works, common debugging flows, and test scenarios we like to cover for our DAGs.

Happy engineering!

– This post was written by Niv Sluzki, Databand’s experienced Software Engineer. Databand.ai is a unified data observability platform built for data engineers. Databand.ai centralizes your pipeline metadata so you can get end-to-end observability into your data pipelines, identify the root cause of health issues quickly, and fix the problem fast.
