6 data integration principles for data engineers to live by
Today, there is more. More data sources in the form of vendors, products, tools, third parties, and devices. More data complexity in terms of size and velocity. And more data demands from consumers. Yet there’s only you and your team to integrate it all.
Where do you begin? How do data integration tools like Singer, Meltano, and Airbyte compare? How will you ensure the integrity and quality of the data at scale?
To all of these questions, there are no universal answers. Only generally accepted principles that can (hopefully) consistently guide you to the right choices. In this article, we’ll explore those data integration principles, as well as some of the enterprise data integration tools you may need to make it happen.
What is data integration, again?
Data integration is the umbrella term for much of what we do as data engineers. It’s the process and method of identifying data, ingesting it, transforming it, and making it available to consumers. It’s “integrating” new and old data within and without the business and making it ready for use. If data integration teams had a Latin motto, it’d be similar to that of the United States: E Pluribus Unum, or “Out of many, one [table].”
Yet unlike, say, data ingestion, which is just one part of data integration, integration carries through into the analysis phase of data engineering. This means it touches on the visualization layer and what happens in the business intelligence (BI) tool. Thus, it carries more responsibility for data outcomes.
Data integration has gone through a number of fads over the past decade, so you can be forgiven for not being up to speed on the latest. Today, the extract, load, transform (ELT) approach is in favor, partially because analytics costs have plummeted. Fewer teams worry about a surprise check from AWS or GCP. And also, ELT allows you to prepare for unforeseen future use cases, so doing less can mean more flexibility. The result is people are dumping more raw data into lakes for later analysis and transformation.
And because data integration carries through to the visualization layer, data engineers are responsible to those end consumers. Consumers need the data to be accurate, available, and as expected. But if your enterprise is drawing from thousands of data sources and running thousands of directed acyclic graphs (DAGs), serving all those consumers becomes quite a challenge. You cannot be hand-coding those integrations every time and still maintain them.
That’s why pipeline automation tools like Airflow and Spark exist—to help you run and manage many DAGs and integrate data with precision. It’s also why observability platforms like Databand exist: to help you understand the character and quality of your data at every stage, so you’re delivering on your data SLAs, and achieving universal-ish accessibility to support everyone’s products and projects.
6 data integration principles to live by
With the above definition in mind, here are six principles to integrate smarter.
1. Never integrate without justification
Don’t integrate data if you can’t articulate its purpose. Said another way, protect your existing data and force every data integration request through a battery of logical scrutiny. Otherwise, inconsistencies will spread like radioactive material and cause all else to decay.
When the definitive history of big data is written, it’ll be divided into two sections: the period when every business captured everything possible, and the period following, when companies realized that wasn’t a great idea. In that earlier phase, we saw initiatives like master data management (MDM) and customer data platforms (CDPs) where the storage systems grew bloated with such great volumes of irrelevant and unstructured data, they became unusable.
That bloat is a big reason why a Gartner report a few years ago found that only 20% of data initiatives were ever completed, and only 8% drove discernible value.
Modern data engineers know better, and require justification before ingesting new data sources. Requiring other teams to complete a brief with each request forces line-of-business managers and data scientists to think through the utility, merits, and downsides of their asks, and to weigh the tradeoff against your team’s time and the stability of the overall system.
2. Perform quality checks even when you ELT
Even if you’re just loading data for later transformation (ELT), you’ll still want to do some light transformation to assess and ensure quality—for example, checking for expected columns and unintentional null values.
It’s useful to take the data “ecosystem” analogy here literally—if you think about all the data passing through the organization as being alive, and it sometimes contains errors (viruses), you need checkpoints and quarantines. Otherwise, if you have a highly networked data environment with countless highly variable sources and thousands of DAGs—and you aren’t constantly checking for data quality—you have a system that’s highly susceptible to data error disease.
(Read: If it’s in your warehouse, it’s too late.)
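A quality gate of this kind doesn’t need to be elaborate. Here’s a minimal sketch in plain Python—the column names and rules are illustrative, not a prescription—that splits a batch into clean rows and quarantined rows before anything is loaded:

```python
# A minimal pre-load quality gate: rows that fail are quarantined,
# not silently loaded. Column names and rules are illustrative.
EXPECTED_COLUMNS = {"order_id", "customer_id", "order_date", "amount"}
REQUIRED_NOT_NULL = {"order_id", "order_date"}

def check_row(row: dict) -> list[str]:
    """Return a list of quality violations for one record."""
    errors = []
    missing = EXPECTED_COLUMNS - row.keys()
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
    for col in REQUIRED_NOT_NULL:
        if row.get(col) in (None, ""):
            errors.append(f"null value in required column: {col}")
    return errors

def quality_gate(rows: list[dict]):
    """Split a batch into clean rows and quarantined (row, errors) pairs."""
    clean, quarantined = [], []
    for row in rows:
        errors = check_row(row)
        if errors:
            quarantined.append((row, errors))
        else:
            clean.append(row)
    return clean, quarantined
```

The point isn’t the specific rules—it’s that the checkpoint sits before the warehouse, so bad records are caught at the border rather than discovered downstream.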
The challenge is, data pipeline tools like Airflow and Spark can’t check for data quality. They’re designed to tell you whether the job ran correctly—but a job can run correctly while the data is entirely corrupted. This often happens when ingesting legacy internal data that nobody thought to double-check and that has missing dates or columns. Or it happens with external vendor sources that lack the granularity your destination system is expecting, or when ingesting barely structured tranches straight from IoT devices.
But that’s why data observability tools (like Databand) exist: They allow you to sample data flows for health at every stage of the pipeline, and to:
- Set up alerts with anomaly detection
- Automatically pause jobs with serious data issues
- Group co-occurring errors to help you isolate root causes
- Assess quality along custom parameters
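To illustrate the first item, a row-count anomaly check can be as simple as a z-score against recent batch history. This is a hand-rolled sketch for intuition, not how any particular observability platform implements it:

```python
import statistics

def is_anomalous(history: list[int], latest: int, threshold: float = 3.0) -> bool:
    """Flag a batch whose row count deviates more than `threshold`
    standard deviations from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # No historical variance: any deviation at all is suspicious.
        return latest != mean
    return abs(latest - mean) / stdev > threshold
```

Wire a check like this into an alert and you catch the “job succeeded, data collapsed” failure mode that orchestrators alone will miss.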
3. Chunk up your pipelines into stages for debugging
Following on from the prior point, build for modularity. A pipeline that’s chunked up into many discrete segments is much more easily diagnosed and debugged, for reasons that are probably obvious—a monolithic pipeline has to be checked from end-to-end.
This is particularly important if you get your data from a wide variety of sources. An ideal and highly stable set of pipelines draws from very few sources of well-understood provenance with high data integrity. The most difficult pipelines draw from the opposite—many sources, unknown provenance, and questionable integrity. And the latter type is what most of us are working with.
As a corollary to this point: Normalize and QA those sources in pre-pipeline layers before they interact with anything important.
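To make the modularity idea concrete, here’s a minimal sketch of a pipeline chunked into named stages in plain Python. The stage names and logic are illustrative; the payoff is that a failure names the segment, so debugging starts in the right place instead of end to end:

```python
from typing import Any, Callable

# A pipeline as an ordered list of (name, function) stages.
Stage = tuple[str, Callable[[Any], Any]]

def run_pipeline(data: Any, stages: list[Stage]) -> Any:
    for name, fn in stages:
        try:
            data = fn(data)
        except Exception as exc:
            # The failing stage is named in the error, so diagnosis
            # starts at the right segment rather than the whole DAG.
            raise RuntimeError(f"pipeline failed at stage '{name}': {exc}") from exc
    return data

stages = [
    ("extract", lambda _: [" 42 ", "7", "oops"]),
    ("normalize", lambda rows: [r.strip() for r in rows]),
    ("validate", lambda rows: [r for r in rows if r.isdigit()]),
    ("load", lambda rows: [int(r) for r in rows]),
]
```

Each stage is also independently testable—the same property you get from discrete tasks in an orchestrator like Airflow.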
4. Set data SLAs and have an incident management plan
Who’s expecting what sort of data from you, and when? That’s a foundational question to your enterprise data integration strategy, and one data engineers probably spend less time pondering than they should. It’s all too easy to get drawn into the mechanics of your data fiefdom and to lose sight of the fact that you serve data consumers and should be putting their needs first.
That’s why it’s a good idea to:
- Establish internal data SLAs—it’ll keep you sharp and aligned with consumers
- Publish a data incident management plan—so when things go wrong everyone knows precisely their role and how to act
5. Establish centralized processes and definitions
Publish a data dictionary, and establish a way to centrally manage schemas, even if all you can do right now is create a shared spreadsheet. Do all of this earlier than you think. The moment you want to scale your systems will also be the moment you’ve become too busy to document what you’ve done or train others. This will lead to data drift, and those slight schema shifts will mean one of two things:
- You’ll become forever reliant on one-off coding and tribal knowledge to integrate repositories, or
- You’re in for an awfully painful data cleanup project.
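Even a shared spreadsheet can graduate into something machine-checkable. Here’s a minimal sketch of a central schema “dictionary” checked against incoming records—the table and column names are illustrative:

```python
# One authoritative definition per table, consulted before data lands.
# Table and column names are illustrative.
SCHEMAS = {
    "orders": {"order_id": int, "customer_id": int, "amount": float},
}

def schema_drift(table: str, row: dict) -> list[str]:
    """Report columns that have drifted from the registered schema."""
    expected = SCHEMAS[table]
    problems = []
    for col in expected.keys() - row.keys():
        problems.append(f"missing column: {col}")
    for col in row.keys() - expected.keys():
        problems.append(f"unregistered column: {col}")
    for col, typ in expected.items():
        if col in row and not isinstance(row[col], typ):
            problems.append(f"type drift in {col}: expected {typ.__name__}")
    return problems
```

Because the definitions live in one place, a schema change becomes a visible, reviewable edit rather than tribal knowledge scattered across pipelines.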
6. Automate what you can so you can focus on what matters
This one mostly speaks for itself. Bright as humans are, and as necessary as we are as tastemakers, dreamers, and question askers, some elements of data integration are best left to the machines. If you can automate orchestration, data quality checks, anomaly detection, and column-level profiling, you and your engineers are freer to work on higher-value tasks. For example, measuring the success of the data platform. Or discovering opportunities to optimize, or building to reduce technical debt.
But of course, while you should trust your automation, you should also verify it. For that, an observability platform can be invaluable.
Data integration tools
What data integration tools will you need to get the job done? Don’t be fooled by all that talk of “a modern data tech stack”—most of the examples you hear about come from no-code/low-code use cases (e.g. helping consumers make quick decisions with Salesforce data) or analytics use cases. Collectively, all those articles you see on Medium share a bias: they’re not written for you. A modern data engineer’s tech stack looks different, and can involve any combination of the following tools.
Pipeline workflow management and orchestration
- Apache Airflow, Dagster, Prefect—open-source pipeline workflow management
- Jenkins, GitLab, CircleCI, Argo—DevOps orchestration tools
- Kafka, Beam, Flink—streaming systems
- Apache Spark—open-source distributed data processing system
- Google Dataflow—managed data streaming system built on Apache Beam
- dbt—SQL-based data transformation tool
Data ingestion tools
- Fivetran—managed enterprise data integration tool
- Singer—open-source, JSON-based ETL specification and tooling
- Meltano—open-source ELT platform built around Singer
- Airbyte—open-source ELT platform with a managed cloud offering
- S3, Azure Blob Storage, Google Drive, or GCS—cloud-based data lake
Becoming your own data integration specialist
What’s the best enterprise data integration architecture? It’s the one that works for you. ELT or ETL, Airflow or Spark, and no-code or pro-code architecture are all considerations, with no universal answer. But what does hold true is that if you follow the six principles—if you demand justification, consistently cleanse, build for modularity, set data SLAs, centralize schemas, and automate anything that doesn’t demand your attention—you’ll be in great shape.
And to know whether you’re in great shape, and to stay in great shape, there’s observability.