If you’re like us, schematics for an ideal data pipeline are nice, but not always helpful. The gap between theory and practice is vast, and it’s common for people to make suggestions online without regard for realities like, say, budget, or without knowing that you rarely spin up data pipeline architecture entirely from scratch.
In this guide, we share advanced-level strategies for managing data pipelines in the real world, so you appear to be the data ninja your team already thinks you are.
What is a data engineering pipeline?
A data pipeline is a series of connected processes that moves data from one point to another, possibly transforming it along the way. Execution is mostly linear and sequential, though some stages may run in parallel. The analogy—“a pipeline”—is also helpful in understanding why pipelines that move data can be so difficult to build and maintain.
How should a data pipeline work? Predictably, changing data only in expected ways. A pipeline should be designed from the ground up to maintain data quality, or data health, along the four dimensions that matter: fitness, lineage, governance, and stability.
The problem with opaque data pipelines
Sometimes you come in one morning and the pipeline is down and you have no idea why. Sometimes someone knows why and you fix it in minutes. More often, the diagnosis takes days.
The challenge is the way pipelines tend to be built—opaque—just like real-life oil-bearing pipelines. You can’t peer inside. If there’s a leak somewhere, or a screwy transformation, it takes a lot of time to figure out where that’s happening, or whether the pipeline is even responsible. What if it’s an issue with an upstream data provider? Or if it is indeed your issue, are you uncovering proximal causes or root ones?
The issue is often several degrees deep. If your scheduler runs and tells you it was successful, but all values are missing, you have an issue. When you dig in, perhaps a Spark job failed. But why did it fail? This is where the real work begins—understanding all the ways things can and do go wrong in the real world so you build data pipelines that function in reality.
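A lightweight guard against the “scheduler says success, but all values are missing” scenario is to validate the output itself after every run. The sketch below is illustrative, not tied to any particular scheduler; the column names and the 10% null threshold are assumptions you would tune for your own pipeline.

```python
# Minimal sketch: a post-run check that catches "successful" jobs whose
# output is empty or riddled with nulls. Thresholds are illustrative.

def check_output(rows: list[dict], required: list[str], max_null_rate: float = 0.1) -> list[str]:
    """Return a list of problems; an empty list means the batch looks healthy."""
    problems = []
    if not rows:
        problems.append("job reported success but produced zero rows")
        return problems
    for col in required:
        nulls = sum(1 for r in rows if r.get(col) is None)
        rate = nulls / len(rows)
        if rate > max_null_rate:
            problems.append(f"column '{col}' is {rate:.0%} null (limit {max_null_rate:.0%})")
    return problems
```

Running a check like this immediately after the job, rather than waiting for a downstream consumer to complain, is what separates a proximal symptom (“the dashboard is blank”) from a diagnosable cause (“the orders table arrived empty”).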
Ten engineering strategies for designing, building, and managing a data pipeline
Below are ten strategies for how to build a data pipeline, drawn from dozens of years of our own team’s experiences. We have included quotes from data engineers, most of whom have been kept anonymous to protect their operations.
1. Understand the precedent
Before you do anything, spend time understanding what came before. Know the data models of the systems that preceded yours, know the quirks of the systems you’re pulling from and importing to, and know the expectations of the business users. Call it a data audit and record your findings along with a list of questions that still need answering.
For example, at a large retailer, the most exciting thing isn’t the tool that works by itself but the one that works cooperatively with a legacy architecture and helps you migrate off it. It’s not uncommon for these teams to have 10,000 hours of work invested in some of their existing products. If someone tries something new and it fails in a big way, they may lose their job. Given the option, most would rather not touch it. For them, compatibility is everything, and so they must first understand the precedent.
2. Build incrementally
Build pieces of your pipeline as you need them, in a modular fashion you can adjust. The reason is, you won’t know what you need until you build something that doesn’t quite suit your purpose. It’s one of the many paradoxes of data engineering. The requirements aren’t clear until a business user asks for a time series that they only just now realized they need, but which your pipeline can’t yet support.
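One way to keep a pipeline modular is to treat each stage as a plain function and compose them in a list, so stages can be added, swapped, or removed as requirements change. This is a minimal sketch under that assumption; the step names and record shapes are hypothetical.

```python
# Minimal sketch of an incrementally built pipeline: each step is a plain
# function over an iterable of records, so the chain is easy to extend.
from typing import Callable, Iterable

Step = Callable[[Iterable[dict]], Iterable[dict]]

def run_pipeline(records: Iterable[dict], steps: list[Step]) -> list[dict]:
    for step in steps:
        records = step(records)
    return list(records)

def drop_missing_ids(records):
    return (r for r in records if r.get("id") is not None)

def add_total(records):
    for r in records:
        yield {**r, "total": r["price"] * r["qty"]}

# Start with one step; append more as business users surface new needs.
steps: list[Step] = [drop_missing_ids]
steps.append(add_total)
```

The payoff is that when the unsupportable time-series request arrives, it becomes one more step appended to the list rather than a rewrite.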
3. Document your goals as you go
Your goals will continue to evolve as you build. Create a shared living document (Google Docs will do) and revisit and update it. Also ask others who will be involved in the pipeline, upstream or downstream, to document their goals as well. In our experience, everyone is going to tend to presume others are thinking what they’re thinking. It’s only by documenting that you realize someone wants a metric that, say, includes personally identifiable information (PII) and so is not allowed.
4. Build to minimize cost
Costs will always be higher than you expect. We have never met an engineer who said, “And to our great surprise, it cost half as much as we first thought.” When planning spend, all the classic personal finance rules apply: Overestimate costs by 20%, don’t spend what you don’t yet have, avoid recurring costs, and keep a budget.
If there are components that will need to grow exponentially, and you can pull them off of a paid platform and do it for (nearly) free, that may be the key to you accomplishing twice as much with this pipeline, and to building more.
Even as data lake providers launch features like cost alerts and budgetary kill-switches, the principle remains: build to minimize cost from the very beginning.
5. Identify the stakes and tolerance
High-stakes, low-tolerance systems require careful planning—think of a rocket going into space with human lives onboard. But in the data world, most decisions are reversible. That means it can often be cheaper, in terms of your time and effort, to simply try it and revert rather than agonize for weeks over the decision.
For an ecommerce company, the stakes might at first seem low. But after talking to business users, you might learn that the downstream effects of a data error could make millions of products appear available in a store when they’re not, creating a web of errors and missed expectations you can’t easily untangle.
Knowing the stakes and tolerance tells you how much “breaking” you can afford to do.
6. Organize in functional work groups
Create working groups that include an analyst, a data scientist, an engineer, and possibly someone from the business side. Have them focus on problems as a unit. It’s far more effective. If they simply worked sequentially, tossing requirements over the fence to one another, everyone would eventually grow frustrated, there’d be a lot of inefficient ‘work about work,’ and things would take forever. Functional groups tend to build better data pipelines that cost less.
This approach also gives data engineers a seat at the table when decisions are being made so they can vet ideas at the outset. If all they do is wait for notebooks from the data scientist, they’ll often discover the notebooks don’t work, and they’ll either have to send them back or rewrite them themselves. Or, they’ll find that other teams continuously ask for columns that are derivable from other data, but which must be transformed.
“A constant challenge is ensuring my data engineers have a good contract with data scientists and know how to take products from them and smoothly integrate them into the system. Even with pods, it’s not always smooth.” – Data Engineering Team Lead
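The “derivable columns” situation is usually best solved by computing the column once in the pipeline instead of letting every downstream team recompute it. A hypothetical example, assuming rows with `revenue` and `cost` fields and a requested `margin_pct` column:

```python
# Hypothetical example: deriving a requested column ("margin_pct") from
# existing ones once, centrally, so consumers don't each reimplement it.

def add_margin_pct(row: dict) -> dict:
    revenue = row["revenue"]
    margin = revenue - row["cost"]
    # Guard against division by zero; None signals "not computable" here.
    pct = round(100 * margin / revenue, 2) if revenue else None
    return {**row, "margin_pct": pct}
```

Centralizing the transformation also centralizes the edge cases (like zero revenue above), which is exactly the kind of contract the quote is describing.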
7. Implement monitoring and observability data pipeline tools
Some tools help you keep costs low, and observability tools fall into that category. They provide instrumentation to help you understand what’s happening within your pipeline. Without highly specific answers to questions around why data pipelines fail, you can spend an inordinate amount of time diagnosing the proximal and root causes of pipeline issues.
“Observability” is a bit of a buzzword these days, but it serves as an umbrella term to encompass:
- Monitoring—a dashboard that provides an operational view of your pipeline or system
- Alerting—alerts, both for expected events and anomalies
- Tracking—ability to set and track specific events
- Comparisons—monitoring over time, with alerts for anomalies
- Analysis—anomaly detection that adapts to your pipeline and data health
- Next best action—recommended actions to fix errors
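The “comparisons” and “alerting” ideas above can be as simple as tracking a metric across runs and flagging values that deviate sharply from recent history. This is an illustrative sketch, not any particular vendor’s approach; the z-score threshold of 3 is an assumption.

```python
# Illustrative sketch: flag a metric (e.g. row count) that deviates more
# than z_limit standard deviations from the recent runs' mean.
from statistics import mean, stdev

def detect_anomaly(history: list[float], latest: float, z_limit: float = 3.0) -> bool:
    """True if `latest` looks anomalous relative to `history`."""
    if len(history) < 2:
        return False  # not enough runs to compare against
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu  # perfectly stable history: any change is notable
    return abs(latest - mu) / sigma > z_limit
```

For example, if the last four runs loaded roughly 1,000 rows each, a run that loads 20 rows would trip this check even though the job itself “succeeded.” Dedicated observability tools layer adaptive baselines and lineage on top of this basic idea.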
8. Use a decision tree to combat tool sprawl
Nobody wants yet another point-solution tool that you then have to maintain. Create a decision tree for your team to decide when it makes sense to add another tool versus adjust an existing one, or evaluate a platform that would consolidate several functions. It’s good for data quality too. The fewer moving pieces, the less there is to diagnose.
9. Build your pipeline to control for all four dimensions of data quality
We’ve published a model for the four dimensions of data quality that matter to engineers—fitness, lineage, governance, and stability. These dimensions must exist in equilibrium, and you cannot maintain quality without addressing all four.
10. Document things as a byproduct of work
Also known as “knowledge-centered service”: get in the habit of documenting what you do, and at the very least, keep a log your team can access. The highest achievement for a data engineer is not being a hero that the entire company depends on, but constructing a system that’s so durable it outlasts you. Documentation should be intrinsic to your work.
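One cheap way to make documentation a byproduct rather than a chore is to have the pipeline log its own behavior. A small sketch, assuming plain Python steps; the step shown is hypothetical:

```python
# Sketch: a decorator that records each pipeline step's name, record count,
# and duration, so a run log accumulates as a byproduct of normal work.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def logged_step(fn):
    @functools.wraps(fn)
    def wrapper(records, *args, **kwargs):
        start = time.monotonic()
        result = list(fn(records, *args, **kwargs))
        log.info("%s: %d records in %.3fs", fn.__name__, len(result), time.monotonic() - start)
        return result
    return wrapper

@logged_step
def dedupe(records):
    seen, out = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out
```

A teammate reading the run log can see what each step did and how long it took without anyone having written a word by hand.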
Sometimes, you need to move fast and break things to meet a deadline. While that may make your data consumers happy in the short term, they won’t be happy when it all comes crumbling down under the weight of technical debt. More often than not, a little planning up front and following these best practices can avoid a lot of headaches down the road. These best practices won’t help you avoid the fickle nature of data altogether, though; for that, you need a data observability platform to catch anomalies and data quality issues as they crop up.
Databand.ai is a unified data observability platform built for data engineers. Databand.ai centralizes your pipeline metadata so you can get end-to-end observability into your data pipelines, identify the root cause of health issues quickly, and fix the problem fast. To learn more about Databand and how our platform helps data engineers with their data pipelines, request a demo!