If you’re like us, schematics for an ideal data pipeline are nice, but not always helpful. The gap between theory and practice is vast, and people online often make suggestions with no regard for realities like budget, or without knowing that you rarely get to spin up data pipeline architecture entirely from scratch.
Far more common is to inherit a lovely data pipeline mess. Or if you do get to build something from scratch, you’re working under serious constraints. For example, one software engineer who was tasked with spinning up a data division at a company learned the company didn’t have any user-level reporting within its product.
After the company was acquired, its new parent began asking why this didn’t exist. They tasked him with building it, but added, “But don’t spend any money.”
So he Googled the problem. He read books, then read the books mentioned in those books. He spent time on Stack Overflow and studied Python. And he accomplished the task, at perhaps one one-hundredth of the storage cost he would have paid had he simply built on what Azure provided.
In this guide, we share advanced-level strategies for managing data pipelines in the real world, so you appear to be the data ninja your team already thinks you are.
What is a data pipeline?
This article is for data engineers, so let’s begin with a tailored definition of what a data pipeline means for us. A data pipeline is a series of connected processes that moves data from one point to another, possibly transforming it along the way. It’s linear, with sequential and sometimes parallel executions. The analogy—“a pipeline”—is also helpful in understanding why pipelines that move data can be so difficult to build and maintain.
Sometimes you come in one morning and the pipeline is down and you have no idea why. Sometimes someone knows why and you fix it in minutes. More often, the diagnosis takes days.
The challenge is the way pipelines tend to be built—opaque—just like real-life oil-bearing pipelines. You can’t peer inside. If there’s a leak somewhere, or a screwy transformation, it takes a lot of time to figure out where that’s happening, or whether the pipeline is even responsible. What if it’s an issue with an upstream data provider? Or if it is indeed your issue, are you uncovering proximal causes or root ones?
The issue is often several degrees deep. If your scheduler runs and tells you it was successful, but all values are missing, you have an issue. When you dig in, perhaps a Spark job failed. But why did it fail? This is where the real work begins—understanding all the ways things can and do go wrong in the real world so you build data pipelines that function in reality.
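One way to catch the “scheduler says success, but all the values are missing” failure described above is to validate the output itself after every run, rather than trusting the job’s exit status. Below is a minimal sketch; the function name, thresholds, and sample data are illustrative, not a real framework’s API.

```python
# Post-run validation sketch: a scheduler "success" only means the job exited
# cleanly, so verify the data itself before trusting the run.

def validate_output(rows, required_fields, max_null_rate=0.05):
    """Raise if the output is empty or a required field is mostly missing.

    `rows` is a list of dicts; the 5% null-rate limit is an arbitrary example.
    """
    if not rows:
        raise ValueError("job reported success but produced zero rows")
    for field in required_fields:
        nulls = sum(1 for r in rows if r.get(field) is None)
        null_rate = nulls / len(rows)
        if null_rate > max_null_rate:
            raise ValueError(
                f"field {field!r} is {null_rate:.0%} null "
                f"(limit {max_null_rate:.0%})"
            )

# Example: two of three rows are missing `revenue`, so the check fails loudly
# even though the upstream job "succeeded".
rows = [
    {"order_id": 1, "revenue": 9.99},
    {"order_id": 2, "revenue": None},
    {"order_id": 3, "revenue": None},
]
try:
    validate_output(rows, ["order_id", "revenue"])
except ValueError as exc:
    print(f"validation failed: {exc}")
```

A check like this turns a days-long diagnosis into an immediate, specific error at the step where the data first went wrong.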
So let’s recap. What is the purpose of a data pipeline? To move your company’s data. And how should a data pipeline work? Predictably, changing the data only in expected ways. A pipeline should be designed from the ground up to maintain data quality, or data health, along the four dimensions that matter: fitness, lineage, governance, and stability. Next, we provide ten strategies for doing this in the real world.
Ten strategies for how to build and manage a data pipeline
Below are ten strategies for how to build a data pipeline, drawn from dozens of years of our own team’s experiences. We have included quotes from data engineers, most of whom are kept anonymous to protect their operations.
1. Understand the precedent
Before you do anything, spend time understanding what came before. Know the data models of the systems that preceded yours, know the quirks of the systems you’re pulling from and importing to, and know the expectations of the business users. Call it a data audit and record your findings along with a list of questions that still need answering.
For example, at a large retailer, the most exciting thing isn’t the tool that works by itself but the one that works cooperatively with a legacy architecture and helps you migrate off it. It’s not uncommon for these teams to have 10,000 hours of work invested in some of their existing products. If someone tries something new and it fails in a big way, they may lose their job. Given the option, most would rather not touch it. For them, compatibility is everything, and so they must first understand the precedent.
2. Build incrementally
Build pieces of your pipeline as you need them in a modular fashion that you can adjust. The reason is, you won’t know what you need until you build something that doesn’t quite suit your purpose. It’s one of the many paradoxes of data engineering. The requirements aren’t clear until a business user asks for a time series that they only just now realized they need, but which is unsupportable.
3. Document your goals as you go
Your goals will continue to evolve as you build. Create a shared living document (Google Docs will do) and revisit and update it. Also ask others who will be involved in the pipeline, upstream or downstream, to document their goals as well. In our experience, everyone is going to tend to presume others are thinking what they’re thinking. It’s only by documenting that you realize someone wants a metric that, say, includes personally identifiable information (PII) and so is not allowed.
4. Build to minimize cost
Costs will always be higher than you expect. We have never met an engineer who said, “And to our great surprise, it cost half as much as we first thought.” When planning spend, all the classic personal finance rules apply: Overestimate costs by 20%, don’t spend what you don’t yet have, avoid recurring costs, and keep a budget.
If there are components that will need to grow exponentially, and you can pull them off a paid platform and run them for (nearly) free, that may be the key to accomplishing twice as much with this pipeline, and to building more.
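The compounding math behind that advice is worth making concrete. A back-of-envelope projection like the sketch below (all prices and growth rates here are made-up assumptions, not quotes from any vendor) shows how quickly a fast-growing component dominates the bill, and why moving it off a paid platform can pay for everything else.

```python
# Back-of-envelope cost projection: a component whose data grows ~20% per
# month compounds quickly, so small per-GB price differences matter a lot.
# All numbers below are hypothetical assumptions for illustration.

def projected_monthly_cost(start_gb, monthly_growth, price_per_gb, months):
    """Storage cost in the final month, after compounding growth."""
    size_gb = start_gb * (1 + monthly_growth) ** months
    return size_gb * price_per_gb

# Hypothetical: 500 GB today, 20% monthly growth, projected 24 months out.
managed = projected_monthly_cost(500, 0.20, 0.023, 24)  # managed-platform rate
diy = projected_monthly_cost(500, 0.20, 0.004, 24)      # self-managed cold rate
print(f"managed: ${managed:,.0f}/mo   self-managed: ${diy:,.0f}/mo")
```

Running a projection like this for each component, before you build, tells you which ones are worth the extra engineering effort to keep off metered platforms.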
Even as data lake providers launch features like cost alerts and budgetary kill-switches, the principle remains: build to minimize cost from the very beginning.
5. Identify the stakes and tolerance
High-stakes, low-tolerance systems require careful planning: a rocket carrying human lives into space, for example. But in the data world, most decisions are reversible. That means it is often cheaper, in time and effort, to simply try something and revert than to agonize over the decision for weeks.
For an ecommerce company, the stakes might at first seem low. But after talking to business users, you might learn that the downstream effects of a data error could make millions of products appear available in a store when they’re not, creating a web of errors and missed expectations you can’t easily untangle.
Knowing the stakes and tolerance tells you how much “breaking” you can afford to do.
6. Organize in functional work groups
Create working groups that include an analyst, a data scientist, an engineer, and possibly someone from the business side. Have them focus on problems as a unit. It’s far more effective. If they simply worked sequentially, tossing requirements over the fence to one another, everyone would eventually grow frustrated, there’d be a lot of inefficient ‘work about work,’ and things would take forever. Functional groups tend to build better data pipelines that cost less.
This approach also gives data engineers a seat at the table when decisions are being made, so they can vet ideas at the outset. If all they do is wait for notebooks from the data scientist, they’ll often discover the notebooks don’t work, and they’ll either have to send them back or rewrite them themselves. Or, they’ll find that other teams continuously ask for columns that are derivable from other data, but which must be transformed.
“A constant challenge is ensuring my data engineers have a good contract with data scientists and know how to take products from them and smoothly integrate them into the system. Even with pods, it’s not always smooth.” —Data Engineering Team Lead
7. Implement data pipeline monitoring and observability tools
Some tools help you keep costs low, and observability tools fall into that category. They provide instrumentation to help you understand what’s happening within your pipeline. Without highly specific answers to questions around why data pipelines fail, you can spend an inordinate amount of time diagnosing the proximal and root causes of pipeline issues.
“Observability” is a bit of a buzzword these days, but it serves as an umbrella term to encompass:
- Monitoring—a dashboard that provides an operational view of your pipeline or system
- Alerting—alerts, both for expected events and anomalies
- Tracking—ability to set and track specific events
- Comparisons—monitoring over time, with alerts for anomalies
- Analysis—anomaly detection that adapts to your pipeline and data health
- Next best action—recommended actions to fix errors
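The “Comparisons” and “Analysis” items above can start very simply: compare today’s pipeline metrics against recent history and alert on sharp deviations. Here is a minimal z-score sketch using only the standard library; the row counts and threshold are hypothetical examples, and production observability tools do far more than this.

```python
# Illustrative anomaly check: flag today's row count if it deviates sharply
# from the mean of recent daily counts. Threshold and data are made up.
import statistics

def is_anomalous(history, today, z_threshold=3.0):
    """Return True if `today` lies more than `z_threshold` standard
    deviations from the mean of `history` (recent daily row counts)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold

daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_940]
print(is_anomalous(daily_row_counts, 9_900))   # normal day: False
print(is_anomalous(daily_row_counts, 1_200))   # sudden drop: True
```

Even a crude check like this catches the silent failures (an upstream feed going dark, a partial load) that a simple success/failure status never surfaces.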
8. Use a decision tree to combat tool sprawl
Nobody wants yet another point-solution tool that you then have to maintain. Create a decision tree for your team to decide when it makes sense to add another tool versus adjust an existing one, or evaluate a platform that would consolidate several functions. It’s good for data quality too. The fewer moving pieces, the less there is to diagnose.
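A decision tree like this can even be written down as a few lines of code, which forces the team to agree on the questions and their order. The questions below are a hypothetical example; the value is in the shared agreement, not in automating the answer.

```python
# Toy encoding of a tool-adoption decision tree. The specific questions and
# their ordering are illustrative; adapt them to your team's priorities.

def tool_decision(existing_tool_covers, can_extend_existing,
                  platform_consolidates):
    if existing_tool_covers:
        return "use existing tool"
    if can_extend_existing:
        return "extend existing tool"
    if platform_consolidates:
        return "evaluate consolidating platform"
    return "consider a new point solution"

# A need an existing tool almost covers: prefer extending over adding.
print(tool_decision(False, True, False))
```

Ordering the branches so that “adjust what we have” always comes before “add something new” is what keeps the tool count, and the diagnostic surface area, small.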
9. Build your pipeline to control for all four dimensions of data quality
We’ve published a model for the four dimensions of data quality that matter to engineers—fitness, lineage, governance, and stability. These dimensions must exist in equilibrium, and you cannot maintain quality without addressing all four.
10. Document things as a byproduct of work
Also known as “knowledge-centered service”: get in the habit of documenting what you do, and at the very least, keep a log your team can access. The highest achievement for a data engineer is not being a hero the entire company depends on, but building a system so durable it outlasts you. Documentation should be intrinsic to your work.
Want to learn more about how Databand.ai can help you manage data pipelines? Request a demo to see the product in action!