How to evade tech debt jail while building data pipelines

2021-08-02 09:53:55

Data tech debt is something you can feel without ever having to see it. 

You can get a gut feel that the data pipelines are wonky—that transformations are occasionally inexplicable, or that the resultant values “seem off.” You know there’s a ghost in the machine. But it’s only with observability tools that you can find the root causes. And once you identify them, you’ll realize that “debt” is a nearly perfect analogy. 

Like monetary debt, tech debt always sounds like a good idea at the time. You think, “I’ll get the thing now, pay for it later, and we’ll be fine.” But you can never understand the true cost upfront. Delayed decisions saddle your future self with compounding obligations. Before long, you can find yourself in data debt jail: forever paying the interest in the form of trouble tickets, unable to address the principal.

So how do you avoid tech debt while building data pipelines? You start by understanding how it gets in there in the first place.

What is the purpose of a data pipeline? 

To understand how tech debt works its way into data pipelines, it’s helpful to return to the basics. A data pipeline is a set of programmed actions that extract data from a source, transform it, and load it into a destination. It’s data in motion. And things in motion have a habit of shifting.

Anywhere there are variable elements, and you’re time-constrained and forced to make tradeoffs, debt can slip in. 

The three big areas of data pipeline variability are:

  1. The data pipeline changes the data: Transformations are inherent—whether before or after you load—and those changes introduce the potential for error.
  2. The data itself changes: If a partner changes their API or schema, that data may not be delivered at all, or may be delivered incorrectly.
  3. The data pipeline itself changes: You’ll likely develop and improve your pipeline. Each new stage, transformation, or source introduces the potential for error.
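To make the second kind of variability concrete, here is a minimal sketch of a schema drift check that runs on each incoming record. The `EXPECTED_SCHEMA` fields, the `find_schema_drift` helper, and the sample record are all hypothetical, invented for illustration; a real pipeline would likely lean on a dedicated validation library instead.

```python
from typing import Dict, List

# Hypothetical expected schema for an upstream "orders" feed (an assumption).
EXPECTED_SCHEMA: Dict[str, type] = {"order_id": int, "amount": float, "currency": str}


def find_schema_drift(record: dict, expected: Dict[str, type] = EXPECTED_SCHEMA) -> List[str]:
    """Describe every way a record deviates from the expected schema."""
    problems = []
    # Fields we expect: are they present, and of the right type?
    for name, expected_type in expected.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}, got {actual}"
            )
    # Fields we did not expect: often the first sign of an upstream rename.
    for name in record:
        if name not in expected:
            problems.append(f"unexpected field: {name}")
    return problems


# A partner renames "amount" to "total" and starts sending order_id as a string.
drifted = {"order_id": "1001", "total": 25.0, "currency": "USD"}
for problem in find_schema_drift(drifted):
    print(problem)
```

Returning a list of problems, rather than raising on the first one, lets you report the whole shape of the change in a single alert.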

And all these changes collide and interact, like molecules in a storm cell. They create a system that’s not merely complicated (many moving parts) but complex (many interrelated parts). That makes pipelines a recursive problem, compounded by people.

If an error introduced at extraction leads to null values that lead to wrong (but not incomplete) data on an end dashboard, someone may make a decision based on it. Let’s say someone on the product team pulls a proverbial “break glass in case of emergency” lever and calls everyone on duty to react to a steep usage dropoff that didn’t really happen. Now, you cannot simply fix the extraction error. You have to fix your system to guard against such errors, but also accept that your product team now has data trust issues. Those issues may make future alerts ineffective, and if people aren’t using the data, the data and pipeline system can decay. The errors compound and cascade. 

For that reason, knowing exactly what goes wrong and where, and catching it early (and preempting it before anyone else knows) is core to building data pipelines.
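One way to catch the null-value scenario above before it reaches a dashboard is a quality gate between extraction and load. The sketch below is an assumption-laden illustration: `DataQualityError`, `null_rate`, `check_null_rate`, the sample batch, and the 10% threshold are all hypothetical, not part of any particular tool.

```python
from typing import Dict, List


class DataQualityError(Exception):
    """Raised when a batch fails a quality check and should not be loaded."""


def null_rate(rows: List[Dict], column: str) -> float:
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        # An empty batch has no nulls; whether an empty batch is itself
        # an incident is a separate check.
        return 0.0
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows)


def check_null_rate(rows: List[Dict], column: str, max_rate: float) -> None:
    """Stop the pipeline if too many values in `column` are null."""
    rate = null_rate(rows, column)
    if rate > max_rate:
        raise DataQualityError(
            f"{column}: null rate {rate:.0%} exceeds allowed {max_rate:.0%}"
        )


# Hypothetical batch where an extraction bug dropped half the user IDs.
batch = [
    {"user_id": 1, "event": "login"},
    {"user_id": None, "event": "login"},
    {"user_id": 3, "event": "click"},
    {"user_id": None, "event": "click"},
]
try:
    check_null_rate(batch, "user_id", max_rate=0.10)
except DataQualityError as err:
    print(f"Batch rejected before load: {err}")
```

Failing loudly at this point means the error surfaces to the data team first, not to the product team reading the dashboard.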


How do you create reliable data pipelines? 

To create reliable data pipelines, we recommend following five steps in the planning stages. Debt is easiest to eradicate before it exists. One of the most helpful places to begin? By drawing your data pipeline.

1. First, diagram your strategy

Draw a diagram of your data pipeline architecture, whether in PowerPoint, Miro, or on actual, physical paper. The value of this exercise is that you may find some areas are difficult to draw. Perhaps you leave a big question mark. Those are areas to investigate. What are the hidden dependencies? What’s missing from your understanding?

Specifically, use that diagram to define:

  • The questions users can answer with this data
  • What exists upstream
  • Dependencies at each stage
  • Systems and tools at each stage (current or desired)
  • Functional changes
  • Non-functional changes
  • Data owners at each stage (and who needs to be notified)
  • Other considerations when building data pipelines
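If you prefer to keep the diagram next to the code, the same exercise can be sketched as data. The example below is only an illustration: the `Stage` class, the `find_gaps` helper, and every stage, tool, and owner name are hypothetical. The point is that the "big question marks" become something you can list automatically.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Stage:
    name: str
    tool: Optional[str] = None      # system or tool at this stage
    owner: Optional[str] = None     # who needs to be notified
    depends_on: List[str] = field(default_factory=list)


def find_gaps(stages: List[Stage]) -> Dict[str, List[str]]:
    """Return, per stage, the parts of the diagram that are still question marks."""
    known = {stage.name for stage in stages}
    gaps: Dict[str, List[str]] = {}
    for stage in stages:
        issues = []
        if stage.tool is None:
            issues.append("no tool recorded")
        if stage.owner is None:
            issues.append("no owner recorded")
        for dep in stage.depends_on:
            if dep not in known:
                issues.append(f"depends on undocumented stage: {dep}")
        if issues:
            gaps[stage.name] = issues
    return gaps


# Hypothetical three-stage pipeline with one hidden dependency.
pipeline = [
    Stage("extract", tool="partner API client", owner="integrations team"),
    Stage("transform", tool="Spark job", depends_on=["extract", "currency_rates"]),
    Stage("load", tool="warehouse loader", owner="analytics team", depends_on=["transform"]),
]
for name, issues in find_gaps(pipeline).items():
    print(name, "->", issues)
```

Here the transform stage has no owner and depends on a `currency_rates` stage nobody has drawn yet, which is exactly the kind of area the diagram is meant to expose.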

Don’t get too caught up comparing your data pipeline architecture diagram to someone else’s at a different company. Each is as unique as each business. As outdoor adventurers say, there is no bad weather—only bad equipment. In data engineering, there are no bad data pipeline tools—only wrong applications. Don’t hate the tool, hate the use case.

Pictured, an example of different tools you can use at different stages of your data stack:

[Image: data pipeline architecture chart]

2. Build for data quality

Building data pipelines for data quality means starting with the assumption that your pipeline needs to guarantee data fitness, lineage, governance, and stability. This is not a common approach. Without the understanding that quality matters most, teams tend to build data pipelines for throughput. The question they ask is “Can we get the data there?” and not “Can we get high-quality data there?”

Prioritizing quality can change how you weigh the importance of storing events and states against latency. Building for quality starts in your data architecture diagram.

3. Build for continuous integration and deployment (CI/CD)

Testing is cheap, and collaboration and versioning tools like Git and GitLab mean you really have no excuse not to practice CI/CD when building data pipelines. It’s a best practice, and given how time-sensitive and chaotic data quality issues are, debt will accrue while you’re waiting for the release window.
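As a rough sketch of what “testing is cheap” can look like, here is a small transformation with tests that a CI job could run on every commit. The `cents_to_dollars` function is hypothetical, and the tests are plain functions with `assert` statements, written so that a runner such as pytest could collect them; the choice of runner is an assumption.

```python
from typing import Optional


def cents_to_dollars(cents: Optional[int]) -> Optional[float]:
    """Hypothetical transformation: convert integer cents to dollars, passing nulls through."""
    if cents is None:
        return None
    if cents < 0:
        raise ValueError(f"negative amount: {cents}")
    return round(cents / 100, 2)


def test_converts_cents():
    assert cents_to_dollars(1999) == 19.99


def test_preserves_nulls():
    # Nulls should stay null rather than silently becoming 0.
    assert cents_to_dollars(None) is None


def test_rejects_negative_amounts():
    try:
        cents_to_dollars(-5)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a negative amount")
```

Each test pins down one decision about the transformation, so a later change that quietly alters null handling fails in CI and not in production.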

4. Build to debug the process, not just the code

Build full pipeline observability into the architecture from day one. As we’ve discussed before, “Building an airplane while in flight” is not the right analogy. “Building an architecture” is. Your pipeline needs to be built to track and monitor every component so you can isolate incidents. You need context for system metrics and a deeper view of operations. You need alerts for when things go wrong, in Slack or via PagerDuty, so you can address and correct them before the debt accrues. 

Specifically, an observability tool (like Databand) can provide: 

  • Alerts on performance and efficiency bottlenecks before they affect data delivery
  • A unified view of your pipeline health, including logs, errors, and data quality metrics
  • Seamless connection to your data stack
  • Customizable metrics and dashboards
  • Fast root cause analysis to resolve issues when they are found
  • Insights into building data pipelines
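To illustrate what tracking every component can mean at the smallest scale, here is a hand-rolled sketch of a decorator that records duration, row counts, and status for each stage and fires an alert on failure. It is not how Databand or any other tool works internally; `observed`, `RUN_LOG`, and `send_alert` are hypothetical names, and the alert hook only prints where a real one might post to Slack or PagerDuty.

```python
import time
from functools import wraps
from typing import Callable, Dict, List

RUN_LOG: List[Dict] = []  # in-memory stand-in for a metrics store


def send_alert(message: str) -> None:
    """Hypothetical alert hook; a real one might post to Slack or PagerDuty."""
    print(f"ALERT: {message}")


def observed(stage_name: str, alert: Callable[[str], None] = send_alert):
    """Wrap a stage that maps a list of rows to a list of rows and record what happened."""
    def decorator(func):
        @wraps(func)
        def wrapper(rows):
            entry = {"stage": stage_name, "rows_in": len(rows)}
            start = time.perf_counter()
            try:
                result = func(rows)
            except Exception as exc:
                entry.update(status="failed", error=repr(exc))
                alert(f"{stage_name} failed: {exc!r}")
                raise
            else:
                entry.update(status="ok", rows_out=len(result))
                return result
            finally:
                # Runs on success and failure, so every run leaves a record.
                entry["seconds"] = time.perf_counter() - start
                RUN_LOG.append(entry)
        return wrapper
    return decorator


@observed("drop_null_amounts")
def drop_null_amounts(rows):
    return [row for row in rows if row.get("amount") is not None]


clean = drop_null_amounts([{"amount": 5.0}, {"amount": None}])
print(RUN_LOG[-1]["stage"], RUN_LOG[-1]["rows_in"], RUN_LOG[-1]["rows_out"])
```

Comparing `rows_in` with `rows_out` per stage is a simple way to isolate where records disappear, which is the kind of context that raw system metrics alone do not give you.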

When you can see everything in your data pipeline, you’re more likely to identify data tech debt early. That is more work initially, but it is a big time-saver. The upfront price of addressing errant transformations or code snippets pays dividends. It means you won’t discover that the foundation is cracked only after the entire company relies on it.

Krishna Puttaswamy and Suresh Srinivas on Uber’s data engineering team explain it this way:

“While services and service quality tend to get more focus due to immediate visibility in failures/breakage, data and related tools often tend to take a backseat. But fixing them and bringing them on par with the level of rigor in service tooling/management becomes extremely important at scale, especially if data plays a critical role in product functionality and innovation.”

5. Front-load the difficult decisions

Take a tip from couples therapy: Address things in the moment, as they arise. Don’t let issues fester. As part of your data operations manifesto, announce that you’ll never put off a difficult decision because you understand the compounding cost. Make that public, make it part of your culture, and make it a reality. 

This is not to say you can’t run tests. If two technologies seem like equivalents and the decision is reversible, just try it. But where the decision isn’t reversible, and building everything on top of the component you’re selecting would restrict your future choices, take the time.

Ask leadership for the latitude to take time over difficult decisions, so you make them early and don’t put things off. Delaying decisions saddles you with future obligations, and that is the source of nearly all data tech debt.

Building data pipeline architectures to be tech debt free

The best data pipelines are built by the experienced. It helps to have placed pipelines into production and felt the fear of failure to know what it takes to build good ones. Mistakes are the best teacher. But you can avoid many of them all the same by knowing your sources of variability, documenting carefully, and following the five steps outlined above.

If you diagram, build for quality, integrate continuously, debug the process itself with an observability tool, and front-load difficult decisions, you’re far better off than most.

And remember: there’s no cognitive error more common in engineering than shackling your future self with all manner of obligations because you took shortcuts. We always imagine our future selves to have a lot more free time than our present selves. But it turns out they’re a lot like us. You’ll be just as busy then, if not more so. Do yourself a favor and stay out of data tech debt jail. Go slow to go fast when building data pipelines.
