October 6, 2021 By Databand 5 min read


Perhaps the trickiest part about managing data pipelines is understanding the ghost in the machine—the data ex machina, if you will.

Many pipelines have what feel like personalities. They’re fickle. They mysteriously crash when there’s bad weather. They generate consistently wrong outputs at maddeningly inconsistent times. Some of the issues seem entirely unsolvable.

That’s a big part of why IBM® Databand® exists—to give data engineers visibility into data issues. Everyone wants faster answers to questions like, “Why did we get a runtime error?” or “Why is the job still stuck in the queue?” Often, nobody knows.

But with an observability platform, you can tell. You can finally conduct thorough root cause analysis (RCA) in the moment—and not add another ticket to your towering backlog or leave data debt that you know will come back to bite.

In this guide, we’ll share some of the most common data issues we see when people run pipelines, and some of the root causes that were behind them.

Proximal versus root causes for data issues

How do you fix data quality issues? It starts with knowing that what separates remarkable data engineers from the rest is their ability to seek out the root cause of data issues. Anyone can reset the pipeline, shrug, and resume work. Very few play detective to get to the bottom of the issue, though that’s what’s needed.

It’s the difference between being satisfied with proximal causes or root causes. Proximal causes are the things that appear to have gone wrong—like a runtime error. The root cause is the thing that caused the proximal cause, and it’s much more difficult to suss out. Sometimes proximal causes are root causes, but rarely.

Think of proximal causes as mere alerts. They’re telling you that somewhere in your pipeline is a root error. Ignore it at your own peril, because that data debt compounds.

Common proximal causes (common examples of data problems)

When it rains, it pours, and when you have one issue, you tend to have many. Below are common proximal data issues; they are not mutually exclusive, and the list is far from exhaustive:

  • The schedule changed
  • The pipeline timed out
  • A job got stuck in a queue
  • There was an unexpected transformation
  • A specific run failed (perhaps it fails right as it starts)
  • The run took abnormally long
  • There was a system-wide failure
  • There was a transformation error
  • Many jobs failed the night prior
  • There was an anomalous input size
  • There was an anomalous output size
  • There was an anomalous run time
  • A task stalled unexpectedly
  • There was a runtime error

But that isn’t all, is it? Again, think of these not as issues but as signals. These are all the things that can go wrong that signify something more troubling has occurred. Many will appear concurrently.
An observability platform can be really helpful in sorting through them. It’ll allow you to group co-occurring issues to make sense of them.

You can also group issues according to the dimension of data quality they aggregate up to—such as fitness, lineage, governance, or stability. Grouping data issues this way shows you the dimensions where you’re having the most issues, and it can put what seem like isolated issues into context.
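
If your alerts arrive as structured records, a few lines of Python are enough to start this kind of rollup. Here’s a minimal sketch; the alert fields and the dimension mapping are illustrative assumptions, not a Databand API:

    # Minimal sketch: group pipeline alerts by run and by data quality dimension.
    # The alert fields and the dimension mapping below are illustrative assumptions.
    from collections import Counter, defaultdict

    # Hypothetical mapping from alert type to the quality dimension it rolls up to
    DIMENSION = {
        "runtime_error": "stability",
        "task_stalled": "stability",
        "anomalous_input_size": "fitness",
        "anomalous_output_size": "fitness",
        "schema_change": "governance",
        "unexpected_transformation": "lineage",
    }

    def summarize(alerts):
        """alerts: iterable of dicts like {"run_id": "...", "type": "..."}."""
        by_run = defaultdict(list)   # which alerts co-occurred in the same run
        by_dimension = Counter()     # where you're having the most issues
        for alert in alerts:
            by_run[alert["run_id"]].append(alert["type"])
            by_dimension[DIMENSION.get(alert["type"], "other")] += 1
        return by_run, by_dimension

    alerts = [
        {"run_id": "2021-10-05", "type": "task_stalled"},
        {"run_id": "2021-10-05", "type": "runtime_error"},
        {"run_id": "2021-10-05", "type": "anomalous_output_size"},
    ]
    runs, dimensions = summarize(alerts)
    print(dimensions.most_common())  # [('stability', 2), ('fitness', 1)]

If most of a run’s alerts roll up to stability, for example, that points you toward infrastructure or orchestration rather than the data itself.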

And of course, you don’t have to wait for a job to fail to try this, either. If you have Databand, it lets you retroactively investigate anomalies (it captures all that historical metadata) so you can get clear on what’s causal and what’s merely correlated.

This is how you can pick out an issue like a task stalling from among a dozen errors and, by testing across many issues, conclude that the root cause is probably a cluster provisioning failure. And that’s how you should look at it: always be hunting for the root cause of the data issue.

The most common root causes

Root causes are the end of the road. They should be the original event in the line of causation—the first domino, as it were—and mostly explain the issue. If that root cause of the data issue doesn’t occur, neither should any of the proximal causes. It is directly causal to all of them.

Root causes, of course, aren’t always clear, and correlations aren’t always exact. If you aren’t feeling confident about your answer, try this thought experiment to tease out your true confidence: Say your boss tells you your team will go all-in on your hypothesis, nobody’s going to check it before it goes into production, and your name will be all over it. If it’s wrong, it’s all your fault. What 0-100 confidence score would you give your hypothesis? If it’s lower than 70, keep investigating.

Common root cause data issues include:

1. User error: We’ll start with user errors because they’re common. Perhaps someone entered the wrong schema or the wrong value, so the pipeline either can’t read the data or does the right thing with incorrect values, and now you have a task failure.

2. Improperly labeled data: Sometimes rows shift on a table and the right labels get applied to the wrong columns.

3. Data partner missed a delivery: Also very common. You can build a bulletproof system, but you can’t control what you can’t see: if the issues are in the source data, they’ll cause perfectly good pipelines to misbehave. A simple freshness and row-count check (see the sketch after this list) can flag a missed delivery before downstream jobs run.

4. There’s a bug in the code: This is common when there’s a new version of the pipeline. You can figure this out pretty quickly with version control tools like Git or a platform like GitLab: compare the production code to a prior version and run a test with that prior version.

5. OCR data error: Your optical scanner reads the data wrong, leading to strange (or missing) values.

6. Decayed data issue: The dataset is so out of date as to be no longer valid.

7. Duplicate data issue: Often a vendor was unable to deliver data, so the pipeline ran against last week’s data again.

8. Permission issue: The pipeline failed because the system lacked permission to pull the data, or conduct a transformation.

9. Infrastructure error: Perhaps you maxed out your available memory or API call limit, your Apache Spark cluster didn’t run, or your data warehouse is being uncharacteristically slow, causing the run to proceed without the data.

10. Schedule changes: Someone (or something) changed the schedule, causing the pipeline to run out of order or not run at all.

11. Biased data set: Very tricky to sort out. There’s no good way to suss this out except to run tests to see whether the data is anomalous compared to a similar, trusted data set, or to dig into how it was collected or generated.

12. Orchestrator failure: Your pipeline scheduler failed to schedule or run the job.

13. Ghost in the machine (data ex machina): It’s truly unknowable. It’s tough to admit that’s the case, but it’s true for some things. The best you can do is document and be ready for next time when you can gather more data and start to draw correlations.
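
A handful of cheap checks at the top of a pipeline can surface several of these root causes, like a missed or stale delivery, a mislabeled column, or an anomalously small input, before they cascade into the proximal errors above. Here’s a minimal sketch using pandas; the expected columns, thresholds, and file layout are illustrative assumptions, not anyone’s real pipeline:

    # Minimal sketch of pre-run input checks for a CSV delivery.
    # The expected columns and thresholds below are illustrative assumptions.
    import pandas as pd

    EXPECTED_COLUMNS = {"order_id", "amount", "created_at"}  # hypothetical schema
    MIN_ROWS = 10_000      # a delivery much smaller than this is suspicious
    MAX_AGE_DAYS = 2       # anything older is treated as decayed or re-delivered

    def validate_input(path):
        """Return a list of problems; an empty list means the input looks sane."""
        problems = []
        df = pd.read_csv(path)

        # Root causes 1 and 2: user error or mislabeled data often shows up
        # as columns that are missing or renamed
        missing = EXPECTED_COLUMNS - set(df.columns)
        if missing:
            problems.append(f"missing columns: {sorted(missing)}")

        # Root cause 3: a missed or partial delivery usually means far fewer rows
        if len(df) < MIN_ROWS:
            problems.append(f"only {len(df)} rows, expected at least {MIN_ROWS}")

        # Root causes 6 and 7: decayed or re-delivered data shows up as stale timestamps
        if "created_at" in df.columns:
            newest = pd.to_datetime(df["created_at"], errors="coerce").max()
            if pd.notna(newest) and (pd.Timestamp.now() - newest).days > MAX_AGE_DAYS:
                problems.append(f"newest record is {newest}, {MAX_AGE_DAYS}+ days old")

        return problems

Run something like this before the expensive transformations kick off, and log the result either way; the history is what lets you separate a one-off hiccup from a trend.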

And then, of course, there’s the reality where the root cause isn’t entirely clear. Many things are correlated, and they’re probably interdependent, but there’s no one neat answer—and after making changes, you’ve fixed the data issue, though you’re not sure why.

In those cases, as with any, note your hypothesis in the log, and when you can return to it, continue testing historical data, and be on the lookout for new issues and more explanatory causes.

Putting it into practice to reduce data issues

The characteristic that most separates the amateur data engineer from the expert is their ability to sort out root causes, and their comfort with ambiguous answers. Proximal causes are sometimes root causes, but not always. Root causes are sometimes correlated with specific proximal causes, but not always. Sometimes there’s no distinguishing between what’s data bias and what’s human error.

Great data engineers know their pipelines are fickle, and sometimes have personalities. But they’re attuned to them, have tools to measure them, and are always on the hunt for a more reliable explanation.

 

See how IBM Databand provides data pipeline monitoring to quickly detect data incidents like failed jobs and runs so you can handle pipeline growth. If you’re ready to take a deeper look, book a demo today.
