Perhaps the trickiest part about managing data pipelines is understanding the ghost in the machine—the data ex machina, if you will.
Many pipelines have what feel like personalities. They’re fickle. They mysteriously crash when there’s bad weather. They generate consistently wrong outputs at maddeningly inconsistent times. Some of the issues seem entirely unsolvable.
That’s a big part of why our founders started Databand—to give data engineers visibility into data issues. Everyone wants faster answers to questions like, “Why did we get a runtime error?” or “Why is the job still stuck in the queue?” Often, nobody knows.
But with an observability platform, you can tell. You can finally conduct thorough root cause analysis (RCA) in the moment—and not add another ticket to your towering backlog or leave data debt that you know will come back to bite.
In this guide, we’ll share some of the most common data issues we see when people run pipelines, and some of the root causes behind them.
Proximal versus root causes for data issues
How do you fix data quality issues? It starts with knowing that what separates remarkable data engineers from the rest is their ability to seek out the root cause of data issues. Anyone can reset the pipeline, shrug, and resume work. Very few play detective to get to the bottom of the issue, though that’s what’s needed.
It’s the difference between being satisfied with proximal causes or root causes. Proximal causes are the things that appear to have gone wrong—like a runtime error. The root cause is the thing that caused the proximal cause, and it’s much more difficult to suss out. Sometimes proximal causes are root causes, but rarely.
Think of proximal causes as mere alerts. They’re telling you that somewhere in your pipeline is a root error. Ignore it at your own peril, because that data debt compounds.
Common proximal causes (aka common examples of data problems)
When it rains, it pours, and when you have one issue, you tend to have many. Below are common proximal data issues—they are not mutually exclusive, and the list is far from exhaustive:
- The schedule changed
- The pipeline timed out
- A job got stuck in a queue
- There was an unexpected transformation
- A specific run failed (perhaps it fails right as it starts)
- The run took abnormally long
- There was a system-wide failure
- There was a transformation error
- Many jobs failed the night prior
- There was an anomalous input size
- There was an anomalous output size
- There was an anomalous run time
- A task stalled unexpectedly
- There was a runtime error
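Several of the signals above—anomalous input size, output size, and run time—lend themselves to simple automated checks against historical runs. Here’s a minimal sketch in Python, assuming you already record per-run metrics; the function name, threshold, and sample values are illustrative, not part of any particular platform’s API:

```python
from statistics import mean, stdev

def is_anomalous(history, current, z_threshold=3.0):
    """Flag a run metric that deviates more than z_threshold
    standard deviations from its historical mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold

# Example: run times (in seconds) for recent nightly jobs
run_times = [312, 305, 298, 320, 310, 301]
print(is_anomalous(run_times, 1450))  # a 1,450-second run stands out
print(is_anomalous(run_times, 308))   # a typical run does not
```

In practice, an observability platform does this kind of baselining for you across every metric it captures, but the underlying idea is the same.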
But that isn’t all, is it? Again, think of these not as issues, but signals. These are all the things that can go wrong that signify something more troubling has occurred. Many will appear concurrently.
An observability platform can be really helpful in sorting through them. It’ll allow you to group co-occurring issues to make sense of them.
You can also group issues according to the dimension of data quality they aggregate up to—such as fitness, lineage, governance, or stability. Grouping data issues this way shows you the dimensions along which you’re having the most issues, and can put what seem like isolated issues into context.
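As a rough illustration of that grouping, here’s a minimal Python sketch; the alert names and the dimension mapping are hypothetical examples, not a Databand API:

```python
from collections import defaultdict

# Hypothetical mapping from alert type to the data quality
# dimension it rolls up to; the categories are illustrative.
DIMENSION = {
    "schema_change": "governance",
    "missed_delivery": "fitness",
    "anomalous_output_size": "fitness",
    "stale_upstream_table": "lineage",
    "task_stalled": "stability",
    "runtime_error": "stability",
}

def group_by_dimension(alerts):
    """Group raw pipeline alerts by the data quality dimension
    they aggregate up to, so co-occurring issues read in context."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[DIMENSION.get(alert, "unclassified")].append(alert)
    return dict(groups)

alerts = ["task_stalled", "runtime_error", "anomalous_output_size"]
print(group_by_dimension(alerts))
# {'stability': ['task_stalled', 'runtime_error'], 'fitness': ['anomalous_output_size']}
```

Seeing two stability alerts land together, for instance, is a stronger hint about where to dig than either alert alone.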
And of course, you don’t have to wait for a job to fail to try this, either. If you have Databand, it lets you retroactively investigate anomalies (it captures all that historical metadata) so you can get clear on what’s causal and what’s merely correlated.
This is how you can pick out an issue like a task stalling from among a dozen errors and, by testing across many issues, conclude that the root cause is probably a cluster provisioning failure. And that’s how you should look at it. Always be hunting for that root cause of the data issue.
The 13 most common root causes
Root causes are the end of the road. They should be the original event in the line of causation—the first domino, as it were—and mostly explain the issue. If that root cause of the data issue doesn’t occur, neither should any of the proximal causes. It is directly causal to all of them.
Root causes, of course, aren’t always clear, and correlations aren’t always exact. If you aren’t feeling confident about your answer, a probabilistic way to tease out your true confidence score is to try this thought experiment: Say your boss tells you your team will go all-in on your hypothesis and nobody’s going to check it before it goes into production, and your name will be all over it. If it’s wrong, it’s all your fault. What 0-100 confidence score would you give your hypothesis? If it’s lower than 70, keep investigating.
Common root cause data issues include:
1. User error: We’ll start with user errors because they’re common. Perhaps someone entered the wrong schema or the wrong value, so the pipeline either can’t read the data or does the right thing with incorrect values, and now you have a task failure.
2. Improperly labeled data: Sometimes rows shift on a table and the right labels get applied to the wrong columns.
3. Data partner missed a delivery: Also very common. You can build a bulletproof system, but you can’t control what you can’t see. If the issues are in the source data, they’ll cause perfectly good pipelines to misbehave.
4. There’s a bug in the code: This is common when there’s a new version of the pipeline. You can figure this out pretty quickly with versioning software like Git or GitLab. Compare the production code to a prior version and run a test with that prior version.
5. OCR data error: Your optical scanner reads the data wrong, leading to strange (or missing) values.
6. Decayed data issue: The dataset is so out of date as to be no longer valid.
7. Duplicate data issue: Often, a vendor was unable to deliver fresh data, so the pipeline ran again on last week’s data.
8. Permission issue: The pipeline failed because the system lacked permission to pull the data, or conduct a transformation.
9. Infrastructure error: Perhaps you maxed out your available memory or API call limit, your Apache Spark cluster didn’t run, or your data warehouse is being uncharacteristically slow, causing the run to proceed without the data.
10. Schedule changes: Someone (or something) changed the schedule, causing the pipeline to run out of order, or not run at all.
11. Biased data set: Very tricky to sort out. There’s no good way to suss this out except by running some tests to see if the data is anomalous compared to a similar true data set, or figuring out how it was collected or generated.
12. Orchestrator failure: Your pipeline scheduler failed to schedule or run the job.
13. Ghost in the machine (data ex machina): It’s truly unknowable. It’s tough to admit that’s the case, but it’s true for some things. The best you can do is document and be ready for next time when you can gather more data and start to draw correlations.
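For a suspected biased data set (root cause 11 above), one of those tests can be as simple as comparing the suspect sample’s distribution against a trusted reference. Here’s a minimal pure-Python sketch of the two-sample Kolmogorov–Smirnov statistic; the sample values are illustrative, and in practice you’d likely reach for `scipy.stats.ks_2samp` and a proper significance threshold:

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical cumulative distribution functions."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    max_gap = 0.0
    for v in sorted(set(a) | set(b)):
        # Fraction of each sample less than or equal to v
        cdf_a = sum(1 for x in a if x <= v) / len(a)
        cdf_b = sum(1 for x in b if x <= v) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

trusted = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
suspect = [10, 11, 12, 13, 14, 30, 31, 32, 33, 34]  # half shifted upward
print(ks_statistic(trusted, suspect))  # → 0.5
print(ks_statistic(trusted, trusted))  # → 0.0, identical distributions
```

A large statistic tells you the suspect sample is drawn from a noticeably different distribution than the reference—a signal worth chasing, even if it doesn’t tell you how the bias got there.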
And then, of course, there’s the reality where the root cause isn’t entirely clear. Many things are correlated, and they’re probably interdependent, but there’s no one neat answer—and after making changes, you’ve fixed the data issue, though you’re not sure why.
In those cases, as with any, note your hypothesis in the log, and when you can return to it, continue testing historical data, and be on the lookout for new issues and more explanatory causes.
Putting it into practice to reduce data issues
The characteristic that most separates the amateur data engineer from the expert is their ability to sort out root causes, and their comfort with ambiguous answers. Proximal causes are sometimes root causes, but not always. Root causes are sometimes correlated with specific proximal causes, but not always. Sometimes there’s no distinguishing between what’s data bias and what’s human error.
Great data engineers know their pipelines are fickle, and sometimes have personalities. But they’re attuned to them, have tools to measure them, and are always on the hunt for a more reliable explanation.