In theory, data quality is everyone’s problem. When it’s poor, it degrades marketing, product, customer success, brand perception—everything. In theory, everyone should work together to fix it. But that’s in theory.
In reality, you need someone to take ownership of the problem, investigate it, and tell others what to do. That’s where data engineers come in. Their perennial challenge? That everyone involved in using the data has a different understanding of what “data” means. And it’s not really their fault.
The further someone is from the source of that data and the data pipelines that carry it, the more they tend to engage in magical thinking about how it can be used, often simply for lack of awareness. According to one data engineer we talked to when researching this guide, “Business leaders are always asking, ‘Hey, can we look at sales across this product category?’ when on the backend, it’s virtually impossible with the current architecture.”
Similarly, businesses rely on the data from pipelines they can’t fully observe. Without accurate benchmarks or a seasoned professional who can sense that output values are off, you can be data-driven right off a cliff.
In this guide, the Databand team has compiled a resource for grappling with data quality issues within and around your pipeline – not in theory, but in practice. And that starts with a discussion of what exactly constitutes data quality for data engineers.
What are the four characteristics of data quality?
While academic conceptions of data quality provide an interesting foundation, we’ve found that for data engineers, it’s different. Diagnosing pipeline data quality issues for dozens of high-volume organizations over the last few years has shown us that engineers need a simpler, more credible map. Only with that map can you begin to conceptualize systems that will keep your data in proper order.
We’ve condensed the typical six or seven data quality dimensions (you will find hundreds of variants online) into just four. We also prefer the term “data health” to “data quality,” because it frames data as an ongoing system that must be managed. Without checkups, pipelines can grow sick and stop working.
Dimension 1: Fitness—is this data fit for its intended use?
The operative word here is “intended.” No two companies’ uses are identical, so fitness is always in the eye of the beholder. To test fitness, take a random sample of records and test how they perform for your intended use.
Within fitness, look at:
- Accuracy—does the data reflect reality? (Within reason. As they say, all models are wrong. Some are useful.)
- Integrity—does fitness remain high throughout the data’s lifecycle? (It’s a simple equation: Integrity = quality / time)
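As a sketch of that spot-check, here is one way it could look in Python. The fitness rules themselves (a positive amount, a known currency code) and the names `is_fit` and `sample_fitness` are hypothetical stand-ins for whatever “intended use” means in your pipeline:

```python
import random

# Hypothetical rule: downstream reports only understand these currencies.
EXPECTED_CURRENCIES = {"USD", "EUR", "GBP"}

def is_fit(record: dict) -> bool:
    """Return True if a record satisfies the intended-use rules."""
    return (
        isinstance(record.get("amount"), (int, float))
        and record["amount"] > 0
        and record.get("currency") in EXPECTED_CURRENCIES
    )

def sample_fitness(records: list, sample_size: int = 100, seed: int = 0) -> float:
    """Spot-check a random sample and return the share of fit records."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    return sum(is_fit(r) for r in sample) / len(sample)
```

If the fitness rate on a random sample drops below whatever threshold your use case tolerates, you have a fitness problem worth investigating before anyone downstream does.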
Dimension 2: Lineage—where did this data come from? When? Where did it change? Is it where it needs to be?
Lineage is your timeline. It helps you understand whether your data health problem starts with your provider. If it’s fit when it enters your pipeline and unfit when it exits, that’s useful information.
Within lineage, look at:
- Source—is my data source provider behaving well? E.g. Did Facebook change an API?
- Origin—where did the data already in my database come from? E.g. Perhaps you’re not sure who put it there.
Dimension 3: Governance—can you control it?
These are the levers you can pull to move, restrict, or otherwise control what happens to your data. It’s the procedural stuff, like loads and transformations, but also security and access.
Within governance, look at:
- Data controls—how do we identify which data should be governed and which should be open? What should be available to data scientists and users? What shouldn’t?
- Data privacy—where is there currently personally identifiable info (PII)? Can we automatically redact PII like phone numbers? Can we ensure that a pipeline that accidentally contains PII fails or is killed?
- Regulation—can we track regulatory requirements, ensure we’re compliant, and prove we’re compliant if a regulator wants to know? (Under GDPR, CCPA, NY SHIELD, etc.)
- Security—who has access to the data? Can I control it? With enough granularity?
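For illustration, the PII guard described above might be sketched like this. The regex is deliberately crude, and the `PIILeakError` pattern is our assumption rather than any particular platform’s feature; production PII scanners are far more thorough:

```python
import re

# A crude North-American-style phone number pattern; real scanners use
# many patterns (emails, SSNs, addresses) and often ML-based detection.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

class PIILeakError(Exception):
    """Raised to kill a pipeline run that unexpectedly contains PII."""

def redact_phone_numbers(text: str) -> str:
    """Replace anything that looks like a phone number with a placeholder."""
    return PHONE_RE.sub("[REDACTED]", text)

def assert_no_pii(rows: list) -> None:
    """Fail fast if any row still contains a phone-like string."""
    for i, row in enumerate(rows):
        if PHONE_RE.search(row):
            raise PIILeakError(f"possible PII in row {i}")
```

The key design choice is failing loudly: a pipeline that dies on suspected PII is recoverable; one that silently ships PII downstream is a regulatory incident.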
Dimension 4: Stability—is the data complete and available in the right frequency?
Your data may be fit, meaning your downstream systems function, but is it as accurate as it could be, and is that consistently the case? If your data is fit, but the accuracy varies widely, or it’s only available in monthly batch updates and you need it hourly, it’s not stable.
Stability is one of the biggest areas where data observability tools can help. Pipelines are often a black box unless you can monitor what happens inside and get alerts.
To check stability, check against a benchmark dataset.
Within stability, look at:
- Consistency—does the data going in match the data going out? If it appears in multiple places, does it mean the same thing? Are weird transformations happening at predictable points in the pipeline?
- Dependability—is the data present when needed? E.g. If I build a dashboard, it behaves properly and I don’t get calls from leadership.
- Timeliness—is it on time? E.g. If you pay NASDAQ for daily data, are they providing fresh data on a daily basis? Or is it an internal issue?
- Bias—is there bias in the data? Is it representative of reality?
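One minimal way to run that benchmark check is to compare summary statistics of each incoming batch against the benchmark dataset within a tolerance. The 10% relative tolerance and the function name here are illustrative assumptions, not a standard:

```python
from statistics import mean

def within_tolerance(current: list, benchmark: list, rel_tol: float = 0.1) -> bool:
    """Check that a batch's mean stays within rel_tol of the benchmark mean.

    A real stability check would also compare row counts, null rates,
    and distribution shape, not just one moment.
    """
    bench_mean = mean(benchmark)
    return abs(mean(current) - bench_mean) <= rel_tol * abs(bench_mean)
```

A batch that fails this check isn’t necessarily wrong, but it is a signal that the pipeline’s output has drifted from the baseline you trust.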
Bias should be among your top concerns. Bias is often imperceptible to the naked eye, and can infect training data sets and so go entirely undetected. Take, for example, seasonality in the data. If you train a model for predicting consumer buying behavior and you use a dataset from November to December, you’re going to have unrealistically high sales predictions.
Put another way, if you’re trying to predict traffic time across the U.S., have systems in place that will catch when the data you get is only from downtown Manhattan.
Now, bias of this sort isn’t completely imperceptible—some observability platforms (Databand being one of them) have anomaly detection for this reason. When you have seasonality in your data, you have seasonality in your data requirements, and thus seasonality in your data pipeline behavior. You should be able to automatically account for that. Which brings us back to our earlier point—data quality is really about ongoing data pipeline health. Healthy pipelines require frequent checkups, and as we explain next, those checkups can’t be partial.
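To make the seasonality point concrete, here is a toy version of that kind of anomaly detection: compare a new value against history for the same month rather than the whole year. This is a simplified z-score sketch under assumed data, not how any particular observability platform implements it:

```python
from statistics import mean, stdev

def is_seasonal_anomaly(value: float, history_by_month: dict,
                        month: str, z_threshold: float = 3.0) -> bool:
    """Flag a value as anomalous relative to history for the SAME month.

    Comparing November sales to November history (instead of the full
    year) keeps normal seasonality from triggering false alarms.
    """
    history = history_by_month[month]
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold
```

The same value can be perfectly normal in one month and wildly anomalous in another, which is exactly why the comparison window has to respect the season.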
What is good data quality for data engineers?
Good data quality for data engineers is when you have a data pipeline set up to ensure all four data quality dimensions—fitness, lineage, governance, and stability. With those systems in place, you get stable data. With stable data, you get quality decisions. But you must address all four.
As a data engineer, you cannot tackle one dimension of data quality without tackling all. That may seem rather inconvenient given that most engineers are inheriting data pipelines rather than building them from scratch. But such is the reality.
If you optimize for one dimension—say, stability—you may be loading data that hasn’t yet been properly transformed, and fitness can suffer. The data quality dimensions exist in equilibrium. The best analogy we’ve come across to explain this is that of maintaining life in outer space.
On the International Space Station, engineers must balance oxygen levels, water, temperature, and pressure. Getting merely three out of four right can be deadly. And if you over-index on one—like pressure—you throw off the others. Astronauts cannot live off temperature alone.
To achieve a proper balance for data health, you need:
Data quality controls
What systems do you have for manipulating, protecting, and governing your data? With high-volume pipelines, trust is not enough; you need controls that verify.
Data quality testing
What systems do you have for measuring fitness, lineage, governance, and stability? Things will break. You must know where, and why.
Systems to identify data quality issues
If issues do occur—if a pipeline fails to run, or the result is aberrant—do you have anomaly detection to alert you? Or if PII makes it into a pipeline, does the pipeline auto-fail to protect you from violating regulation?
In short, you need a high level of data observability, paired with the ability to act continuously.
Common data pipeline data quality issues (data quality examples)
As a final thought, when you’re diagnosing your data pipeline issues, it’s important to draw a distinction between a problem and its root cause. Your pipeline may have failed to complete. The proximal cause could have been an error in a Spark job. But the root cause? A corruption in the dataset. If you aren’t addressing issues in the dataset, you’ll be forever addressing issues.
Examples of common data pipeline quality issues:
- Non-unicode characters
- Unexpected transforms
- Mismatched data in a migration or replication process
- Pipelines missing their SLA, or running late
- Pipelines that are too resource-intensive or costly
- Hard-to-trace root causes (e.g. an error in a Spark job that traces back to a corrupted dataset)
- A big change in your data volume or sizes
The more detail you get from your monitoring tool, the better. It’s common to discover proximal causes quickly, then take days of taxing, manual investigation to find the root cause. Sometimes your pipeline workflow management tool tells you everything is okay, but a quick glance at the output shows that nothing is okay, because the values are all blank. For instance, Airflow may tell you the pipeline succeeded, but no data actually passed through. Your code ran fine—Airflow gives you a green light, you’re good—but on the data level, it’s entirely unfit.
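A cheap defense against that green-light-but-empty failure mode is a final step that fails the run unless real rows actually arrived. This is a generic sketch, not an Airflow API; in Airflow you would wrap something like it in a task of its own:

```python
class EmptyOutputError(Exception):
    """Raised when a run 'succeeds' but produced no usable rows."""

def check_output_volume(rows: list, min_rows: int = 1) -> int:
    """Final pipeline guard: count rows with at least one non-blank value.

    Raises instead of returning zero, so the orchestrator marks the run
    failed rather than reporting a misleading green light.
    """
    non_blank = [r for r in rows if any(v not in (None, "") for v in r.values())]
    if len(non_blank) < min_rows:
        raise EmptyOutputError(f"expected >= {min_rows} rows, got {len(non_blank)}")
    return len(non_blank)
```

With a guard like this in place, “the code ran” and “the data arrived” stop being two different questions with two different answers.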
Constant checkups, and the ability to peer deeply into your pipeline, are how you maintain the right balance of fitness, lineage, governance, and stability to produce high-quality data. And high-quality data is how you support an organization in practice, not just in theory.