The best data quality framework for senior platform engineers
In many ways, you’re only ever as good as your last delivery, and for many of us, continuous delivery means continuous scrutiny. You have to keep up quality, but also the perception of quality, because once trust in the data is broken, your job becomes much more difficult.
That’s why any organization that considers data important to the functioning of its business, whether its consumers are internal or external, needs to practice data quality management and implement a data quality framework. This is what it sounds like: developing repeatable, ideally automatic processes and patterns to ensure that the data entering your system and being delivered downstream is what you and your consumers expect.
And as you senior data engineers well know, understanding those expectations is half the battle. Much of the other half is spent translating those expectations into tracking and alerting that will help you find and fix issues in complicated ingestion processes.
In this guide, we share strategies to ensure that data quality management isn’t simply layered on top of your existing hard-coded processes, but is built into every DAG. To manage it well, you need to detect anomalies long before low-quality data enters your transformation layer.
What is a data quality framework?
Let’s begin with a definition. A data quality framework is a tool that an organization can use to define relevant data quality attributes and to guide a data quality management process that continuously ensures data quality meets consumers’ expectations (SLAs).
That sentence is deceptively complex, so let’s unpack it:
- You need a process: Unless you have unlimited engineer hours, a process should include repeatable and ideally automatic unit tests at every stage of your data pipeline (especially at ingestion if you want to detect issues proactively), and a workflow for dealing with data issues.
- You must be continuously ensuring: The faster your data moves, the faster its quality decays, a decay often described as data drift. High-velocity data of the sort many of us now deal with requires frequent checks.
- You must meet consumers’ expectations, not your own: Data quality is fundamentally a business process. Your data SLAs, or “service-level agreements,” are with consumers, and nothing on the engineering side matters if data scientists can’t run their models, if customers receive inaccurate shipping delivery estimates, or if your regional vice president has to go into the board meeting empty-handed because the dashboard didn’t load.
There’s a lot that goes into delivering on the above promise, and each of those elements is rife with dependencies. For instance, if you were to ask yourself how to architect such a system, you’d be asking the following questions:
1. How will you come to understand consumers’ expectations around data quality?
2. How will you translate those expectations into quantifiable measures of data quality?
3. How will you implement automatic measures of quality for each of your pipelines?
4. How will you determine thresholds for each dimension of data quality?
5. How will you alert your team when data violates those thresholds?
6. What will your team do when they receive an alert?
7. How will they judge the validity and urgency of the alert?
8. If there is an issue, how will they identify the proximate cause(s)?
9. How will they identify the root cause(s)?
10. How will they let consumers know what to expect?
11. How will they address the root cause?
12. How will they verify that they’ve addressed the root cause?
13. How will they document what’s happened to build knowledge?
Seem like a long, potentially unluckily numbered list? Never fear. You can delegate.
Question 1 is best suited for the business analyst in your pod or squad. It’s up to them to talk to the business units to decompose user stories, stated preferences, implied preferences, requests, and event post-mortems into a list of “demands” for the data. These are the qualitative expectations consumers have of the data, and it’s a bit of a two-way conversation, for they may not have the words to describe what they want exactly. (Unless your data consumers are your data scientists, which can really speed this up.)
Question 2 is for you and your data scientists to answer together (especially if they are also the consumer). Given the characteristics of your data for each pipeline, what attributes can you actually measure to further decompose the list of qualitative expectations into a list of quantitative measurements?
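To make the Question 2 decomposition concrete, here is a minimal sketch in Python. It assumes a hypothetical orders feed and two hypothetical consumer expectations; the field names, thresholds, and the `Measure` helper are illustrative assumptions, not part of any particular tool. Each qualitative expectation is paired with a function that turns a batch of records into a number, plus a threshold that number must meet.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict  # one record, e.g. {"order_id": 1, "customer_id": "c1", "amount": 9.5}
Metric = Callable[[list], float]


@dataclass
class Measure:
    """One quantitative measure derived from one qualitative expectation."""
    expectation: str   # what the consumer asked for, in their own words
    metric: Metric     # how to compute a number from a batch of rows
    threshold: float   # minimum acceptable value (hypothetical)

    def passes(self, rows: list) -> bool:
        return self.metric(rows) >= self.threshold


def non_null_rate(field: str) -> Metric:
    """Share of rows where `field` is present and not None."""
    def metric(rows: list) -> float:
        if not rows:
            return 0.0
        return sum(r.get(field) is not None for r in rows) / len(rows)
    return metric


def positive_rate(field: str) -> Metric:
    """Share of rows where `field` is a number greater than zero."""
    def metric(rows: list) -> float:
        if not rows:
            return 0.0
        hits = sum(
            isinstance(r.get(field), (int, float)) and r[field] > 0 for r in rows
        )
        return hits / len(rows)
    return metric


# Hypothetical expectations gathered by the business analyst.
MEASURES = [
    Measure("Every order is tied to a customer", non_null_rate("customer_id"), 0.99),
    Measure("Order amounts are real", positive_rate("amount"), 1.0),
]

batch = [
    {"order_id": 1, "customer_id": "c1", "amount": 9.5},
    {"order_id": 2, "customer_id": None, "amount": 12.0},
    {"order_id": 3, "customer_id": "c3", "amount": 3.0},
    {"order_id": 4, "customer_id": "c4", "amount": 7.25},
]

for m in MEASURES:
    print(m.expectation, m.metric(batch), "PASS" if m.passes(batch) else "FAIL")
```

Keeping the consumer's wording next to the metric makes it easier to revisit a threshold with the business analyst when expectations change.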
Depending on which data quality model you follow, there are either four or five dimensions of quality to look at. At Databand we prefer a model with four characteristics, which break down into attributes such as these:
- Accuracy—the data reflects reality
- Integrity—quality over time
- Source—is the provider delivering on your expectations?
- Origin—where’d it come from?
- Data controls
- Data privacy
With those metrics in hand, data engineers can address Questions 3-13 and begin constructing a data quality management strategy. And before we get into precisely how to do that, it’s worth asking, why go through all this effort?
Why a data quality framework is so damn important
A few years ago, an innocuous configuration change in a major retailer’s Microsoft Dynamics CRM meant that the inventory count displayed for each item online ceased to reflect reality. The counter simply stopped updating.
People continued to purchase, but the displayed count stayed constant. By the time the data engineering team was alerted, things had gotten bad.
Most items were available for purchase online, but also for in-store pickup. Lots of people chose in-store pickup. The orders were processed, and items that did not exist were nevertheless sold. So consumers visited stores where retail associates scrambled to find substitutes or promise discounts or somehow appease them. Lines formed. Store visitors had to wait to purchase and were turned off by so many people angrily jabbing their phones. And because it took days to discover the problem and fix the pipeline, it was a few days more before things were resolved.
Factoring in loss of brand reputation, the mistake cost tens of millions, and need not have happened.
Which is all to say, data issues compound. They can be difficult to spot and address, and grow unseen. It’s easy to fall into a pattern of assuming that everything is working just because you’re still drawing some insights, even while you’re accruing an increasing amount of subterranean data debt.
Furthermore, the truest signs of data quality issues also tend to be lagging indicators. For example, consumers telling you. Or as in the previous retail CRM example, thousands of retail managers and regional vice presidents telling you. That’s bad. That means that the data has been in your system for some time and it will take days for a fix to bear results. Talk about missing consumer expectations.
This is the situation the shipping startup Shipper found itself in, and why they invested so heavily in preventing it from ever occurring again. Their data engineering team delivers data that is as near to real time as possible to an application that helps ecommerce vendors deliver their inventory to a shipping port. They have to worry not just about their consumers’ expectations, but about the expectations of their consumers’ consumers. And when their system was sometimes two days out of date, it created cascading ripples of missed expectations. Hence, they invested heavily in data quality management and tools that could give them early warning alerts with automatic checks.
Data quality management is a way to make the data quality checks automatic and pervasive, so you’re combating the forces of entropy on your datasets and pipelines with an equal and opposite amount of force.
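One of the simplest automatic checks for the kind of staleness described above is a freshness check: compare the newest record's timestamp to the current time and flag the dataset when the lag exceeds what consumers were promised. The sketch below is an assumption-laden illustration; the function name `check_freshness` and the one-hour limit are hypothetical, not taken from Shipper or any specific tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def check_freshness(
    latest_event: datetime,
    max_lag: timedelta,
    now: Optional[datetime] = None,
) -> dict:
    """Report how far behind the data is and whether it breaches the limit.

    Both timestamps are expected to be timezone-aware; mixing naive and
    aware datetimes raises a TypeError in Python.
    """
    now = now or datetime.now(timezone.utc)
    lag = now - latest_event
    return {
        "lag_hours": lag.total_seconds() / 3600,
        "stale": lag > max_lag,
    }


# Hypothetical scenario: the newest record is two days old,
# but consumers were promised data no more than one hour old.
result = check_freshness(
    latest_event=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    max_lag=timedelta(hours=1),
    now=datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc),
)
print(result)
```

Run on a schedule at ingestion, a check like this turns a lagging indicator (a consumer complaint) into a leading one (an alert to the team).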
Building your data quality framework
Let’s return to our earlier example and list of questions. Your analysts talk to the business to collect requirements, and you receive a list of quantitative consumer expectations from your data scientists. How do you then move forward and build the system?
You draw out your data quality framework. Your framework should first and foremost acknowledge that the system is a cycle and everything you learn about consumers’ expectations, which are always evolving, should influence the system.
Let’s explore each stage of that cycle:
- Qualify—Business analysts decompose consumers’ needs into a list of requirements.
- Quantify—Data scientists decompose requirements into quantifiable measures of data quality, which, at this point, are still just theoretical.
- Plan—Data engineers translate quantitative measures of data quality into checks they can run in their data pipeline observability platform. Such a platform is critical: workflow schedulers and processing engines like Airflow and Spark can detect issues with a pipeline itself, but not within the data, which is where most issues arise. Your engineers will need to understand what can and cannot be tracked in your system.
- Implement—Data engineers implement the tracking and test it. For a very simple example, if all the data needs to be present, with no missing fields or columns, you can set an alert around data completeness parameters. An observability platform like Databand makes this possible, and can allow you to set up anomaly detection so you need not set every value manually.
- Manage—Data engineers backtest these alerts against historical pipeline data to verify that they indeed would have functioned as intended. If so, they place them into production along with an incident management plan that covers who is responsible when an alert fires and what they’ll do when they receive it.
- Verify—Data engineers and data scientists confirm that the data quality framework has measurably improved performance along the desired metrics. The business analysts confirm with consumers that this is indeed the case.
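The Implement and Manage stages can be sketched in a few lines of plain Python. This is not how Databand or any other observability platform implements its checks; it is a hedged illustration, and the function names, the z-score rule, and the sample scores are all assumptions. It computes a completeness score per run, flags runs that fall far below the historical norm instead of relying on a hand-set value, and replays past runs to see whether the alert would have fired on a known incident.

```python
from statistics import mean, stdev


def completeness(rows: list, required: list) -> float:
    """Fraction of rows in which every required field is present and non-null."""
    if not rows:
        return 0.0
    ok = sum(all(r.get(f) is not None for f in required) for r in rows)
    return ok / len(rows)


def is_anomalous(value: float, history: list, z_threshold: float = 3.0) -> bool:
    """Flag a score more than `z_threshold` standard deviations below the mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value < mu
    return (mu - value) / sigma > z_threshold


def backtest(scores: list, known_incidents: set, warmup: int = 5) -> dict:
    """Replay historical scores in order and compare alerts to known incidents."""
    fired = {
        i for i in range(warmup, len(scores))
        if is_anomalous(scores[i], scores[:i])
    }
    return {
        "fired": sorted(fired),
        "missed": sorted(known_incidents - fired),
        "false_alarms": sorted(fired - known_incidents),
    }


# Hypothetical per-run completeness scores; run 6 is a known past incident.
historical_scores = [0.99, 1.0, 0.98, 0.99, 1.0, 0.99, 0.60, 0.99]
report = backtest(historical_scores, known_incidents={6})
print(report)
```

One design caveat: each run is compared against all earlier runs, including past incidents, which inflates the spread after a bad run. A production check would likely exclude known-bad runs or use a rolling window.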
And what do you do with your framework? You put it into practice.
A good data quality framework means an end to surprises
As we explored in many of our examples, the very worst indicator of a data quality issue is a lagging indicator—say, from a consumer telling you something is broken. So much of what we do in data engineering is build trust along with pipelines.
By investing in a data quality management framework that helps your team automatically identify issues, you’ll create data that’s worth trusting. And that makes your job a lot easier.