Data observability is a turning point for data operations and the industry. The data quality frameworks and data governance strategies that were once nice-to-have philosophies are now actionable with advances in the data observability category.
This web page will be your go-to resource for everything you need to know about data observability. You’ll learn about why data observability was created, what it means, and the different types of observability. More importantly, you can learn about a framework for data observability that you can implement in your organization and some tools that can help.
Modern data systems provide a wide variety of functionality, allowing users to store and query their data in many different ways. The more functionality you add, the more complicated it becomes to ensure that your system works correctly. And this complexity compounds on itself, starting with…
In the past, data infrastructure was built to handle small amounts of data, usually operational data from a few internal sources, and the data was not expected to change very much. Now, many data products rely on data from internal and external sources, and the sheer volume and velocity at which this data is collected can cause unexpected drift, schema changes, transformations, and delays.
Imagine this: you’re a Platform Engineering Lead for a company like Omio. Your entire business model depends on reliably ingesting data every day (or more frequently) from hundreds or thousands of transit providers, large and small, who may or may not have an API.
If any of those providers miss a delivery, make a breaking schema change to better fit their data model, or deliver inaccurate data, it’s on you to fix it. It’s on you to find out which data source is causing the problem, who you need to contact for an explanation, and how you will fix it before missing your SLA. That’s a nightmare scenario.
More and more data is ingested into organizations from external data sources. That’s a massive problem for data engineers.
Why? Because you can’t control your provider’s data model.
Returning to our Omio example: you are ingesting data from hundreds (or thousands) of data sources that may or may not have an API, each with a different data model. That means you need to transform, structure, and aggregate data in all of those formats to make it usable. Even worse, if any of those formats change, it causes a domino effect of failures downstream as rigidly coded logic fails to adapt to the new schema.
Complex ingestion pipelines have created compounding headaches across the industry. In response, vendors have built many interesting tools to simplify this end-to-end process. These managed tools can mostly automate the ingestion and ETL/ELT processes. Combine them and you get a data platform the analytics industry has dubbed the “modern data stack” (MDS). The goal of the MDS is to reduce the time it takes for data to be made usable for end users (typically analysts) so they can start leveraging that data faster. But does all that automation come at a cost?
Don’t take this the wrong way; the ultimate goal of data engineering is advanced analytics. That said, for data-driven organizations, a one-size-fits-all ETL pipeline isn’t going to cut it. For these organizations, the bottom line of the business depends on the amount of control data engineers have over their data’s quality. The more automation you have, the less control you have over how data is delivered. These organizations need to build out custom data pipelines so they can better guarantee data is delivered as expected.
So while the analytics industry has been busy trying to automate away the data engineer’s job, data engineers have had limited access to tools and frameworks that make their lives easier. That is, until now.
“Data observability” is the blanket term for understanding the health and the state of data in your system. Essentially, data observability covers an umbrella of activities and technologies that, when combined, allow you to identify, troubleshoot, and resolve data issues in near real-time.
Because it encompasses a basket of activities, data observability is much more useful for engineers. Unlike the data quality frameworks and tools that came out alongside the concept of the data warehouse, it doesn’t stop at describing the problem. It provides enough context for the engineer to resolve the problem and start conversations to prevent that type of error from occurring again. The way to achieve this is to pull best practices from DevOps and apply them to data operations.
All of that to say, data observability is the natural evolution of the data quality movement, and it’s making DataOps as a practice possible. And to best define what data observability means, you need to understand where DataOps stands today and where it’s going.
Data operations (DataOps) is a workflow that enables an agile delivery pipeline and feedback loop so that businesses can create and maintain their products more efficiently. DataOps allows companies to use the same tools and strategies throughout all phases of their analytics projects, from prototyping to product deployment.
DataOps bridges the gap between data analysts and data engineers. Traditionally, these two groups had different goals and responsibilities; but this has changed with the advent of big data analytics.
By bringing the two disciplines together (data collection and data utilization), data teams could better tackle the problem: how do we improve how we manage data throughout the entire organization?
The DataOps cycle outlines the fundamental activities needed to improve how data is managed within the DataOps workflow. This cycle consists of three distinct stages: Detection, Awareness, and Iteration.
It’s important that this cycle starts with Detection, because the bedrock of the DataOps movement is a data quality initiative.
How can we ensure that we can trust this data? How can we ensure that the data coming through gives us the information we need?
And while detection is an essential first step, as you’ll see, it’s the only stage that’s been possible up until now.
The first stage of the DataOps cycle is validation-focused. It includes the same data quality checks that have been used since the inception of the data warehouse: column schema checks and row-level validations. Essentially, you are ensuring that business rules are applied and adhered to across all datasets in your system.
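As a rough illustration, here is a minimal sketch of the kind of schema and row-level checks that live in the detection stage. The dataset, column names, and business rule are hypothetical placeholders, not a prescription for any particular tool:

```python
# Minimal sketch of detection-stage checks: schema validation plus a
# simple row-level business rule. Dataset, columns, and rules are
# hypothetical placeholders.
import pandas as pd

EXPECTED_SCHEMA = {"order_id": "int64", "amount": "float64", "currency": "object"}

def validate_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of schema violations (missing columns or wrong dtypes)."""
    errors = []
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            errors.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected_dtype:
            errors.append(f"{column}: expected {expected_dtype}, got {df[column].dtype}")
    return errors

def validate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that break a simple business rule (non-positive amounts)."""
    return df[df["amount"] <= 0]

orders = pd.DataFrame(
    {"order_id": [1, 2], "amount": [19.99, -5.0], "currency": ["EUR", "EUR"]}
)
print(validate_schema(orders))  # []
print(validate_rows(orders))    # the row with amount == -5.0
```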
This data quality framework that lives in the detection stage is important, but reactionary by its very nature. It tells you whether data that is already stored in your lake or warehouse, and likely already being used, is in the form you expect.
Another important note: you can only validate datasets against business rules you already know. If you don’t know the causes of issues, you cannot establish new business rules for your engineers to follow. This realization fuels the demand for “shift-left” awareness of data issues and the development of data observability tools that make this possible.
Awareness is the visibility-focused stage of the DataOps cycle. This is where the conversation around data governance comes into the picture and a metadata-first approach is introduced. Centralizing and standardizing pipeline and dataset metadata across your organization gives teams visibility into issues within the entire organization.
The centralization of metadata is crucial to giving the entire organization awareness of the end-to-end health of its data. It allows you to move to a more proactive approach to solving data issues. If bad data is entering your “domain,” you can trace the error to a specific point upstream in your system. Data engineering team A can then look at data engineering team B’s pipelines, understand what is going on there, and collaborate with them to fix the issue.
The reverse also applies: data engineering team B can detect an issue and trace the impact it will have downstream. Data engineering team A then knows the issue is coming and can take whatever measures are necessary to contain it.
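To make that concrete, here is a minimal sketch of upstream and downstream tracing over centralized pipeline metadata, assuming a simple in-memory lineage graph. The dataset names and edges are hypothetical:

```python
# Minimal sketch of tracing issues across a lineage graph built from
# centralized metadata. The graph, dataset names, and edges are hypothetical.
from collections import defaultdict, deque

# Edges point from an upstream dataset to the datasets derived from it.
LINEAGE = {
    "providers_raw": ["bookings_staged"],
    "bookings_staged": ["bookings_clean"],
    "bookings_clean": ["revenue_dashboard", "ml_features"],
}

def downstream_of(dataset: str) -> set[str]:
    """Everything that could be impacted if `dataset` has an issue."""
    impacted, queue = set(), deque([dataset])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

def upstream_of(dataset: str) -> set[str]:
    """Candidate root causes for an issue observed in `dataset`."""
    parents = defaultdict(set)
    for parent, children in LINEAGE.items():
        for child in children:
            parents[child].add(parent)
    sources, queue = set(), deque([dataset])
    while queue:
        for parent in parents[queue.popleft()]:
            if parent not in sources:
                sources.add(parent)
                queue.append(parent)
    return sources

print(downstream_of("bookings_staged"))  # impact analysis for team B
print(upstream_of("revenue_dashboard"))  # root-cause candidates for team A
```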
This has been the biggest missing piece in DataOps. Not only is there now a universal language that all teams can point to and discuss with each other, but data teams can share this information with stakeholders and help them understand what the team plans to do, how it will support the data stakeholders need, and how it will rectify any issues that come up.
The final stage, Iteration, is process-focused and treats data-as-code. Teams ensure they have repeatable, sustainable standards applied to all data development, so the same trustworthy data comes out at the end of every pipeline.
The gradual improvement of the data platform’s overall health is now made possible by the detection of issues, awareness of the upstream root causes, and efficient processes for iteration.
Earlier, we defined data observability as a blanket term for the activities and technologies that help you understand the health and state of the data in your system. In this section, we will summarize those activities and what they accomplish.
From there, you’ll understand what organizational and technological shifts need to occur to implement a data observability framework that enables agile data operations.
To be useful, data observability needs to include activities like monitoring, alerting, tracking, comparison, and analysis.
How is this any different from the activities that data teams already do? The difference lies in how these activities fit into the end-to-end data operations workflow and the level of context they provide on data issues.
For most organizations, observability is siloed. Teams collect metadata on the pipelines they own. Different teams are collecting metadata that may not connect to critical downstream or upstream events. More importantly, that metadata isn’t visualized or reported on a dashboard that can be viewed across teams.
There may be standardized logging policies for one team, but not for another, and there’s no way for other teams to easily access them. Some teams may run algorithms on datasets to ensure they are meeting business rules. But the team that builds the pipelines doesn’t have a way to monitor how the data is transforming within that pipeline and whether it will be delivered in a form the consumers expect. The list can go on and on.
Without the ability to standardize and centralize these activities, teams can’t have the level of awareness they need to proactively iterate their data platform. A downstream data team can’t trace the source of their issues upstream, and an upstream data team can’t improve their processes without visibility into downstream dependencies.
Not all data observability is created equal. The level of context you can achieve depends on what metadata you can collect and provide visibility into. We call this the hierarchy of data observability. Each level is a foundation for the next and allows you to attain a finer grain of observability.
Getting visibility into your operational and dataset health is a sound foundation for any data observability framework.
Monitoring dataset health refers to monitoring your dataset as a whole. It gives you awareness of the state of your data while it sits in a fixed location, which we refer to as data at rest.
This type of monitoring answers questions like: Is the dataset complete and available? Did the expected volume of data arrive? Has the schema changed unexpectedly?
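A minimal sketch of what such data-at-rest checks might look like, assuming a table with a timezone-aware ingestion timestamp. The table, columns, and thresholds are hypothetical:

```python
# Minimal sketch of dataset-level (data at rest) health checks:
# freshness, volume, and schema drift. Columns and thresholds are
# hypothetical placeholders.
from datetime import datetime, timedelta, timezone
import pandas as pd

EXPECTED_COLUMNS = {"booking_id", "provider", "ingested_at"}
MIN_ROWS_PER_DAY = 1_000
MAX_STALENESS = timedelta(hours=6)

def dataset_health(df: pd.DataFrame) -> dict:
    """Freshness, volume, and schema checks; assumes `ingested_at`
    holds timezone-aware timestamps."""
    now = datetime.now(timezone.utc)
    return {
        "fresh": (now - df["ingested_at"].max()) <= MAX_STALENESS,
        "volume_ok": len(df) >= MIN_ROWS_PER_DAY,
        "unexpected_columns": set(df.columns) - EXPECTED_COLUMNS,
        "missing_columns": EXPECTED_COLUMNS - set(df.columns),
    }
```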
Operational monitoring refers to monitoring the state of your pipelines. This type of monitoring gives you awareness into the state of your data while it’s transforming and moving through your pipelines. We refer to this data state as data in motion.
This type of monitoring answers questions like: Did the pipeline run succeed? How long did it take, and were there delays? How many retries were needed, and how long has it been since the last run?
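As an illustration, capturing this execution metadata can be as simple as wrapping each pipeline run and recording its state, duration, and retries. This is a minimal sketch; the task, retry policy, and metadata store are hypothetical:

```python
# Minimal sketch of capturing execution metadata for a pipeline run.
import time
from dataclasses import dataclass

@dataclass
class RunRecord:
    pipeline: str
    state: str          # "success" or "failed"
    duration_s: float
    retries: int

def run_with_metadata(pipeline: str, task, max_retries: int = 2) -> RunRecord:
    """Run `task`, retrying on failure, and capture execution metadata."""
    started = time.time()
    retries = 0
    while True:
        try:
            task()
            state = "success"
            break
        except Exception:
            if retries >= max_retries:
                state = "failed"
                break
            retries += 1
    return RunRecord(pipeline, state, time.time() - started, retries)

# e.g. record = run_with_metadata("bookings_ingest", lambda: None)
# In practice, each RunRecord would be written to a central metadata store.
```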
While dataset and data pipeline monitoring are usually separated into two different activities, it’s essential to keep them coupled to achieve a solid foundation of observability. These two states are highly interconnected and dependent on each other. Siloing out these two activities into different tools or teams makes it more challenging to get a high-level view of your data’s health.
Column-level profiling is the next level of the hierarchy. Once that foundation has been laid, column-level profiling gives you the insights you need to establish new business rules for your organization and enforce existing ones at the column level rather than just the row level. That level of awareness allows you to improve your data quality framework in a very actionable way.
This level of observability allows you to answer questions like: Are summary statistics such as a column’s mean, max, and min within their expected ranges? Are those values trending in an unexpected direction over time?
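A minimal sketch of column-level profiling, assuming a pandas DataFrame; the statistics chosen here (mean, min, max, null count) are illustrative, not exhaustive:

```python
# Minimal sketch of column-level profiling: summary statistics per numeric column.
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary statistics: mean, min, max, and null count."""
    numeric = df.select_dtypes(include="number")
    profile = numeric.agg(["mean", "min", "max"]).T
    profile["null_count"] = numeric.isna().sum()
    return profile

# Comparing today's profile against yesterday's (or a rolling baseline)
# is what turns these statistics into column-level observability.
```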
From here, you can move up to the final level of observability: row-level validation. This looks at the values in each row and validates that they are accurate.
This type of observability looks at the individual values within each row and checks them against your business rules.
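For example, a minimal sketch of row-level checks against two hypothetical business rules (the column names and the rules themselves are placeholders):

```python
# Minimal sketch of row-level validation against business rules.
import pandas as pd

VALID_CURRENCIES = {"EUR", "GBP", "USD"}  # hypothetical business rule

def invalid_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Rows that break either rule: an unknown currency code, or a
    departure that is not strictly before the arrival."""
    bad_currency = ~df["currency"].isin(VALID_CURRENCIES)
    bad_times = ~(df["departure_at"] < df["arrival_at"])
    return df[bad_currency | bad_times]
```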
Many organizations get tunnel vision on row-level validation, but that’s missing the forest for the trees. By building your observability framework starting with operational and dataset monitoring, you get big-picture context on your data platform’s health while still homing in on the root cause of issues and their impact downstream.
Let’s bring this full circle: data observability is a collection of activities and technologies that help you understand the health and the state of data within your system. Data observability is a byproduct of the DataOps movement, and it has been the missing piece for making agile, iterative improvements to your data products possible.
We’ve learned that data observability isn’t a silver bullet, and neither is DataOps. Technology alone will not solve your problems. You can have the best monitoring dashboards that report on all of your metadata, equipped with the most powerful automation and algorithms; without organizational adoption, they’re only good for the pipelines you own. The reverse is also true: everyone can be bought into DataOps as a practice, but without the technology to support it, it’s just a nicely documented philosophy that doesn’t impact output.
So, how do we actually implement a data observability framework that can improve our end-to-end data quality? What metrics should we be tracking at each stage?
Here are the ingredients to a high-functioning data observability framework:
Before you can even think about producing a high-value data product, you need mass adoption of the DataOps Culture. You need everyone bought into this, but you especially need leadership bought in. They are the ones that dictate the systems and processes for development, maintenance, and feedback. As powerful as a bottom-up movement can be, you need budget approvals to make the technological changes needed to support a DataOps system.
Once everyone is bought into the idea of being efficient, leadership can move the organization toward a standardized data platform. What do we mean by that? To get end-to-end ownership and accountability across all teams, you need infrastructure in place that allows teams to speak the same language and openly communicate about issues. That means standardized libraries for API and data management (e.g., querying the data warehouse, reading from and writing to the data lake, pulling data from APIs). You need a standardized library for data quality. You need source code tracking, data versioning, and CI/CD processes.
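As one illustration, a standardized data-access library can be little more than a shared interface that every team implements and imports instead of hand-rolling its own clients. This is a sketch under that assumption; the interface, backend, and logging setup are hypothetical:

```python
# Minimal sketch of a standardized data-access interface shared by all
# teams, so access patterns and logging stay consistent.
from typing import Iterable, Protocol
import logging

logger = logging.getLogger("data_platform")

class DataSource(Protocol):
    def read(self, query: str) -> Iterable[dict]: ...

class WarehouseSource:
    """One shared implementation per backend, instead of per-team clients."""
    def read(self, query: str) -> Iterable[dict]:
        logger.info("warehouse query: %s", query)  # uniform, centralized logging
        return []                                  # placeholder for the real driver call

def load(source: DataSource, query: str) -> list[dict]:
    return list(source.read(query))
```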
With that, your infrastructure is set up for success. Now you need a unified observability platform that gives your entire organization open access to your system health. This observability platform would act as a centralized metadata repository. It would encompass all the features listed earlier (like monitoring, alerting, tracking, comparison, and analysis) so data teams could get an end-to-end view of how the sections of the platform they own are affecting other sections.
Culture? Check.
Standardized Data Platform? Check.
Unified Data Observability Platform? Check.
You have all the moving pieces in place, but what should you be tracking? Let’s refer back to the Hierarchy of Data Observability.
For simplicity’s sake, we’ve distilled this into the summary below:
For operational health, you should be collecting execution metadata. This includes metadata on pipeline states, duration, delays, retries, and times between subsequent runs.
For dataset monitoring, you should look at your dataset’s completeness, availability, volume of data in and out, and schema changes.
For column-level profiling, you should collect summary statistics on columns and use anomaly detection to alert on changes. You’d be looking at trends in the mean, max, and min within columns.
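A minimal sketch of one such anomaly check: flag today’s column mean if it falls more than three standard deviations from its recent history. The history values and threshold are hypothetical:

```python
# Minimal sketch of anomaly detection on a column statistic tracked over time.
import statistics

def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag `latest` if it sits more than `z_threshold` standard deviations
    from the mean of the historical values."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

daily_mean_amount = [41.8, 42.5, 40.9, 43.1, 42.2]  # hypothetical history of a column's daily mean
print(is_anomalous(daily_mean_amount, 97.4))        # True: today's mean looks off
```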
For row-level validation, you’d ensure the previous checks didn’t fail at the row level and that your business rules are being met. This is very contextual, so you’ll have to use your discretion.
Data observability is the backbone of any data team’s ability to be agile and iterate on their products. Without it, a team cannot rely on its infrastructure or tools because errors can’t be tracked down quickly enough. This leads to less agility in building new features and improvements for your customers — which means you’re essentially throwing away money by not investing in this key piece of the DataOps framework! If you want to learn more about how our platform delivers complete visibility into all aspects of your system, get in touch with us today!