Data observability: Everything you need to know
Data observability is a turning point for data operations and the industry as a whole. The data quality frameworks and data governance strategies that were once nice-to-have philosophies are now actionable with advances in the data observability category.
This web page will serve as your go-to resource for everything you need to know about data observability. You’ll learn about why data observability was created, what it means, and the different types of observability. More importantly, you can learn about a framework for data observability that you can implement in your organization and some tools that can help.
The problems of modern data infrastructure
Modern data systems provide a wide variety of functionality, allowing users to store and query their data in many different ways. The more functionality you add, the more complicated it becomes to ensure that your system works correctly. This complication seems to build on itself, starting with…
More external data sources
In the past, data infrastructure was built to handle small amounts of data (usually operational data from a few internal data sources), and the data was not expected to change very much. Now, many data products rely on data from internal and external sources, and the sheer volume and velocity at which this data is collected can cause unexpected drift, schema changes, transformations, and delays.
Just imagine this: you’re a Platform Engineering Lead for a company like Omio. Your entire business model depends on being able to reliably ingest data, every day (or more frequently), from hundreds or thousands of large and small transit providers who may or may not have an API.
If any of those providers miss a delivery, make a breaking schema change to better fit their data model, or deliver inaccurate data, it’s on you to fix it. It’s on you to find out which data source is causing the problem, who you need to contact to get an explanation, and how you’re going to fix it before missing your SLA. That’s a nightmare scenario.
More complicated transformations
More and more data is being ingested into organizations from external data sources. That’s a huge problem for data engineers. Why? Because you can’t control your provider’s data model.
Going back to our Omio example: you are ingesting data from hundreds (or thousands) of data sources that may or may not have an API, all with different data models. This means you need to transform, structure, and aggregate all that data in all different formats to make it all usable. Even worse, if those formats change at all, it causes a domino effect of failures downstream as the strictly coded logic fails to adapt to the new schema.
Too much focus on analytics engineering
Complex ingestion pipelines have created compounding headaches across the industry. Players in the industry created a plethora of really interesting tools to simplify this end-to-end process. These managed tools can mostly automate the ingestion and ETL / ELT processes. Combining them together, you get a data platform the analytics industry has dubbed the “modern data stack.” The goal of the MDS is to reduce the amount of time it takes for data to be made usable for end-users (typically analysts) so they can start leveraging that data faster. But does all that automation come at a cost?
Don’t take this the wrong way; the ultimate goal of data engineering is advanced analytics. That said, for data-driven organizations, a one-size-fits-all ETL pipeline isn’t going to cut it. For these organizations, the bottom line of the business depends on the amount of control data engineers have over their data’s quality. The more automation you have, the less control you have over how data is delivered. These organizations need to build out custom data pipelines so they can better guarantee data is delivered as expected.
So while the analytics industry has been busy trying to automate away the data engineer’s job, data engineers have had limited access to tools and frameworks that make their lives easier. That is, until now.
What is data observability?
“Data observability” is the blanket term for understanding the health and the state of data in your system. Essentially, data observability covers an umbrella of activities and technologies that, when combined, allow you to identify, troubleshoot, and resolve data issues in near real-time.
By encompassing a basket of activities, observability is much more useful for engineers. Unlike the data quality frameworks and tools that came out along with the concept of the data warehouse, it doesn’t stop at describing the problem. It provides enough context to enable the engineer to resolve the problem and start conversations to prevent that type of error from occurring again. The way to achieve this is to pull best practices from DevOps and apply them to Data Operations.
All of that to say, data observability is the natural evolution of the data quality movement, and it’s making DataOps as a practice possible. And to best define what data observability means, you need to understand where DataOps stands today and where it’s going.
The DataOps movement
Data operations (DataOps) is a workflow that enables an agile delivery pipeline and feedback loop so that businesses can create and maintain their own products more efficiently. DataOps allows companies to use the same set of tools and strategies throughout all phases of their analytics projects, from prototyping through productionization.
DataOps was created to bridge the gap between data analysts and data engineers. Traditionally, these two groups had different goals and responsibilities; but with the advent of big data analytics, this has changed.
By bringing the two disciplines together (data collection and data utilization), data teams could better tackle the problem: how do we improve the way we manage data throughout the entire organization?
The DataOps cycle
The DataOps cycle outlines the fundamental activities that need to occur to improve the way data is managed within the DataOps workflow. This cycle consists of three distinct stages: Detection, Awareness, and Iteration.
It’s important that this cycle starts with Detection because the DataOps movement is founded on a data quality initiative. How can we ensure that we can trust this data? How can we ensure that the data coming through is actually giving us the information that we need?
And while detection is an important first step, as you’ll see, it’s the only stage that’s been possible up until now.
This first stage of the DataOps cycle is validation-focused. It includes the same data quality checks that have been used since the inception of the data warehouse: column schema and row-level validations. Essentially, you are making sure that your business rules are applied to, and adhered to by, every dataset coming into your system.
This data quality framework that lives in the detection stage is important, but reactionary by its very nature. It’s giving you the ability to know whether the data that’s already stored in your lake or warehouse, and likely already being utilized, is in the form you expect.
Another important note: you are validating that datasets follow the business rules you already know of. But if you don’t have awareness into the causes of issues, you cannot establish new business rules for your engineers to follow. This realization is fueling the demand for “shift-left” awareness of data issues and the development of data observability tools that make this possible.
Awareness is the visibility-focused stage of the DataOps cycle. This is where the conversation around data governance comes into the picture, and a metadata-first approach is introduced. Centralizing and standardizing pipeline and dataset metadata across your organization gives teams visibility into issues that happen anywhere in the organization.
The centralization of metadata is crucial to giving the entire organization awareness into the end-to-end health of their data. By doing this, you move to a more proactive approach to solving data issues. If there is bad data that is entering your “domain,” you are able to trace the error to a certain point in your system upstream. Now data engineering team A can go on to look at data engineering team B’s pipelines and be able to understand what is going on there, and collaborate with them to potentially fix the issue.
The reverse also applies. Data engineering team B can detect an issue and trace what impact it will have downstream. Now, data engineering team A will know that an issue is coming, and they can take whatever measures are necessary to contain it.
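To make the upstream and downstream tracing concrete, here is a minimal sketch of how centralized lineage metadata could be walked in both directions. The dataset names and the `DOWNSTREAM` mapping are hypothetical, and a real metadata store would supply the edges; this only illustrates the idea with a breadth-first search.

```python
from collections import deque

# Hypothetical lineage metadata: each dataset maps to the datasets built from it.
DOWNSTREAM = {
    "provider_feed": ["raw_trips"],
    "raw_trips": ["clean_trips"],
    "clean_trips": ["pricing_report", "route_search_index"],
    "pricing_report": [],
    "route_search_index": [],
}


def invert(edges):
    """Build the upstream view (dataset -> its direct inputs) from the downstream view."""
    upstream = {node: [] for node in edges}
    for parent, children in edges.items():
        for child in children:
            upstream.setdefault(child, []).append(parent)
    return upstream


def reachable(edges, start):
    """Breadth-first walk returning every dataset reachable from `start`, in visit order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


# Team B asks: what will an issue in raw_trips affect downstream?
impacted = reachable(DOWNSTREAM, "raw_trips")
# Team A asks: where could bad data in pricing_report have come from?
suspects = reachable(invert(DOWNSTREAM), "pricing_report")
```

The same traversal answers both questions; only the direction of the edges changes.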
This is the biggest area that has been lacking in DataOps. Not only is there now a universal language that all teams can point to and discuss with each other, but data teams can also share this information with stakeholders and help them understand what they plan to do, how they intend to support the data that stakeholders need, and how they will rectify any issues that come up.
Iteration is the process-focused stage of the cycle. Here, teams focus on data-as-code. They make sure that they have repeatable and sustainable standards that are applied to all data development, so that they get the same trustworthy data at the end of those pipelines.
The gradual improvement of the data platform’s overall health is now made possible by the detection of issues, awareness into the upstream root causes, and efficient processes for iteration.
Features of data observability
Earlier, we defined data observability as a blanket term for activities and technologies that help you understand the health and the state of data in your system. In this section, we’re going to break down what those activities are and what they accomplish.
From there, you’ll have a proper understanding of what organizational and technological shifts need to occur to implement a data observability framework that enables agile data operations.
Data observability’s makeup
To make data observability useful, it needs to include these activities:
- Monitoring—a dashboard that provides an operational view of your pipeline or system
- Alerting—both for expected events and anomalies
- Tracking—ability to set and track specific events
- Comparisons—monitoring over time, with alerts for anomalies
- Analysis—automated issue detection that adapts to your pipeline and data health
- Logging—a record of an event in a standardized format for faster resolution
- SLA Tracking—the ability to measure data quality and pipeline metadata against pre-defined standards
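As one illustration of the Logging item above, here is a rough sketch of what "a standardized format" could look like in practice: every team emits the same JSON fields, and a small check rejects lines that do not conform. The field names and the `make_event` / `is_standard` helpers are assumptions made for this example, not a published standard.

```python
import json
from datetime import datetime, timezone

# Hypothetical standard: every team emits events with these same keys.
REQUIRED_FIELDS = ("timestamp", "team", "pipeline", "event", "status", "details")


def make_event(team, pipeline, event, status, details=None, now=None):
    """Build one log record in the shared format and return it as a JSON string."""
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.isoformat(),
        "team": team,
        "pipeline": pipeline,
        "event": event,
        "status": status,
        "details": details or {},
    }
    return json.dumps(record, sort_keys=True)


def is_standard(line):
    """Return True if a log line parses as JSON and carries every required field."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and all(f in record for f in REQUIRED_FIELDS)


# Example: a failed run reported by one team, readable by any other team.
line = make_event(
    "team_a", "ingest_trips", "run_finished", "failed",
    details={"retries": 2},
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Because every record shares the same keys, logs from different teams can land in one place and be filtered by pipeline, status, or team without per-team parsing.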
How is this any different from the activities that data teams already do? The difference lies in how these activities fit into the end-to-end data operations workflow and the level of context they provide on data issues.
For most organizations, observability is siloed. Teams collect metadata on the pipelines they own. Different teams are collecting metadata that may not connect to critical downstream or upstream events. More importantly, that metadata isn’t visualized or reported on a dashboard that can be viewed across teams.
There may be standardized logging policies for one team, but not for another, and there’s no way for other teams to easily access them. Some teams may run algorithms on datasets to ensure they are meeting business rules. But the team that builds the pipelines doesn’t have a way to monitor how the data is transforming within that pipeline and whether it will be delivered in a form the consumers expect. The list can go on and on.
Without the ability to standardize and centralize these activities, teams can’t have the level of awareness they need to proactively iterate on their data platform. A downstream data team can’t trace the source of their issues upstream, and an upstream data team can’t improve their processes without visibility into downstream dependencies.
Hierarchy of data observability
Not all data observability is created equal. The level of context you are able to achieve depends on what metadata you are able to collect and provide visibility on. We call this the hierarchy of data observability. Each level acts as a foundation for the next and allows you to attain a finer grain of observability.
Operational health & dataset monitoring
Getting visibility into your operational health and your dataset health is a sound foundation for any data observability framework.
Data at rest
Monitoring dataset health refers to monitoring your dataset as a whole. You are getting awareness into the state of your data while it’s in a static location, which we refer to as data at rest.
This type of monitoring answers questions like:
- Did this dataset arrive on time?
- Is this dataset being updated as frequently as you need it to be?
- Is the expected volume of data available in this dataset?
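The questions above can be sketched as a small check over dataset metadata. This is a minimal example under assumed inputs: `last_updated` and `row_count` would come from your warehouse or lake, and the `max_age` and `min_rows` thresholds are placeholders you would set per dataset.

```python
from datetime import datetime, timedelta, timezone


def check_dataset(last_updated, row_count, now, max_age, min_rows):
    """Return a list of problems found in a dataset's freshness and volume metadata."""
    problems = []
    age = now - last_updated
    if age > max_age:
        problems.append(f"stale: last updated {age} ago, allowed {max_age}")
    if row_count < min_rows:
        problems.append(f"low volume: {row_count} rows, expected at least {min_rows}")
    return problems


# Example: one dataset refreshed an hour ago, one that is 30 hours old and nearly empty.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = check_dataset(now - timedelta(hours=1), 1000, now, timedelta(hours=24), 500)
stale = check_dataset(now - timedelta(hours=30), 10, now, timedelta(hours=24), 500)
```

An empty list means the dataset arrived on time with the expected volume; anything else is a candidate for an alert.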
Data in motion
Operational monitoring refers to monitoring the state of your pipelines. This type of monitoring gives you awareness into the state of your data while it’s transforming and moving through your pipelines. We refer to this data state as data in motion.
This type of monitoring answers questions like:
- How does pipeline performance affect the dataset quality?
- Under what conditions is a run considered successful?
- What operations are transforming the dataset before it reaches the lake or warehouse?
While dataset and data pipeline monitoring are usually separated into two distinct activities, it’s important to keep them coupled together to achieve a solid foundation of observability. These two states are highly interconnected and dependent on each other. Siloing out these two activities into different tools or teams makes it more difficult to get a high-level view of your data’s health.
Column-level profiling is key to this hierarchy. Once a solid foundation has been laid for it, column-level profiling gives you the insights you need to establish new business rules for your organization and enforce existing ones at the column level as opposed to just the row level. That level of awareness allows you to actually improve your data quality framework in a very actionable way.
This level of observability allows you to answer questions like:
- What is the expected range for a column?
- What is the expected schema of this column?
- How unique is this column?
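A column profile that speaks to these three questions might look like the sketch below. It assumes the column is available as a plain Python list of hashable values with `None` for nulls; the `profile_column` name and the returned keys are illustrative only.

```python
def profile_column(values):
    """Summarize one column: null count, observed types, uniqueness, and numeric range."""
    present = [v for v in values if v is not None]
    # Treat ints and floats as numeric, but not booleans.
    numeric = [v for v in present if isinstance(v, (int, float)) and not isinstance(v, bool)]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "types": sorted({type(v).__name__ for v in present}),
        "distinct_ratio": len(set(present)) / len(present) if present else 0.0,
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
    }


# Example: a price column with one duplicate value and one null.
profile = profile_column([10, 20, 20, None])
```

Comparing today’s profile against yesterday’s is what turns these numbers into something you can alert on: a new entry in `types`, a falling `distinct_ratio`, or a `max` outside the range you expect.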
From here, you can move up to the final level of observability: row-level validation. This looks at the values in each row and validates that they are accurate.
This type of observability answers questions like:
- Are the values in each row in the expected form?
- Are the values the exact length you expect them to be?
- Given the context, is there enough information here to be useful to the end-user?
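Here is a hedged sketch of row-level validation along those lines. The `RULES` below are made-up business rules for a transit-schedule row (a three-character station code, an `HH:MM` departure time, a non-negative price); your own rules will differ.

```python
import re

# Hypothetical business rules for a transit-schedule row; adjust to your own data.
RULES = {
    "station_code": lambda v: isinstance(v, str) and len(v) == 3,
    "departure": lambda v: isinstance(v, str) and re.fullmatch(r"\d{2}:\d{2}", v) is not None,
    "price_eur": lambda v: isinstance(v, (int, float)) and v >= 0,
}


def validate_row(row):
    """Return the names of fields that are missing or break their rule."""
    return [name for name, rule in RULES.items() if name not in row or not rule(row[name])]


def validate_rows(rows):
    """Map row index -> failing fields, keeping only rows with at least one failure."""
    failures = {}
    for index, row in enumerate(rows):
        bad = validate_row(row)
        if bad:
            failures[index] = bad
    return failures


rows = [
    {"station_code": "BER", "departure": "08:15", "price_eur": 19.9},
    {"station_code": "BERL", "departure": "8:15", "price_eur": -1},
    {"station_code": "MUC"},  # too little information to be useful
]
failures = validate_rows(rows)
```

Returning the failing field names, rather than a bare pass or fail, gives the engineer enough context to start tracing the problem.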
A lot of organizations get tunnel vision on row-level validation, but that’s really just missing the forest for the trees. By building your observability framework starting with operational and dataset monitoring, you can get big-picture context on your data platform’s health while still being able to home in on the root cause of issues and their impact downstream.
Implementing a data observability framework
Let’s bring this full circle: data observability is a collection of activities and technologies that help you understand the health and the state of data within your system. Data observability is a byproduct of the DataOps movement, and it has been the missing piece for making agile, iterative improvements to your data products possible.
What we’ve learned is that data observability isn’t a silver bullet, and neither is DataOps. Technology alone will not solve your problems. You can have the best monitoring dashboards that report on all of your metadata, equipped with the most powerful automation and algorithms, but without organizational adoption, they’re only good for the pipelines you own. Conversely, everyone can be bought into DataOps as a practice, but if you don’t have the technology to support it, it’s just a nice-to-have documented philosophy that doesn’t impact output.
The framework’s moving pieces
So, how do we actually implement a data observability framework that can improve our end-to-end data quality? What metrics should we be tracking at each stage?
Here are the ingredients to a high-functioning data observability framework:
- DataOps Culture
- Standardized Data Platform
- Unified Data Observability Platform
Before you can even think about producing a high-value data product, you need mass adoption of the DataOps Culture. You need everyone bought into this, but you especially need leadership bought in. They are the ones that dictate the systems and processes for development, maintenance, and feedback. As powerful as a bottom-up movement can be, you need budget approvals to make the technological changes needed to support a DataOps system.
Once everyone is bought into the idea of being efficient, leadership can move the organization towards a standardized data platform. What do we mean by that? In order to get end-to-end ownership and accountability across all teams, you need infrastructure in place that will allow teams to speak the same language and openly communicate about issues. That means you need standardized libraries for API and data management (e.g., querying the data warehouse, reading from and writing to the data lake, and pulling data from APIs). You need a standardized library for data quality. You need source code tracking, data versioning, and CI/CD processes.
With that, your infrastructure is set up for success. Now you need a unified observability platform that gives your entire organization open access to your system health. This observability platform would act as a centralized metadata repository. It would encompass all the features listed earlier (like monitoring, alerting, tracking, comparison, analysis) so data teams could get an end-to-end view of how the sections of the platform they own are affecting other sections.
Example metrics to track
Standardized Data Platform? Check.
Unified Data Observability Platform? Check.
You have all the moving pieces in place, but what should you actually be tracking? Let’s refer back to the Hierarchy of Data Observability. For simplicity’s sake, we’ve distilled this into a graphic:
For operational health, you should be collecting execution metadata. This includes metadata on pipeline states, duration, delays, retries, and times between subsequent runs.
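As a rough sketch of what collecting execution metadata could involve, the example below records one dictionary per run and then summarizes states, retries, durations, and the time between subsequent runs. The `make_run` and `summarize` helpers and their field names are assumptions for illustration; an orchestrator would normally provide this data.

```python
from datetime import datetime, timedelta


def make_run(pipeline, started_at, finished_at, state, retries=0):
    """Capture execution metadata for one run; the field names are illustrative."""
    return {
        "pipeline": pipeline,
        "started_at": started_at,
        "finished_at": finished_at,
        "duration": finished_at - started_at,
        "state": state,  # e.g. "success" or "failed"
        "retries": retries,
    }


def summarize(runs):
    """Aggregate run metadata: state counts, total retries, longest run, longest gap."""
    ordered = sorted(runs, key=lambda r: r["started_at"])
    states = {}
    for run in ordered:
        states[run["state"]] = states.get(run["state"], 0) + 1
    # Time between the starts of subsequent runs.
    gaps = [b["started_at"] - a["started_at"] for a, b in zip(ordered, ordered[1:])]
    return {
        "states": states,
        "total_retries": sum(r["retries"] for r in ordered),
        "longest_run": max((r["duration"] for r in ordered), default=timedelta(0)),
        "longest_gap": max(gaps, default=timedelta(0)),
    }


start = datetime(2024, 1, 1, 6, 0)
runs = [
    make_run("ingest_trips", start, start + timedelta(minutes=10), "success"),
    make_run("ingest_trips", start + timedelta(hours=1),
             start + timedelta(hours=1, minutes=30), "failed", retries=2),
    make_run("ingest_trips", start + timedelta(hours=3),
             start + timedelta(hours=3, minutes=5), "success"),
]
summary = summarize(runs)
```

A longest gap that exceeds the pipeline’s schedule, or a run that takes far longer than usual, is often the first visible sign of a delay that will later show up in the dataset.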
For dataset monitoring, you should be looking at the completeness of your dataset, the availability, the volume of data in and out, and schema changes.
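Of the dataset metrics above, schema changes are the easiest to sketch. The example below compares an expected schema with an observed one, both represented as simple column-to-type dictionaries. That representation and the `diff_schema` name are assumptions; in practice the schemas would be read from your warehouse or file metadata.

```python
def diff_schema(expected, observed):
    """Compare two {column: type} mappings and report added, removed, and retyped columns."""
    added = sorted(set(observed) - set(expected))
    removed = sorted(set(expected) - set(observed))
    retyped = sorted(
        col for col in set(expected) & set(observed) if expected[col] != observed[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Example: a provider renamed one column and changed the type of another.
expected = {"id": "int", "price": "float", "station": "str"}
observed = {"id": "str", "price": "float", "platform": "str"}
changes = diff_schema(expected, observed)
```

If all three lists are empty, the schema is unchanged; otherwise the report tells you exactly which columns to raise with the provider.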
For column-level profiling, you should be collecting summary statistics on columns and using anomaly detection to alert on changes. You’d be looking at mean, max, and min trends within columns.
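Anomaly detection can mean many things. As one simple stand-in, the sketch below flags a new column statistic (say, today’s mean) when it sits more than a chosen number of standard deviations away from its recent history. The z-score approach, the `threshold` of 3.0, and the `is_anomalous` name are assumptions for this example, and production tools typically use more adaptive methods.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it is more than `threshold` standard deviations from the history mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != history[0]
    return abs(latest - mean(history)) / spread > threshold


# Example: daily means of a price column over the last five days.
daily_means = [50.0, 51.0, 49.5, 50.5, 50.2]
normal_day = is_anomalous(daily_means, 50.4)
odd_day = is_anomalous(daily_means, 80.0)
```

The same function can be applied to the max and min trends; only the history you pass in changes.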
For row-level validation, you’d be ensuring that the previous checks also pass at the row level and that your business rules are met. This is very contextual, so you’ll have to use your discretion.
Data observability is the backbone of any data team’s ability to be agile and iterate on their products. Without it, a team cannot rely on its infrastructure or tools because errors can’t be tracked down quickly enough. This leads to less agility in building new features and improvements for your customers — which means you’re essentially throwing away money by not investing in this key piece of the DataOps framework! If you want to learn more about how our platform delivers complete visibility into all aspects of your system, get in touch with us today!