What is data observability? Everything you need to know.

Data observability is a turning point for data operations and the industry. The data quality frameworks and data governance strategies that were once nice-to-have philosophies are now actionable with advances in the data observability category.

What is data observability?

“Data observability” is the blanket term for understanding the health and the state of data in your system. Essentially, data observability covers an umbrella of activities and technologies that, when combined, allow you to identify, troubleshoot, and resolve data issues in near real time.

By encompassing a basket of activities, observability is much more useful for engineers. Unlike the data quality frameworks and tools that came out along with the concept of the data warehouse, it doesn’t stop at describing the problem. It provides enough context to enable the engineer to resolve the problem and start conversations to prevent that type of error from occurring again. The way to achieve this is to pull best practices from DevOps and apply them to Data Operations.

All of that to say, data observability is the natural evolution of the data quality movement, and it’s making DataOps as a practice possible. And to best define what data observability means, you need to know where DataOps stands today and where it’s going.

Evolution of Data Observability

Data observability and the problems of the modern data stack (MDS)

Modern data systems provide a wide variety of functionality, allowing users to store and query their data in many different ways. The more functionality you add, the more complicated it becomes to ensure that your system works correctly. This complication seems to build on itself, starting with…

More external data sources require more data observability

In the past, data infrastructure was built to handle small amounts of data, usually operational data from a few internal data sources, and the data was not expected to change very much. Now, many data products rely on data from both internal and external sources, and the sheer volume and velocity at which this data is collected can cause unexpected drift, schema changes, transformations, and delays.

Imagine this: you’re a Platform Engineering Lead for a company like Omio. Your entire business model depends on reliably ingesting data, every day (or more frequently), from hundreds or thousands of large and small transit providers who may or may not have an API.

If any of those providers miss a delivery, make breaking schema changes to better fit their data model, or deliver inaccurate data, it’s on you to fix it. It’s on you to find out which data source is causing the problem, who you need to contact to get an explanation, and how you will fix it before missing your SLA. That’s a nightmare scenario.
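To make the problem concrete, here’s a minimal sketch in Python of a delivery-freshness check that flags providers who have missed their expected delivery window. The provider names and the 24-hour threshold are illustrative, not part of any real system:

```python
from datetime import datetime, timedelta

def find_late_providers(last_delivery, now, max_age=timedelta(hours=24)):
    """Return providers whose most recent delivery is older than max_age."""
    return sorted(
        provider for provider, delivered_at in last_delivery.items()
        if now - delivered_at > max_age
    )

# Hypothetical providers and delivery timestamps.
deliveries = {
    "rail_provider_a": datetime(2023, 5, 2, 8, 0),   # 4 hours old: fine
    "bus_provider_b": datetime(2023, 5, 1, 6, 0),    # 30 hours old: stale
}
late = find_late_providers(deliveries, now=datetime(2023, 5, 2, 12, 0))
# late == ["bus_provider_b"]
```

A real observability tool would run a check like this continuously and route the result to alerting, rather than returning a list on demand.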

More complicated transformations

More and more data is ingested into organizations from external data sources. That’s a massive problem for data engineers. 

Why? Because you can’t control your provider’s data model.

Returning to our Omio example: you are ingesting data from hundreds (or thousands) of data sources that may or may not have an API, all with different data models. This means you need to transform, structure, and aggregate all that data, in all those different formats, to make it usable. Even worse, if any of those formats change, it causes a domino effect of failures downstream as strictly coded logic fails to adapt to the new schema.
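A defensive first step is to compare each incoming record against the schema you expect before transforming it. Here’s a rough sketch; the field names and expected types are made up for illustration:

```python
# Hypothetical expected schema for a transit-provider feed.
EXPECTED_SCHEMA = {"trip_id": str, "price": float, "departure": str}

def schema_drift(record, expected=EXPECTED_SCHEMA):
    """Return field-level differences between a record and the expected schema."""
    missing = set(expected) - set(record)
    extra = set(record) - set(expected)
    wrong_type = {
        field for field in set(expected) & set(record)
        if not isinstance(record[field], expected[field])
    }
    return {"missing": missing, "extra": extra, "wrong_type": wrong_type}

# A provider renamed "price" to "fare" and sent departure as a timestamp.
drift = schema_drift({"trip_id": "T1", "fare": 19.99, "departure": 1683014400})
# drift flags "price" as missing, "fare" as extra, "departure" as the wrong type
```

Catching drift at the edge like this turns a silent downstream failure into an explicit, attributable event.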

Too much focus on analytics engineering

Complex ingestion pipelines have created compounding headaches across the industry. In response, vendors have built many interesting tools to simplify this end-to-end process. These managed tools can mostly automate the ingestion and ETL/ELT processes. Combine them and you get a data platform the analytics industry has dubbed the “modern data stack” (MDS). The goal of the MDS is to reduce the time it takes for data to be made usable for end users (typically analysts) so they can start leveraging that data faster. But does all that automation come at a cost?

The more automation you have, the less control you have over how data is delivered. Organizations that need stronger guarantees end up building custom data pipelines so they can better ensure data is delivered as expected.

Data observability and the DataOps movement

Data operations (DataOps) is a workflow that enables an agile delivery pipeline and feedback loop so that businesses can create and maintain their products more efficiently. DataOps allows companies to use the same tools and strategies throughout all phases of their analytics projects, from prototyping to product deployment.

The DataOps cycle

The DataOps cycle outlines the fundamental activities needed to improve how data is managed within the DataOps workflow. This cycle consists of three distinct stages: Detection, Awareness, and Iteration.

It’s important that this cycle starts with Detection because the bedrock of the DataOps movement is founded on a data quality initiative. 

Detection

This first stage of the DataOps cycle is validation-focused. It includes the same data quality checks that have been used since the inception of the data warehouse: column schema checks and row-level validations. Essentially, you are ensuring that business rules are applied to, and adhered to by, every dataset in your system.

This data quality framework, which lives in the detection stage, is important but reactionary by its very nature. It tells you whether data that’s already stored in your lake or warehouse, and likely already being used, is in the form you expect.

Another important note: you are validating datasets against business rules you already know. But if you don’t know the causes of issues, you cannot establish new business rules for your engineers to follow. This realization fuels the demand for “shift-left” awareness of data issues and for the data observability tools that make it possible.


Awareness

Awareness is the visibility-focused stage of the DataOps cycle. This is where the conversation around data governance comes into the picture and a metadata-first approach is introduced. Centralizing and standardizing pipeline and dataset metadata across your organization gives teams visibility into issues across the entire organization.

The centralization of metadata is crucial to giving the entire organization awareness into the end-to-end health of its data. Doing this allows you to move to a more proactive approach to solving data issues. If bad data is entering your “domain,” you can trace the error to a certain point upstream in your system. Now data engineering team A can look at data engineering team B’s pipelines, understand what is going on there, and potentially collaborate with them to fix the issue.

The reverse also applies. Data engineering team B can detect an issue and trace the impact it will have downstream. Now data engineering team A will know that an issue is coming, and they can take whatever measures are necessary to contain it.
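The upstream/downstream tracing described above boils down to walking a lineage graph. Here’s a toy sketch in Python, with invented dataset names, that finds everything affected by a failing upstream dataset:

```python
from collections import deque

# Toy lineage graph: each dataset maps to the datasets built from it.
# The dataset names are purely illustrative.
DOWNSTREAM = {
    "raw_bookings": ["clean_bookings"],
    "clean_bookings": ["daily_revenue", "trip_search_index"],
    "daily_revenue": ["finance_dashboard"],
}

def impacted_datasets(source):
    """Breadth-first walk of everything downstream of a failing dataset."""
    seen, queue = set(), deque(DOWNSTREAM.get(source, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(DOWNSTREAM.get(node, []))
    return sorted(seen)

impacted = impacted_datasets("raw_bookings")
# every dataset that depends, directly or transitively, on raw_bookings
```

Run in the other direction (a graph of upstream edges), the same traversal answers the root-cause question instead of the impact question.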


Iteration

Here, teams focus on data-as-code. This stage of the cycle is process-focused. Teams ensure that they have repeatable, sustainable standards applied to all data development, so that the same trustworthy data comes out at the end of those pipelines.

The gradual improvement of the data platform’s overall health is now made possible by the detection of issues, awareness of the upstream root causes, and efficient processes for iteration.

Data observability vs. data quality

Data observability supports data quality, but the two are different aspects of managing data. While data observability practices can point out quality problems in data sets, they can’t on their own guarantee good data quality—that requires efforts to fix data issues and to prevent them from occurring in the first place. On the other hand, an organization can have strong data quality even if it doesn’t implement a data observability initiative.

Data quality measures whether the condition of data sets is good enough for their intended uses in operational and analytics applications. To make that determination, data is examined based on various dimensions of quality, such as accuracy, completeness, consistency, validity, reliability, and timeliness.

Data observability vs. data governance

Data observability and data governance are complementary processes that support each other.

Data governance aims to ensure that an organization’s data is available, usable, consistent, and secure and that it’s used properly, in compliance with internal standards and policies. Governance programs often incorporate or are closely tied to data quality improvement efforts. 

A strong data governance program helps eliminate the data silos, data integration problems, and poor data quality that can limit the value of data observability practices. 

Data observability can aid the governance program by monitoring changes in data quality, availability, and lineage.

Features of data observability

To make data observability useful, it needs to include these activities:

  • Monitoring — a dashboard that provides an operational view of your pipeline or system
  • Alerting — both for expected events and anomalies
  • Tracking — the ability to set and track specific events
  • Comparisons — monitoring over time, with alerts for anomalies
  • Analysis — automated issue detection that adapts to your pipeline and data health
  • Logging — a record of an event in a standardized format for faster resolution
  • SLA tracking — the ability to measure data quality and pipeline metadata against pre-defined standards
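To illustrate the SLA-tracking activity in the list above, here’s a minimal sketch that compares one pipeline run’s metadata against pre-defined thresholds. The metric names and limits are illustrative:

```python
def check_slas(run, slas):
    """Compare one pipeline run's metadata against pre-defined SLA thresholds."""
    breaches = []
    if run["duration_s"] > slas["max_duration_s"]:
        breaches.append("duration")      # run took too long
    if run["rows_out"] < slas["min_rows_out"]:
        breaches.append("volume")        # suspiciously little data delivered
    if run["delay_s"] > slas["max_delay_s"]:
        breaches.append("freshness")     # data arrived too late
    return breaches

# Hypothetical run metadata and SLA definition.
run = {"duration_s": 910, "rows_out": 120_000, "delay_s": 30}
breaches = check_slas(run, {"max_duration_s": 600,
                            "min_rows_out": 100_000,
                            "max_delay_s": 300})
# breaches == ["duration"]: the run finished, but slower than the SLA allows
```

In a real platform each breach would feed the alerting and dashboard layers rather than just being returned.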

How is this any different from the activities that data teams already do? The difference lies in how these activities fit into the end-to-end data operations workflow and the level of context they provide on data issues.

For most organizations, observability is siloed. Teams collect metadata on the pipelines they own. Different teams are collecting metadata that may not connect to critical downstream or upstream events. More importantly, that metadata isn’t visualized or reported on a dashboard that can be viewed across teams.

There may be standardized logging policies for one team, but not for another, and there’s no way for other teams to easily access them. Some teams may run algorithms on datasets to ensure they are meeting business rules. But the team that builds the pipelines doesn’t have a way to monitor how the data is transforming within that pipeline and whether it will be delivered in a form the consumers expect. The list can go on and on.

Without the ability to standardize and centralize these activities, teams can’t have the level of awareness they need to proactively iterate their data platform. A downstream data team can’t trace the source of their issues upstream, and an upstream data team can’t improve their processes without visibility into downstream dependencies.

Hierarchy of data observability

Not all data observability is created equal. The level of context you can achieve depends on what metadata you can collect and provide visibility on. We call this the hierarchy of data observability. Each level is a foundation for the next and allows you to attain a finer grain of observability.

Monitoring operational health, data at rest and in motion

Getting visibility into your operational and dataset health is a sound foundation for any data observability framework.

Data at rest

Monitoring dataset health refers to monitoring your dataset as a whole. You are getting awareness into the state of your data while it’s in a fixed location, which we refer to as data at rest.

This type of monitoring answers questions like:

  • Did this dataset arrive on time?
  • Is this dataset being updated as frequently as you need it to be?
  • Is the expected volume of data available in this dataset?

Data in motion

Operational monitoring refers to monitoring the state of your pipelines. This type of monitoring gives you awareness into the state of your data while it’s transforming and moving through your pipelines. We refer to this data state as data in motion.

This type of monitoring answers questions like:

  • How does pipeline performance affect the dataset quality?
  • Under what conditions is a run considered successful?
  • What operations are transforming the dataset before it reaches the lake or warehouse?
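To show what capturing pipeline-run metadata can look like in practice, here’s a small illustrative Python decorator that records the status and duration of each task run to an in-memory log (a stand-in for a real metadata store; the task and function names are invented):

```python
import functools
import time

RUN_LOG = []  # in-memory stand-in for a centralized metadata repository

def observe(task_name):
    """Record status and duration for each run of a pipeline task."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            status = "success"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "failed"
                raise
            finally:
                RUN_LOG.append({
                    "task": task_name,
                    "status": status,
                    "duration_s": time.monotonic() - start,
                })
        return inner
    return wrap

@observe("normalize_provider_feed")
def normalize(rows):
    return [r.strip().lower() for r in rows]

cities = normalize(["  Berlin ", "PARIS"])
# RUN_LOG now holds one entry with the task name, "success", and the duration
```

Because the metadata is captured on every run, including failures, the same log answers both “was this run successful?” and “how is performance trending?”.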

While dataset and data pipeline monitoring are usually separated into two different activities, it’s essential to keep them coupled to achieve a solid foundation of observability. These two states are highly interconnected and dependent on each other. Siloing out these two activities into different tools or teams makes it more challenging to get a high-level view of your data’s health.

Column-level profiling

Column-level profiling is key to this hierarchy. Once the monitoring foundation is in place, column-level profiling gives you the insights you need to establish new business rules for your organization and to enforce existing ones at the column level, not just the row level. That level of awareness lets you improve your data quality framework in a very actionable way.

This level of observability allows you to answer questions like:

  • What is the expected range for a column?
  • What is the expected schema of this column?
  • How unique is this column?
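Here’s a rough sketch of what column-level profiling computes: simple summary statistics from which questions like those above can be answered. The sample values are illustrative:

```python
def profile_column(values):
    """Summary statistics for one column: the inputs to column-level checks."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_rate": 1 - len(non_null) / len(values),   # completeness
        "min": min(non_null),                            # range, lower bound
        "max": max(non_null),                            # range, upper bound
        "mean": sum(non_null) / len(non_null),           # central tendency
        "unique_ratio": len(set(non_null)) / len(non_null),  # uniqueness
    }

# A hypothetical "price" column with one null value.
stats = profile_column([10.0, 12.5, None, 11.0, 12.5])
# e.g. stats["mean"] == 11.5 and stats["unique_ratio"] == 0.75
```

Tracking these statistics over time, rather than inspecting them once, is what turns profiling into observability.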

Row-level validation

From here, you can move up to the final level of observability: row-level validation. This looks at the values in each row and validates that they are accurate.

This type of observability looks at:

  • Are the values in each row in the expected form?
  • Are the values the exact length you expect them to be?
  • Given the context, is there enough information here to be useful to the end-user?
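A row-level validator is usually just a set of business rules applied to each record. Here’s a sketch with invented rules and field names:

```python
import re

def validate_row(row):
    """Return the list of business rules a single row violates (illustrative rules)."""
    errors = []
    # Rule 1: booking IDs look like two letters followed by six digits.
    if not re.fullmatch(r"[A-Z]{2}\d{6}", row.get("booking_id", "")):
        errors.append("booking_id format")
    # Rule 2: prices must be positive numbers.
    if not isinstance(row.get("price"), (int, float)) or row["price"] <= 0:
        errors.append("price must be positive")
    # Rule 3: only currencies we support.
    if row.get("currency") not in {"EUR", "USD", "GBP"}:
        errors.append("unknown currency")
    return errors

good = validate_row({"booking_id": "DE123456", "price": 49.9, "currency": "EUR"})
bad = validate_row({"booking_id": "123", "price": -1, "currency": "XX"})
# good == [], while bad violates all three rules
```

Returning the list of violated rules, rather than a bare pass/fail, is what gives the engineer enough context to act.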

Many organizations get tunnel vision on row-level validation, but that’s really just missing the forest for the trees. By building your observability framework starting with operational and dataset monitoring, you get big-picture context on your data platform’s health while still homing in on the root cause of issues and their impact downstream.

What are data observability tools?

Data observability tools are typically offered as part of DataOps platforms. DataOps platforms assemble several types of data management software into an individual, integrated environment. The platform unifies all the development and operations in data workflows. Data observability software focuses on monitoring the health of the data pipelines and the overall system.

At its core, a good observability tool should have the following capabilities:

  • Collect, review, sample, and process telemetry data across multiple data sources
  • Offer comprehensive monitoring across your network, infrastructure, servers, databases, cloud applications, and storage
  • Serve as a centralized repository to support data retention and fast access to data
  • Provide data visualization

The best observability tools go beyond these capabilities to automate security, governance, and operations practices. They also offer affordable storage solutions so your business can continue to scale as data volumes grow.

To choose the right data observability tool, start by examining your existing IT architecture and finding a tool that integrates with each of your data sources. Look for tools that monitor your data at rest from its current source—without the need to extract it—alongside monitoring your data in motion through its entire lifecycle.

Challenges of data observability

The right data monitoring system can transform how organizations manage and maintain their data. However, implementing data observability can pose challenges for some organizations, depending on their existing IT architecture. Common challenges include:

Integration with the full data ecosystem

Even the best observability tools can fall short without insight into the full data pipeline and all the software, servers, databases, and applications involved. Data observability can’t work in a vacuum, so it’s important to eliminate data silos and integrate as many systems as possible into your data observability software. Some organizations struggle to gain the buy-in necessary to incorporate every system and tool into their observability solution.

Standardization of data sources

Large organizations maintain hundreds or even thousands of data sources, and inevitably, data from these sources will not abide by the same standards. The best observability tools focus on standardizing telemetry data and logging guidelines to effectively correlate information, but standardizing data may still require manual effort.

Storage and retention

Depending on how your data is stored and your organization’s retention policies, some tools may come with prohibitive storage costs that limit scalability.

Implementing a data observability framework

The framework’s moving pieces

So, how do we actually implement a data observability framework that can improve our end-to-end data quality? What metrics should we be tracking at each stage?

Here are the ingredients of a high-functioning data observability framework:

  1. DataOps Culture
  2. Standardized Data Platform
  3. Unified Data Observability Platform

Before you can even think about producing a high-value data product, you need mass adoption of the DataOps Culture. You need everyone bought into this, but you especially need leadership bought in. They are the ones that dictate the systems and processes for development, maintenance, and feedback. As powerful as a bottom-up movement can be, you need budget approvals to make the technological changes needed to support a DataOps system.

Once everyone is bought into the idea of being efficient, leadership can move the organization toward a standardized data platform. What do we mean by that? To get end-to-end ownership and accountability across all teams, you need infrastructure in place that will allow teams to speak the same language and openly communicate about issues. That means you need standardized libraries for API & data management (i.e., querying data warehouse, read/write from the data lake, pulling data from APIs, etc.). You need a standardized library for data quality. You need source code tracking, data versioning, and CI/CD processes.

With that, your infrastructure is set up for success. Now you need a unified observability platform that gives your entire organization open access to your system health. This observability platform would act as a centralized metadata repository. It would encompass all the features listed earlier (like monitoring, alerting, tracking, comparison, and analysis) so data teams could get an end-to-end view of how the sections of the platform they own are affecting other sections.

Example metrics to track

Culture? Check.

Standardized Data Platform? Check.

Unified Data Observability Platform? Check?

You have all the moving pieces in place, but what should you be tracking? Let’s refer back to the Hierarchy of Data Observability. 

For simplicity’s sake, we’ve distilled this into a graphic:

Data Observability Framework

For operational health, you should be collecting execution metadata. This includes metadata on pipeline states, duration, delays, retries, and times between subsequent runs.

For dataset monitoring, you should look at your dataset’s completeness, availability, the volume of data in and out, and schema changes.

For column-level profiling, you should collect summary statistics on columns and use anomaly detection to alert on changes. You’d be looking at mean, max, and min trends within columns.
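One simple way to alert on changes in those column statistics is a z-score rule over the recent history of daily means. A sketch, with illustrative numbers:

```python
import statistics

def is_anomalous(history, latest, threshold=3.0):
    """Flag the latest daily column mean if it deviates more than `threshold`
    standard deviations from the history of previous means (a z-score rule)."""
    mu = statistics.mean(history)
    sigma = statistics.pstdev(history)
    if sigma == 0:
        return latest != mu  # no historical variance: any change is notable
    return abs(latest - mu) / sigma > threshold

# Hypothetical history of daily means for a "price" column.
daily_price_means = [42.1, 41.8, 42.5, 42.0, 41.9, 42.3]
is_anomalous(daily_price_means, 42.2)   # in line with history
is_anomalous(daily_price_means, 97.0)   # sudden jump, likely a data issue
```

Production tools use more robust detectors (seasonality-aware, for example), but the principle is the same: model the expected range from history, then alert on departures from it.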

For row-level validation, you’d ensure that the previous checks and your business rules hold at the row level. This is very contextual, so you’ll have to use your discretion.


Data observability is the backbone of any data team’s ability to be agile and iterate on their products. Without it, a team cannot rely on its infrastructure or tools because errors can’t be tracked down quickly enough. This leads to less agility in building new features and improvements for your customers — which means you’re essentially throwing away money by not investing in this key piece of the DataOps framework! If you want to learn more about how our platform delivers complete visibility into all aspects of your system, get in touch with us today!

Keep up with the Databand community