Data observability is a turning point for data operations and the industry. The data quality frameworks and data governance strategies that were once nice-to-have philosophies are now actionable with advances in the data observability category.
This web page will be your go-to resource for everything you need to know about data observability. You’ll learn about why data observability was created, what it means, and the different types of observability. More importantly, you can learn about a framework for data observability that you can implement in your organization and some tools that can help.
Modern data systems provide a wide variety of functionality, allowing users to store and query their data in many different ways. The more functionality you add, the more complicated it becomes to ensure that your system works correctly. And this complexity compounds on itself, starting with…
In the past, data infrastructure was built to handle small amounts of data, usually operational data from a few internal sources, and the data was not expected to change very much. Now, many data products rely on data from internal and external sources, and the sheer volume and velocity at which this data is collected can cause unexpected drift, schema changes, transformations, and delays.
Imagine this: you’re a Platform Engineering Lead for a company like Omio. Your entire business model depends on reliably ingesting data every day (or more frequently) from hundreds or thousands of transit providers, large and small, who may or may not have an API.
If any of those providers miss a delivery, make a breaking schema change to better fit their data model, or deliver inaccurate data, it’s on you to fix it. It’s on you to find out which data source is causing the problem, who you need to contact for an explanation, and how you will fix it before missing your SLA. That’s a nightmare scenario.
More and more data is ingested into organizations from external data sources. That’s a massive problem for data engineers.
Why? Because you can’t control your provider’s data model.
Returning to our Omio example: you are ingesting data from hundreds (or thousands) of data sources that may or may not have an API, each with a different data model. That means you need to transform, structure, and aggregate data in all of those formats to make it usable. Even worse, if any of those formats change, it causes a domino effect of failures downstream as rigidly coded logic fails to adapt to the new schema.
Complex ingestion pipelines have created compounding headaches across the industry. In response, vendors have built many interesting tools to simplify this end-to-end process. These managed tools can mostly automate the ingestion and ETL/ELT processes. Combine them and you get a data platform the analytics industry has dubbed the “modern data stack” (MDS). The goal of the MDS is to reduce the time it takes for data to be made usable for end users (typically analysts) so they can start leveraging that data faster. But does all that automation come at a cost?
Don’t take this the wrong way; the ultimate goal of data engineering is advanced analytics. That said, for data-driven organizations, a one-size-fits-all ETL pipeline isn’t going to cut it. For these organizations, the bottom line of the business depends on the amount of control data engineers have over their data’s quality. The more automation you have, the less control you have over how data is delivered. These organizations need to build out custom data pipelines so they can better guarantee data is delivered as expected.
So while the analytics industry has been busy trying to automate away the data engineer’s job, data engineers have had limited access to tools and frameworks that make their lives easier. That is, until now.
“Data observability” is the blanket term for understanding the health and the state of data in your system. Essentially, data observability covers an umbrella of activities and technologies that, when combined, allow you to identify, troubleshoot, and resolve data issues in near real-time.
Because it encompasses a basket of activities, data observability is much more useful for engineers. Unlike the data quality frameworks and tools that came out alongside the concept of the data warehouse, it doesn’t stop at describing the problem. It provides enough context for the engineer to resolve the problem and start conversations to prevent that type of error from occurring again. The way to achieve this is to pull best practices from DevOps and apply them to data operations.
All of that to say, data observability is the natural evolution of the data quality movement, and it’s making DataOps as a practice possible. And to best define what data observability means, you need to understand where DataOps stands today and where it’s going.
Data operations (DataOps) is a workflow that enables an agile delivery pipeline and feedback loop so that businesses can create and maintain their products more efficiently. DataOps allows companies to use the same tools and strategies throughout all phases of their analytics projects, from prototyping to product deployment.
DataOps bridges the gap between data analysts and data engineers. Traditionally, these two groups had different goals and responsibilities; but this has changed with the advent of big data analytics.
By bringing the two disciplines together (data collection and data utilization), data teams could better tackle the problem: how do we improve how we manage data throughout the entire organization?
The DataOps cycle outlines the fundamental activities needed to improve how data is managed within the DataOps workflow. This cycle consists of three distinct stages: Detection, Awareness, and Iteration.
It’s important that this cycle starts with Detection, because the bedrock of the DataOps movement is a data quality initiative.
How can we ensure that we can trust this data? How can we ensure that the data coming through gives us the information we need?
And while detection is an essential first step, as you’ll see, it’s the only stage that’s been possible up until now.
The first stage of the DataOps cycle is validation-focused. It includes the same data quality checks that have been used since the inception of the data warehouse: column schema checks and row-level validations. Essentially, you are ensuring that business rules are applied and adhered to across all datasets in your system.
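As a rough illustration, here is a minimal sketch of the kind of schema and row-level checks that live in the detection stage. The dataset, column names, and business rule are hypothetical placeholders, not a prescription for any particular tool:

```python
# Minimal sketch of detection-stage checks: schema validation plus a
# simple row-level business rule. Dataset, columns, and rules are
# hypothetical placeholders.
import pandas as pd

EXPECTED_SCHEMA = {"order_id": "int64", "amount": "float64", "currency": "object"}

def validate_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of schema violations (missing columns or wrong dtypes)."""
    errors = []
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            errors.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected_dtype:
            errors.append(f"{column}: expected {expected_dtype}, got {df[column].dtype}")
    return errors

def validate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that break a simple business rule (non-positive amounts)."""
    return df[df["amount"] <= 0]

orders = pd.DataFrame(
    {"order_id": [1, 2], "amount": [19.99, -5.0], "currency": ["EUR", "EUR"]}
)
print(validate_schema(orders))  # []
print(validate_rows(orders))    # the row with amount == -5.0
```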
This data quality framework that lives in the detection stage is important, but reactionary by its very nature. It tells you whether data that is already stored in your lake or warehouse, and likely already being used, is in the form you expect.
Another important note: you can only validate datasets against business rules you already know. If you don’t know the causes of issues, you cannot establish new business rules for your engineers to follow. This realization fuels the demand for “shift-left” awareness of data issues and the development of data observability tools that make this possible.
Awareness is the visibility-focused stage of the DataOps cycle. This is where the conversation around data governance comes into the picture and a metadata-first approach is introduced. Centralizing and standardizing pipeline and dataset metadata across your organization gives teams visibility into issues within the entire organization.
The centralization of metadata is crucial to giving the entire organization awareness of the end-to-end health of its data. It allows you to move to a more proactive approach to solving data issues. If bad data is entering your “domain,” you can trace the error to a specific point upstream in your system. Data engineering team A can then look at data engineering team B’s pipelines, understand what is going on there, and collaborate with them to fix the issue.
The reverse also applies: data engineering team B can detect an issue and trace the impact it will have downstream. Data engineering team A then knows the issue is coming and can take whatever measures are necessary to contain it.
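To make that concrete, here is a minimal sketch of upstream and downstream tracing over centralized pipeline metadata, assuming a simple in-memory lineage graph. The dataset names and edges are hypothetical:

```python
# Minimal sketch of tracing issues across a lineage graph built from
# centralized metadata. The graph, dataset names, and edges are hypothetical.
from collections import defaultdict, deque

# Edges point from an upstream dataset to the datasets derived from it.
LINEAGE = {
    "providers_raw": ["bookings_staged"],
    "bookings_staged": ["bookings_clean"],
    "bookings_clean": ["revenue_dashboard", "ml_features"],
}

def downstream_of(dataset: str) -> set[str]:
    """Everything that could be impacted if `dataset` has an issue."""
    impacted, queue = set(), deque([dataset])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

def upstream_of(dataset: str) -> set[str]:
    """Candidate root causes for an issue observed in `dataset`."""
    parents = defaultdict(set)
    for parent, children in LINEAGE.items():
        for child in children:
            parents[child].add(parent)
    sources, queue = set(), deque([dataset])
    while queue:
        for parent in parents[queue.popleft()]:
            if parent not in sources:
                sources.add(parent)
                queue.append(parent)
    return sources

print(downstream_of("bookings_staged"))  # impact analysis for team B
print(upstream_of("revenue_dashboard"))  # root-cause candidates for team A
```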
This has been the biggest missing piece in DataOps. Not only is there now a universal language that all teams can point to and discuss with each other, but data teams can share this information with stakeholders and help them understand what the team plans to do, how it will support the data stakeholders need, and how it will rectify any issues that come up.
The final stage, Iteration, is process-focused and treats data-as-code. Teams ensure they have repeatable, sustainable standards applied to all data development, so the same trustworthy data comes out at the end of every pipeline.
The gradual improvement of the data platform’s overall health is now made possible by the detection of issues, awareness of the upstream root causes, and efficient processes for iteration.
Earlier, we defined data observability as a blanket term for the activities and technologies that help you understand the health and state of the data in your system. In this section, we will summarize those activities and what they accomplish.
From there, you’ll understand what organizational and technological shifts need to occur to implement a data observability framework that enables agile data operations.
To be useful, data observability needs to include activities like monitoring, alerting, tracking, comparison, and analysis.
How is this any different from the activities that data teams already do? The difference lies in how these activities fit into the end-to-end data operations workflow and the level of context they provide on data issues.
For most organizations, observability is siloed. Teams collect metadata on the pipelines they own. Different teams are collecting metadata that may not connect to critical downstream or upstream events. More importantly, that metadata isn’t visualized or reported on a dashboard that can be viewed across teams.
There may be standardized logging policies for one team, but not for another, and there’s no way for other teams to easily access them. Some teams may run algorithms on datasets to ensure they are meeting business rules. But the team that builds the pipelines doesn’t have a way to monitor how the data is transforming within that pipeline and whether it will be delivered in a form the consumers expect. The list can go on and on.
Without the ability to standardize and centralize these activities, teams can’t have the level of awareness they need to proactively iterate their data platform. A downstream data team can’t trace the source of their issues upstream, and an upstream data team can’t improve their processes without visibility into downstream dependencies.
Not all data observability is created equal. The level of context you can achieve depends on what metadata you can collect and provide visibility into. We call this the hierarchy of data observability. Each level is a foundation for the next and allows you to attain a finer grain of observability.
Getting visibility into your operational and dataset health is a sound foundation for any data observability framework.
Monitoring dataset health refers to monitoring your dataset as a whole. It gives you awareness of the state of your data while it sits in a fixed location, which we refer to as data at rest.
This type of monitoring answers questions like: Is the dataset complete and available? Did the expected volume of data arrive? Has the schema changed unexpectedly?
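A minimal sketch of what such data-at-rest checks might look like, assuming a table with a timezone-aware ingestion timestamp. The table, columns, and thresholds are hypothetical:

```python
# Minimal sketch of dataset-level (data at rest) health checks:
# freshness, volume, and schema drift. Columns and thresholds are
# hypothetical placeholders.
from datetime import datetime, timedelta, timezone
import pandas as pd

EXPECTED_COLUMNS = {"booking_id", "provider", "ingested_at"}
MIN_ROWS_PER_DAY = 1_000
MAX_STALENESS = timedelta(hours=6)

def dataset_health(df: pd.DataFrame) -> dict:
    """Freshness, volume, and schema checks; assumes `ingested_at`
    holds timezone-aware timestamps."""
    now = datetime.now(timezone.utc)
    return {
        "fresh": (now - df["ingested_at"].max()) <= MAX_STALENESS,
        "volume_ok": len(df) >= MIN_ROWS_PER_DAY,
        "unexpected_columns": set(df.columns) - EXPECTED_COLUMNS,
        "missing_columns": EXPECTED_COLUMNS - set(df.columns),
    }
```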
Operational monitoring refers to monitoring the state of your pipelines. This type of monitoring gives you awareness into the state of your data while it’s transforming and moving through your pipelines. We refer to this data state as data in motion.
This type of monitoring answers questions like: Did the pipeline run succeed? How long did it take, and were there delays? How many retries were needed, and how long has it been since the last run?
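As an illustration, capturing this execution metadata can be as simple as wrapping each pipeline run and recording its state, duration, and retries. This is a minimal sketch; the task, retry policy, and metadata store are hypothetical:

```python
# Minimal sketch of capturing execution metadata for a pipeline run.
import time
from dataclasses import dataclass

@dataclass
class RunRecord:
    pipeline: str
    state: str          # "success" or "failed"
    duration_s: float
    retries: int

def run_with_metadata(pipeline: str, task, max_retries: int = 2) -> RunRecord:
    """Run `task`, retrying on failure, and capture execution metadata."""
    started = time.time()
    retries = 0
    while True:
        try:
            task()
            state = "success"
            break
        except Exception:
            if retries >= max_retries:
                state = "failed"
                break
            retries += 1
    return RunRecord(pipeline, state, time.time() - started, retries)

# e.g. record = run_with_metadata("bookings_ingest", lambda: None)
# In practice, each RunRecord would be written to a central metadata store.
```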
While dataset and data pipeline monitoring are usually separated into two different activities, it’s essential to keep them coupled to achieve a solid foundation of observability. These two states are highly interconnected and dependent on each other. Siloing out these two activities into different tools or teams makes it more challenging to get a high-level view of your data’s health.
Column-level profiling is the next level of the hierarchy. Once that foundation has been laid, column-level profiling gives you the insights you need to establish new business rules for your organization and enforce existing ones at the column level rather than just the row level. That level of awareness allows you to improve your data quality framework in a very actionable way.
This level of observability allows you to answer questions like: Are summary statistics such as a column’s mean, max, and min within their expected ranges? Are those values trending in an unexpected direction over time?
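A minimal sketch of column-level profiling, assuming a pandas DataFrame; the statistics chosen here (mean, min, max, null count) are illustrative, not exhaustive:

```python
# Minimal sketch of column-level profiling: summary statistics per numeric column.
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary statistics: mean, min, max, and null count."""
    numeric = df.select_dtypes(include="number")
    profile = numeric.agg(["mean", "min", "max"]).T
    profile["null_count"] = numeric.isna().sum()
    return profile

# Comparing today's profile against yesterday's (or a rolling baseline)
# is what turns these statistics into column-level observability.
```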
From here, you can move up to the final level of observability: row-level validation. This looks at the values in each row and validates that they are accurate.
This type of observability looks at the individual values within each row and checks them against your business rules.
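For example, a minimal sketch of row-level checks against two hypothetical business rules (the column names and the rules themselves are placeholders):

```python
# Minimal sketch of row-level validation against business rules.
import pandas as pd

VALID_CURRENCIES = {"EUR", "GBP", "USD"}  # hypothetical business rule

def invalid_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Rows that break either rule: an unknown currency code, or a
    departure that is not strictly before the arrival."""
    bad_currency = ~df["currency"].isin(VALID_CURRENCIES)
    bad_times = ~(df["departure_at"] < df["arrival_at"])
    return df[bad_currency | bad_times]
```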
Many organizations get tunnel vision on row-level validation, but that’s missing the forest for the trees. By building your observability framework starting with operational and dataset monitoring, you get big-picture context on your data platform’s health while still homing in on the root cause of issues and their impact downstream.
Let’s bring this full circle: data observability is a collection of activities and technologies that help you understand the health and the state of data within your system. Data observability is a byproduct of the DataOps movement, and it has been the missing piece for making agile, iterative improvements to your data products possible.
We’ve learned that data observability isn’t a silver bullet, and neither is DataOps. Technology alone will not solve your problems. You can have the best monitoring dashboards that report on all of your metadata, equipped with the most powerful automation and algorithms; without organizational adoption, they’re only good for the pipelines you own. The reverse is also true: everyone can be bought into DataOps as a practice, but without the technology to support it, it’s just a nicely documented philosophy that doesn’t impact output.
So, how do we actually implement a data observability framework that can improve our end-to-end data quality? What metrics should we be tracking at each stage?
Here are the ingredients to a high-functioning data observability framework:
Before you can even think about producing a high-value data product, you need mass adoption of the DataOps Culture. You need everyone bought into this, but you especially need leadership bought in. They are the ones that dictate the systems and processes for development, maintenance, and feedback. As powerful as a bottom-up movement can be, you need budget approvals to make the technological changes needed to support a DataOps system.
Once everyone is bought into the idea of being efficient, leadership can move the organization toward a standardized data platform. What do we mean by that? To get end-to-end ownership and accountability across all teams, you need infrastructure in place that allows teams to speak the same language and openly communicate about issues. That means standardized libraries for API and data management (e.g., querying the data warehouse, reading from and writing to the data lake, pulling data from APIs). You need a standardized library for data quality. You need source code tracking, data versioning, and CI/CD processes.
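As one illustration, a standardized data-access library can be little more than a shared interface that every team implements and imports instead of hand-rolling its own clients. This is a sketch under that assumption; the interface, backend, and logging setup are hypothetical:

```python
# Minimal sketch of a standardized data-access interface shared by all
# teams, so access patterns and logging stay consistent.
from typing import Iterable, Protocol
import logging

logger = logging.getLogger("data_platform")

class DataSource(Protocol):
    def read(self, query: str) -> Iterable[dict]: ...

class WarehouseSource:
    """One shared implementation per backend, instead of per-team clients."""
    def read(self, query: str) -> Iterable[dict]:
        logger.info("warehouse query: %s", query)  # uniform, centralized logging
        return []                                  # placeholder for the real driver call

def load(source: DataSource, query: str) -> list[dict]:
    return list(source.read(query))
```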
With that, your infrastructure is set up for success. Now you need a unified observability platform that gives your entire organization open access to your system health. This observability platform would act as a centralized metadata repository. It would encompass all the features listed earlier (like monitoring, alerting, tracking, comparison, and analysis) so data teams could get an end-to-end view of how the sections of the platform they own are affecting other sections.
Culture? Check.
Standardized Data Platform? Check.
Unified Data Observability Platform? Check.
You have all the moving pieces in place, but what should you be tracking? Let’s refer back to the Hierarchy of Data Observability.
For simplicity’s sake, we’ve distilled this into the summary below:
For operational health, you should be collecting execution metadata. This includes metadata on pipeline states, duration, delays, retries, and times between subsequent runs.
For dataset monitoring, you should look at your dataset’s completeness, availability, volume of data in and out, and schema changes.
For column-level profiling, you should collect summary statistics on columns and use anomaly detection to alert on changes. You’d be looking at trends in the mean, max, and min within columns.
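A minimal sketch of one such anomaly check: flag today’s column mean if it falls more than three standard deviations from its recent history. The history values and threshold are hypothetical:

```python
# Minimal sketch of anomaly detection on a column statistic tracked over time.
import statistics

def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag `latest` if it sits more than `z_threshold` standard deviations
    from the mean of the historical values."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

daily_mean_amount = [41.8, 42.5, 40.9, 43.1, 42.2]  # hypothetical history of a column's daily mean
print(is_anomalous(daily_mean_amount, 97.4))        # True: today's mean looks off
```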
For row-level validation, you’d ensure the previous checks didn’t fail at the row level and that your business rules are being met. This is very contextual, so you’ll have to use your discretion.
Data observability is the backbone of any data team’s ability to be agile and iterate on their products. Without it, a team cannot rely on its infrastructure or tools because errors can’t be tracked down quickly enough. This leads to less agility in building new features and improvements for your customers — which means you’re essentially throwing away money by not investing in this key piece of the DataOps framework! If you want to learn more about how our platform delivers complete visibility into all aspects of your system, get in touch with us today!