Monitoring for data quality issues as early as ingestion: here’s why

2022-01-24 15:17:13

Maintaining data quality is challenging. Data is often unreliable and inconsistent, especially when it flows from multiple data sources. To deal with quality issues and prevent them from impacting your product and decision-making, you need to monitor your data flows. Monitoring helps identify schema changes, discover missing data, catch unusual levels of null records, fix failed pipelines, and more. In this blog post, we will explain why we recommend monitoring data starting at the source (ingestion), list three potential use cases for ingestion monitoring, and finish off with three best practices for data engineers to get started.

This blog post is based on the first episode of our podcast, “Why Data Quality Begins at the Source.”

Where Should You Monitor Data Quality?

Data quality monitoring is a fairly new practice, and different tools offer different monitoring capabilities across the data pipeline. Some tools monitor data quality only where the data rests, i.e., at the data warehouse. At Databand, we think it’s critical to monitor data quality as early as ingestion, not just at the warehouse level. Let’s look at a few reasons why.

4 Reasons to Monitor Data Quality Early in the Data Pipeline

1. Higher Probability of Identifying Issues

Erroneous or abnormal data affects all the other data and analytics downstream. Once corrupt data has been ingested and flows to the data lake/warehouse, it might already be mixed up with healthy data and used in analyses. This makes it much more difficult to identify the errors and their source, because the dirty data can be “washed out” in the larger volume of data that sits at rest.

In fact, the ability to identify issues depends on an engineer or analyst knowing what the expected data results should be, recognizing a problematic anomaly, and diagnosing that the anomaly is the result of a data problem and not a business change. When the corrupt data is only a small percentage of the entire data lake, this becomes even harder.

Why take the chance of overlooking errors and problems that could impact the product, your users, and decision-making? By monitoring early in the pipeline, you can avoid many issues, because you are monitoring a more targeted sample of data and can therefore create a more precise baseline for spotting when data looks unusual.
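To make the “washed out” effect concrete, here is a minimal sketch with invented numbers. The baseline null rate, the batch size, and the warehouse size are all assumptions chosen for illustration, and `null_rate` is a hypothetical helper, not part of any particular tool.

```python
def null_rate(null_count: int, total_count: int) -> float:
    """Fraction of records whose monitored field is null."""
    if total_count == 0:
        return 0.0
    return null_count / total_count


# Hypothetical numbers, chosen only for illustration.
batch_total = 1_000          # records in one ingested batch from a single source
batch_nulls = 400            # a broken source suddenly sends 40% nulls

warehouse_total = 1_000_000  # records already at rest
warehouse_nulls = 10_000     # 1% nulls, the assumed normal baseline

# Null rate measured on the batch alone, at ingestion.
rate_at_ingestion = null_rate(batch_nulls, batch_total)

# Null rate measured after the batch has been mixed into the warehouse.
rate_in_warehouse = null_rate(
    warehouse_nulls + batch_nulls,
    warehouse_total + batch_total,
)

print(f"null rate at ingestion: {rate_at_ingestion:.2%}")  # 40.00%
print(f"null rate in warehouse: {rate_in_warehouse:.2%}")  # 1.04%
```

Measured at ingestion, the batch is forty times the baseline and hard to miss. Measured at rest, the same bad batch moves the overall rate from 1.00% to about 1.04%, which is easy to mistake for normal variation.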

2. Creating Confidence in the Warehouse

Analysts and additional stakeholders rely on the warehouse to make a wide variety of business and product decisions. Trusting warehouse data is essential for business agility and for making the right decisions. If data in the warehouse is “known” to have issues, stakeholders will not use it or trust it. This means the organization is not leveraging data to the full extent.

If the data warehouse is the heart of the customer-facing product, i.e., the product relies almost entirely on data, then corrupt data could jeopardize the entire product’s adoption in the market.

By assuring the quality of the data before it arrives at the warehouse and the main analytical system, teams can improve confidence in that “trusted layer.”

3. Ability to Fix Issues Faster

By identifying data issues faster, data engineers have more time to react. They can identify causality and lineage, and fix the data or source to prevent any harmful impact that corrupt data could have. Trying to identify and fix full-blown issues in the product or after decision-making is much harder to do.

4. Enabling Data Source Governance

By monitoring and analyzing data at its point of origin, data engineers can identify a malfunctioning source and act to fix it. This provides better governance over sources, both in real time and in the long run.

When Should You Monitor Data Quality from Ingestion?

We recommend monitoring data quality across the whole pipeline, from ingestion to data at rest. However, you need to start somewhere. Here are the top three use cases for prioritizing monitoring at ingestion:

  • Frequent Source Changes – When your business relies on data sources or APIs whose data structure changes frequently, monitor them continuously. For example, a transportation application that pulls location data, user tracking information, etc. from constantly changing APIs.
  • Multiple External Data Sources – When your business’s output depends on analyzing data from dozens or hundreds of sources. For example, a real-estate app that provides listings based on data from offices, municipalities, schools, etc.
  • Data-Driven Products – When your product is based on data and each data source has a direct impact on the product. For example, navigation applications that pull data about roads, weather, transportation, etc.
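For the “Frequent Source Changes” case, a schema check at ingestion can be as simple as comparing each incoming record against the structure you expect. The sketch below is one possible approach, not a description of how any specific tool works. The function names, the field names, and the sample record are all hypothetical.

```python
from typing import Any


def infer_schema(record: dict[str, Any]) -> dict[str, str]:
    """Map each top-level field name to the name of its Python type."""
    return {name: type(value).__name__ for name, value in record.items()}


def diff_schema(expected: dict[str, str], observed: dict[str, str]) -> dict[str, list[str]]:
    """Report fields that are missing, unexpected, or have changed type."""
    shared = set(expected) & set(observed)
    return {
        "missing": sorted(set(expected) - set(observed)),
        "unexpected": sorted(set(observed) - set(expected)),
        "retyped": sorted(f for f in shared if expected[f] != observed[f]),
    }


# Hypothetical expected structure for a vehicle-location API.
expected_schema = {"vehicle_id": "str", "lat": "float", "lon": "float", "ts": "int"}

# Hypothetical record after the upstream API changed without notice:
# "ts" was renamed to "timestamp" and "lon" now arrives as a string.
incoming = {"vehicle_id": "bus-17", "lat": 40.71, "lon": "-74.00", "timestamp": 1643037433}

drift = diff_schema(expected_schema, infer_schema(incoming))
print(drift)
# {'missing': ['ts'], 'unexpected': ['timestamp'], 'retyped': ['lon']}
```

Running a check like this on every batch, before anything is written downstream, turns a silent upstream change into an explicit signal tied to one source.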

Getting Started with Data Quality Monitoring

As mentioned before, data quality monitoring is a relatively new practice. Therefore, it makes sense to implement it gradually. Here are three recommended best practices:

1. Determine Quality Layers

Data changes across the pipeline, and so does its quality. Divide your data pipeline into layers, e.g., the ingestion layer, the transformation layer, and the warehouse layer. Understand that data quality means different things at each of these stages, and prioritize the layers that have the most impact on your business.

2. Monitor Different Quality Depths

When monitoring data, there are different quality aspects to review. Start by reviewing metadata: ensure that the data structure is correct and that all the data arrived. Once the metadata has been verified, move on to the explicit business-related aspects of the data, which depend on domain knowledge.
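The two depths can be sketched as two separate checks, where the business-level check only runs once the metadata-level check has passed. This is a minimal illustration under our own assumptions: the listings feed, its column names, and the rules about price and bedrooms are all hypothetical examples of domain knowledge.

```python
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    depth: str                                   # which depth produced this result
    failures: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.failures


def check_metadata(rows: list[dict], required_columns: set[str], min_rows: int) -> CheckResult:
    """Depth 1: did all the data arrive, and is its structure correct?"""
    result = CheckResult(depth="metadata")
    if len(rows) < min_rows:
        result.failures.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for index, row in enumerate(rows):
        missing = required_columns - set(row)
        if missing:
            result.failures.append(f"row {index} is missing {sorted(missing)}")
    return result


def check_business_rules(rows: list[dict]) -> CheckResult:
    """Depth 2: domain rules (hypothetical ones for a real-estate listings feed)."""
    result = CheckResult(depth="business")
    for index, row in enumerate(rows):
        if row["price"] <= 0:
            result.failures.append(f"row {index} has non-positive price {row['price']}")
        if row["bedrooms"] < 0:
            result.failures.append(f"row {index} has negative bedrooms {row['bedrooms']}")
    return result


def run_checks(rows: list[dict], required_columns: set[str], min_rows: int) -> CheckResult:
    """Verify metadata first; only then apply business rules."""
    metadata = check_metadata(rows, required_columns, min_rows)
    if not metadata.passed:
        return metadata  # business rules are meaningless on structurally broken data
    return check_business_rules(rows)


REQUIRED = {"listing_id", "price", "bedrooms"}
batch = [
    {"listing_id": "a1", "price": 350_000, "bedrooms": 3},
    {"listing_id": "a2", "price": -1, "bedrooms": 2},
]
outcome = run_checks(batch, REQUIRED, min_rows=2)
print(outcome.depth, outcome.passed, outcome.failures)
```

Keeping the depths separate also makes the failure message more useful: a result tagged `metadata` points at the source or the ingestion job, while a result tagged `business` points at the content of the data itself.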

3. See Demos of Different Data Monitoring Tools

Once you’ve mapped out your priorities and pain points, it’s time to find a tool that can automate this process for you. Don’t hesitate to see demos of different tools, including Databand, and ask the hard questions about data quality assurance. To learn more about data quality, listen to the full podcast episode this blog post is based on.

Data Monitoring Advice for When Things Absolutely Must Not Break

2021-03-30 15:01:50

In the past decade, DevOps engineers started noticing that they’d get frantic calls from their CEO if the application ever went down. This is how they knew they had become vital. Nowadays, many DataOps engineers are receiving that same honor—and the PagerDuty alerts to match.

This reliance on DataOps and data monitoring is going to increase. With the rise of analytics, machine learning, and the importance of data to the functioning of all software as a service, data powers the internet. Pipelines power that data. Yet too few engineers are building those pipelines with data monitoring in mind. When things go down, many DataOps teams are left grasping in the dark. Even when things aren’t down, they live in perpetual job-fail anxiety.

In this guide, we explore the vital importance of data monitoring to DataOps, why it becomes such an issue with large-scale or complex pipelines, and share a handful of best practices.

What is data quality monitoring?

Data quality monitoring is the ongoing process of measuring your data’s fitness for use. It does not include taking action to address the issues it finds; that’s beyond the scope of monitoring. Monitoring is simply knowing, in great detail, what’s happening within your data pipelines.

Why is data quality monitoring important?

Monitoring for data quality is important because issues with data will propagate through the pipeline and the negative effects can cascade. If the source data is tainted, everything that follows will be too. Without the right tools, it’s very difficult to identify the source of the corruption and trace any upstream or downstream processes that have been affected.

Data monitoring is only one part of the equation

The terms “monitoring” and “observability” are often used interchangeably, but there’s a distinction: Monitoring is just one piece of observability.

Data monitoring and data observability

Observability is the umbrella term for all the actions around understanding and improving the health of your pipeline, such as tracking, alerting, and recommendations. Yet the monitoring part (and the accuracy of the monitoring) are crucial.

Without the awareness that monitoring provides, you can’t take action to influence data quality. Not in any scientific way, at least. Troubleshooting is tough, and a pipeline without an integrated monitoring tool is a black box: you know what goes in and what comes out, but that’s it. Data monitoring software is what detects the errors or strange transformations and tells you where they’re occurring.

Qualities of an effective data monitoring system:

Chart describing the five qualities of a useful data monitoring system. 1. Granular, 2. Persistent, 3. Automatic, 4. Ubiquitous, 5. Timely.

For a data monitoring system to be useful, it must be:

  • Granular—it must indicate specifically where an issue is occurring, and with what code.
  • Persistent—you must monitor things in a time-series, otherwise you can’t understand where data sets or errors began (lineage).
  • Automatic—the more freedom you have to set thresholds and use machine learning and anomaly detection, the less active attention it requires.
  • Ubiquitous—you can’t measure just one part of the pipeline.
  • Timely—because, what good are late alerts?

What about using an existing APM?

If you’re planning on starting to monitor pipelines and are considering using your existing application performance management (APM) tool, think again. Pipelines are a very different beast and you’re not going to get the granularity of data or the metrics you need to understand all four factors of data health. You will be able to extract duration, uptime, and some logging information, but you’ll be missing all the necessary and actionable information like data schema changes, granular task information, query costs, and other specific metrics.

The challenge with large-scale data pipelines

More complicated transformations, more operators touching the pipelines, and little coordination between operators beget vastly more complex DataOps systems. That’s where we’re at today: too many cooks and no prix fixe menu for what’s allowed and what isn’t.

Among the greatest challenges is how many non-technical participants are now reliant upon data pipelines to do their job. Demands come raining in from the business side from people—executives, analysts, and data scientists—who, through no fault of their own, don’t understand the data pipeline architecture. They don’t know the quirks of how the data is moved and stored. Yet they’re the ones deciding what must ultimately be delivered.

This is a big reason most data science projects fail to make it into production. The teams involved lack a common language and fail to involve the data engineer early on, in the requirements phase, when fixes are still cheap.

It’s a similar story for machine learning pipelines: running and maintaining the model become more difficult when more people are involved, there is no common language, and there are too few inter-group processes.

All this makes a case for data pipelines that are modular, more easily debugged, and well-monitored. Hence, data monitoring software.

Data monitoring best practices

To explain the order of operations you should go through to monitor your data pipeline, we’ve created what we call the data observability pyramid of needs, pictured. It’s your first data monitoring best practice.

The pyramid begins at the bottom with the physical layer (Are the pipelines executing? Did the Spark job run?) and proceeds up into the increasingly theoretical realm. More advanced teams tend to be dealing with the higher-order issues at the top.

Chart describing the pyramid of data monitoring practices. From the top to bottom: 1. Data Access, 2. Data Trends, 3. Data Sanity, 4. Pipeline Latency, 5. Pipeline Execution

Putting best practices into action

To put this pyramid into practice, your data observability system should be checking for these issues in this order:

1. Is data flowing?

2. Is the data arriving in a useful window of time?

3. Is the data complete? Accurate? Fit?

4. How has it been changing over time? (Also called data lineage)

5. Are the people who need the data actually getting it?
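The ordering matters: there is little point in asking about trends if the data never arrived. Here is a minimal sketch of running the five checks bottom-up and stopping at the first layer that fails. The `RunSummary` fields, the thresholds, and the check functions are assumptions made up for this example, and real checks would be considerably richer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class RunSummary:
    """Hypothetical summary of one pipeline run; all field names are assumptions."""
    succeeded: bool
    row_count: int
    expected_by: datetime
    arrived_at: datetime
    null_fraction: float
    recent_row_counts: list[int]
    required_consumers: set[str]
    consumers_with_access: set[str]


def is_flowing(run: RunSummary) -> bool:
    return run.succeeded and run.row_count > 0


def is_timely(run: RunSummary) -> bool:
    return run.arrived_at <= run.expected_by


def is_sane(run: RunSummary, max_null_fraction: float = 0.05) -> bool:
    return run.null_fraction <= max_null_fraction


def is_on_trend(run: RunSummary, tolerance: float = 0.5) -> bool:
    # Compare this run's volume with the average of recent runs.
    if not run.recent_row_counts:
        return True
    baseline = mean(run.recent_row_counts)
    return abs(run.row_count - baseline) <= tolerance * baseline


def is_accessible(run: RunSummary) -> bool:
    return run.required_consumers <= run.consumers_with_access


# Bottom of the pyramid first.
PYRAMID = [
    ("pipeline execution", is_flowing),
    ("pipeline latency", is_timely),
    ("data sanity", is_sane),
    ("data trends", is_on_trend),
    ("data access", is_accessible),
]


def first_failing_layer(run: RunSummary) -> Optional[str]:
    """Return the lowest pyramid layer that fails, or None if all pass."""
    for name, check in PYRAMID:
        if not check(run):
            return name
    return None


deadline = datetime(2022, 1, 24, 6, 0)
late_run = RunSummary(
    succeeded=True,
    row_count=980,
    expected_by=deadline,
    arrived_at=deadline + timedelta(minutes=45),  # 45 minutes late
    null_fraction=0.01,
    recent_row_counts=[1000, 1020, 990],
    required_consumers={"analytics"},
    consumers_with_access={"analytics", "ml"},
)
print(first_failing_layer(late_run))  # pipeline latency
```

Reporting only the lowest failing layer keeps alerts focused on the most fundamental problem, since failures higher up the pyramid are often just symptoms of it.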

To manage all of this automatically, there are of course data monitoring tools.

Advice on data monitoring tools

Like infrastructure as a service in DevOps, monitoring tools are best bought, not built. A lot goes into data monitoring, and a monitoring system that someone else maintains and improves can save a great deal of time and free you to actually manage the pipeline.

Monitoring is most often one feature of a data monitoring service or platform. These data monitoring apps tend to also provide tools for awareness and remediation, such as tracking, alerts, and machine learning for anomaly detection.

Which is the best data manager app?

We’re biased, but for data engineers, Databand is certainly on the list. We built it to provide full observability for data and machine learning pipelines, for all the reasons covered in this article. When your CEO suddenly cares whether the pipeline is up, it pays to monitor it.
