How to analyze dataset performance and schema changes in Databand

“Why did my dataset schema change?” Yeah, we hear this question a lot too.  

Unfortunately, most data engineers don’t realize the schema has changed until someone else downstream tells them. By then, the business impact has already happened. 

Databand helps fix this problem by capturing the metadata from your datasets and then alerting you when dataset operations change unexpectedly.

In this blog, we’re going to show a dataset alert triggered by a schema change and how to analyze the results.

Watch the video to see it in action or continue reading below. 

Analyze dataset health

Databand provides an overview of all your datasets in one location. Here you can see fields like the type (e.g., S3, BigQuery, Snowflake), path, and the number of data issues associated with each dataset. 

Analyze Datasets - Overview

Let’s drill into the dataset named “S3 – Raw Hourly Data” since we have 31 issues and see why it’s problematic.

Dataset overview

Now we’re in an overview of this dataset’s performance. 

Here you can see how many rows were written and read each day, along with the daily operations performed over time. Databand also provides a quick issue summary covering schema changes and failed operations.

Daily Rows Written and Read
Dataset Issue Summary

View historical issue trends

Selecting the issues link brings you to a history screen that shows all the historical issues associated with this dataset.

Historical Operations and Issues

Here you can see every read and write operation associated with the dataset. 

For example, you can see two different operations with issues: one is a read from the service_311_closed_requests pipeline, and the other is a write from the service_311_get_data pipeline.

Read and Write Issues

You can also dive into the schema change details by selecting the schema column. This image shows that “Column added: incident_address” was an unexpected schema change.

Historical Operations and Issues
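
If you're curious what a schema change check looks like conceptually, here's a minimal Python sketch that compares two schema snapshots and reports the differences. It's an illustration only, not Databand's implementation: the diff_schema function, the snapshot format (a dict of column name to type), and the unique_key column are assumptions made up for this example. The borough and incident_address columns come from the dataset in this walkthrough.

```python
def diff_schema(previous, current):
    """Compare two {column_name: type} snapshots and describe what changed."""
    changes = []
    # Columns present now but missing from the earlier snapshot.
    for name in sorted(current.keys() - previous.keys()):
        changes.append(f"Column added: {name}")
    # Columns that existed before but are gone now.
    for name in sorted(previous.keys() - current.keys()):
        changes.append(f"Column removed: {name}")
    # Columns present in both snapshots whose type differs.
    for name in sorted(previous.keys() & current.keys()):
        if previous[name] != current[name]:
            changes.append(
                f"Column type changed: {name} ({previous[name]} -> {current[name]})"
            )
    return changes


# Hypothetical snapshots of the same dataset from two consecutive runs.
previous = {"unique_key": "string", "borough": "string"}
current = {"unique_key": "string", "borough": "string", "incident_address": "string"}

print(diff_schema(previous, current))  # ['Column added: incident_address']
```

An empty list means the schema is unchanged, so any non-empty result is a candidate for an alert.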

To understand the pipeline runs connected to the issues, you can select the run within the origin column.

Pipeline View

View operation performance

By going to the operations tab, you can select read or write operations to understand their performance over time.

Select Operations

In this record count example, you can see that the record count fell outside the expected range. When this happens, Databand fires an alert to notify the data engineer to investigate the issue.

Anomaly Record Count
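
To make the idea of an expected range concrete, here's a rough Python sketch. This post doesn't describe how Databand computes its anomaly range, so the sketch assumes a simple stand-in: the mean of recent record counts plus or minus three standard deviations. The function names and the sample counts are invented for illustration.

```python
from statistics import mean, stdev


def expected_range(history, k=3.0):
    """Return (low, high) bounds as mean +/- k sample standard deviations."""
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread


def is_anomalous(value, history, k=3.0):
    """True when value falls outside the range derived from history."""
    low, high = expected_range(history, k)
    return value < low or value > high


# Made-up record counts from five earlier runs of a write operation.
history = [1000, 1020, 980, 1010, 990]

print(is_anomalous(1005, history))  # False
print(is_anomalous(400, history))   # True
```

A fixed multiplier like k is the simplest possible choice. Data with strong daily or weekly patterns would likely need a baseline that accounts for seasonality.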

On the left side of the screen, you’ll notice every column of the dataset. This allows you to select each column to understand how the operation performed. 

For example, if you select the borough column, you can see that Databand detected an anomaly for the distinct count. We were expecting a min/max of five, but we got six instead.

Anomaly Record Distinct Count
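
The distinct count check above can be sketched in a few lines of Python. Again, this is a hedged illustration rather than how Databand does it: the distinct_count_issue function and the sample borough values are hypothetical. The expected min/max of five and the observed count of six mirror the example in this post.

```python
def distinct_count_issue(values, expected_min, expected_max):
    """Return a description if the distinct count is out of range, else None."""
    observed = len(set(values))
    if expected_min <= observed <= expected_max:
        return None
    return (
        f"distinct count {observed} outside expected range "
        f"[{expected_min}, {expected_max}]"
    )


# Made-up sample of the borough column; "Unspecified" is the unexpected sixth value.
boroughs = [
    "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island",
    "Brooklyn", "Unspecified",
]

print(distinct_count_issue(boroughs, expected_min=5, expected_max=5))
# distinct count 6 outside expected range [5, 5]
```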

Wrapping it up

For more information on how Databand can help you analyze and resolve dataset performance, check out our demo center or book a demo.
