How to analyze dataset performance and schema changes in Databand
“Why did my dataset schema change?” Yeah, we hear this question a lot too.
Unfortunately, most data engineers don’t realize the schema has changed until someone else downstream tells them. By then, the business impact has already happened.
Databand helps fix this problem by capturing the metadata from your datasets and then alerting you when dataset operations change unexpectedly.
In this blog, we’re going to show a dataset alert triggered by a schema change and how to analyze the results.
Watch the video to see it in action or continue reading below.
Analyze dataset health
Databand provides an overview of all your datasets in one location. Here you can see fields like the type (e.g., S3, BigQuery, Snowflake), path, and the number of data issues associated with each dataset.
Let’s drill into the dataset named “S3 – Raw Hourly Data,” which has 31 issues, and see why it’s problematic.
Now we’re in an overview of this dataset’s performance.
Here you can see how many rows were written and read each day, along with the daily operations performed over time. Databand also provides you with a quick issue summary of the schema changes or failed operations.
View historical issue trends
By selecting the issues link, you’re brought into a history screen to see all the historical issues associated with this dataset.
Here you can see every read and write operation associated with the dataset.
For example, you can see two different operations with issues: one is a read operation from the service_311_closed_requests pipeline, and the other is a write operation from the service_311_get_data pipeline.
You can also dive into the schema change details by selecting the schema column. This image shows that “Column added: incident_address” was an unexpected schema change.
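To make the idea of a schema change concrete, here is a minimal sketch in plain Python that compares two schema snapshots and reports the differences. It is not Databand's implementation. The diff_schema function, the unique_key column, and the column types are all hypothetical; only the incident_address and borough column names come from the example in this post.

```python
def diff_schema(previous: dict, current: dict) -> list:
    """Compare two {column_name: type_name} snapshots and describe the changes."""
    changes = []
    # Columns present now but missing from the previous snapshot.
    for name in current:
        if name not in previous:
            changes.append(f"Column added: {name}")
    # Columns present before but missing now.
    for name in previous:
        if name not in current:
            changes.append(f"Column removed: {name}")
    # Columns present in both snapshots whose type differs.
    for name, old_type in previous.items():
        if name in current and current[name] != old_type:
            changes.append(f"Column type changed: {name} ({old_type} -> {current[name]})")
    return changes


# Hypothetical snapshots taken from two consecutive write operations.
previous_schema = {"unique_key": "int", "borough": "str"}
current_schema = {"unique_key": "int", "borough": "str", "incident_address": "str"}

schema_changes = diff_schema(previous_schema, current_schema)
for change in schema_changes:
    print(change)
```

Comparing snapshots by column name keeps the check independent of column order, so a reordered file is not reported as a schema change.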
To understand the pipeline runs connected to the issues, you can select the run within the origin column.
View operation performance
By going to the operations tab, you can select read or write operations to understand their performance over time.
In this record count example, you can see that the record count fell outside the expected range. When this happens, Databand fires an alert to notify the data engineer to investigate the issue.
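As a rough illustration of what an expected range for record counts could look like, the sketch below builds a band from the mean and standard deviation of past runs and flags values outside it. This is an assumed, simplified technique, not the algorithm Databand uses. The function names, the k multiplier, and the history values are all hypothetical.

```python
from statistics import mean, stdev


def expected_range(history, k=3.0):
    """Return (low, high) as mean +/- k sample standard deviations of past counts."""
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread


def is_anomalous(value, history, k=3.0):
    """True when the new record count falls outside the expected range."""
    low, high = expected_range(history, k)
    return value < low or value > high


# Hypothetical record counts from previous runs of the same write operation.
history = [1000, 1020, 980, 1010, 990, 1005, 995]

low, high = expected_range(history)
print(f"expected range: {low:.1f} to {high:.1f}")
print("400 anomalous?", is_anomalous(400, history))
print("1010 anomalous?", is_anomalous(1010, history))
```

A wider multiplier k produces fewer alerts at the cost of missing smaller deviations, so the right value depends on how noisy the dataset normally is.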
On the left side of the screen, you’ll notice every column of the dataset. This allows you to select each column to understand how the operation performed.
For example, if you select the borough column, you’ll see that Databand detected an anomaly in the distinct count. We were expecting a max/min of five distinct values, but we got six instead.
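The borough check above can be sketched as a simple distinct-count comparison. This is only an illustration under stated assumptions: the helper names and the sample borough values are hypothetical, and only the column name and the expected and observed counts (five and six) come from the example in this post.

```python
def distinct_count(rows, column):
    """Count distinct non-null values of one column across a list of row dicts."""
    return len({row[column] for row in rows if row.get(column) is not None})


def check_distinct_count(rows, column, expected_min, expected_max):
    """Return (observed, within_range) for the column's distinct count."""
    observed = distinct_count(rows, column)
    return observed, expected_min <= observed <= expected_max


# Hypothetical sample rows: five expected values plus one unexpected value.
sample_values = [
    "BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND",
    "Unspecified", "BRONX",
]
rows = [{"borough": value} for value in sample_values]

observed, within_range = check_distinct_count(rows, "borough", 5, 5)
if not within_range:
    print(f"Distinct count anomaly on borough: expected 5, got {observed}")
```

A per-column check like this can surface an unexpected category value even when the total record count looks normal.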
Wrapping it up
Implement data quality monitoring
Increase your team’s visibility so they can catch dataset issues sooner.