dbt Alerting and Monitoring with Databand
As organizations rely more heavily on dbt Core and Cloud to drive transformations, dbt alerting and monitoring has become increasingly important.
With interconnected data pipelines becoming more complex and distributed across multiple systems, it can be challenging to track how data and identify dbt issues before they impact the business.
This is where continuous data observability comes into play.
By implementing a data observability strategy, data teams can gain real-time visibility into their data pipelines and identify issues before they significantly impact operations.
In this blog, we’ll show how continuous data observability integrates with dbt Cloud jobs and tests within the context of Apache Airflow.
You can watch a demo of the video below or continue reading the highlights from it.
Setting the stage with Airflow DAG overview
Let’s get started! For our example today, I’m going to pick one of the Airflow DAGs here called service_311_closed_requests.
So, what’s happening in this DAG?
We’re pulling data from a public API provided by the city of New York around any 311 issues. This could be things like potholes or anything needing maintenance related to sanitation. The City of New York tracks those requests and makes those publicly available so it’s a great source for demos like this.
For this DAG that we will be looking at today, here’s what’s going on:
- An upstream DAG already loaded with some staging data into Redshift.
- Once that data is in Redshift, we are going to kick off a dbt Cloud job that will have both a dbt run and dbt tests.
- Databand’s load sensor will await the completion of that dbt Cloud job.
- Databand will track and log the results of dbt runs and tests.
- Finally, we’ll provide a summary of the load.
Databand’s dbt Cloud tracking capabilities automatically detects and alerts on any issues within the Airflow or dbt Cloud jobs. This allows us to send information to your data teams so they can solve these issues quickly.
Example of Databand alerts
The necessary information to solve those issues so quickly can be shown by an example of what some of these alerts might look like. In this case, we’re looking at our pipeline that is upstream.
This is where we’re fetching that API data. We can see that we had a pipeline failure on the February 16th run at 9:00 AM, UTC.
Because of this, some of the info that Databand will provide will bring the error message front and center.
The first thing any data engineer will ask when they see a failure is “What caused the failure? Give me the traceback.”
This alert has been delivered to some alert receiver we’ve set up, so that might be an email address. It could also be a Slack or a Teams channel. It also might be PagerDuty, among other potential receivers.
You can quickly see my error message here. It looks like this is a permission error. This is where it looks like some sort of token for my cloud service provider, meaning it just needs to be refreshed.
It looks like this is a permission error. This is where it looks like some sort of token for my cloud service provider, meaning it just needs to be refreshed.
I can immediately see the issue by coming to this page without going to Airflow and combing through the logs. This is a meaningful error message because I know exactly what needs to happen.
If I was interested in the logs, those are also available within Databand. I can see the error message, but I can see some of the surrounding logs leading up to this error. In addition to this, we will have the pipeline name where this failure occurred.
We’ll have the specific run where this failure happened, I could jump straight from that link over to that run in Databand.
If I expand it over here, I’ll see the same error message as before, but I can also see any parameters that might have been passed to each task within this pipeline.
Integration with dbt alerting and monitoring
Now, let’s discuss what we’re doing with dbt alerting and monitoring specifically.
dbt Core and dbt Cloud are extremely popular for data house transformations and testing on your tables as you load them.
Databand comes into the picture here by providing visibility on what’s happening in dbt and combining it with what you’re already getting in observability for your pipeline.
In this case, we’ve got Airflow triggering some dbt Cloud job. Now we want to pull in the results of that dbt job within the context of the Airflow DAG.
What Databand is provided is a Python function within our Python SDK that can be used to intercept the results of your dbt jobs.
Here’s an example of that code that’s doing the below:
- Call our track dbt run function.
- Pass a job ID to this, and this job ID is just passed from upstream, from an earlier task in Airflow.
- Tell the function our account ID. (We have a secret here that is containing our API key, which is used to authenticate.)
- Use the job ID to determine what job we want to look at.
Tracking dbt jobs
Now that we have that, we can see that there are some sub-tasks within this tracking task.
If I expand the task, you’ll see what those sub-tasks are. Within here, we have the dbt job that has been triggered. This job is called closed_requests_incremental_load, and then we’ve got an ID attached to that, our job ID.
Then if we expand this, we’ll start to get into some specific details of what happened in dbt.
Viewing dbt logs
In this scenario, we have both a dbt run command and a dbt test command.
One of the first things I’ll show here is that Databand can pull all the logs from dbt into your Databand environment. The goal here is that we don’t want you ever to feel like you need to leave your Databand environment to investigate these issues.
If I were to click on either of these tasks over here, whether it’s the run or the test, I’ll expand this section over here where I’ve selected the logs tab.
Now for my dbt run, I can see the logs coming from dbt. It will tell me, what models were kicked off, how long they took, and whether they succeeded or failed.
It is the same for the dbt test command, I can see what those tests were.
It looks like I had six tests. It looks like I had two failures. One of those had 21 records that were violations, and one of those had one record.
If I jump over here to dbt cloud, this is the actual dbt job that is being tracked here.
You’ll notice that I’ve got the command I ran for this step, dbt run –select close_requests, which is the same info in Databand. We’re giving you all this information you see in dbt within your Databand environment.
Accelerating resolution time for data teams
Looking back into the Databand environment, we can discuss how this will add value to roles like data engineers, analytics engineers, or data analysts.
We’ll start by expanding my dbt run here. I’m going to click off my logs tab. When I expand this dbt run sub-task, I will get any models processed as part of this dbt job.
In this case, we only have a single model, but if you were loading whether it’s a few tables, or dozens of tables, or even hundreds of tables at a time, we’re going to pull the information from all those tables under this run command.
We will show the same lineage graph in Databand, “This table becomes an input for these two tables, and these two tables are inputs for these four tables.”
We will do a similar thing with our dbt test over here. Each test kicked off within dbt becomes its own task here in Databand. If I mouse over any of these, I can see what some of these task names are.
I’ve got some unique checks, some nut_null checks, an accepted values test, and another not_null test. I can see, from a glance, that I had two of these failed tests.
Now, I will show how Databand can pull all the SQL that’s happening under the hood in dbt into Databand to show this information in an easy-to-ingest manner.
I’ve clicked on one of my models, and I’ve switched to my code tab. Now, I see the exact code that is part of my dbt project.
If you’ve ever worked with dbt before, you might understand it can be difficult to get from the homepage down to looking at the SQL behind any given model or test.
In Databand, it’s made simple. Click your model within your dbt run command, and go to the code tab. Now, you’ve got all the information.
As a data analyst, I can quickly see info; this is a table rather than a view.
For example, this comes from my stg_hourly_data table. I can see any filters that I have applied, and then I can see the specific columns that I’m calling or selecting. This is one of those cases where I might have some sort of special column, special calculated column, or business KPI.
If I’m looking at this in the warehouse, it might not be clear to a data analyst what is happening behind the scenes for one of these calculations.
But since I’ve got my SQL available here in Databand, I can quickly see, this days_open column and how it is calculated.
Schema change alerting
If I switch to my data interactions tab, you’ll see a similar view here.
We could now see some info about the database type and execution time. I can see the schema that was loaded here.
Anytime we talk about schema and Databand, understand that you can create any schema change alerts you want.
With the schema change alerts, that’s going to give you the ability to see anytime columns are added, see anytime columns are removed, or see anytime that your data types change on any of your columns.
Looking down at the bottom, we’ll see the inputs for this table, as well as the outputs. The output, of course, is the table itself, or the model itself.
The inputs here are any of the tables that are being referenced as part of our SQL appears. This is just our staging table, but if this were a select statement with multiple joins, you would also see all of those here.
Similarly, I will see much of the same info for my tests. If you’ve worked with dbt tests before, you know it’s not always simple to get to the SQL executed under the hood within dbt.
You’re specifying your tests within some YAML file, but it is a column name, along with some test rule that you want to have applied.
With Databand, we can pull the SQL from that test directly into Databand. If I’m a data engineer, analytics engineer, or data analyst, it’s easy for me to see what’s happening under the hood.
As an example, we’ve got this unique test. To confirm this, I can come over here and check my SQLs. We’re selecting some unique fields, counting records, and then flagging cases where some unique field has more than one instance. This test passed, which tells me we didn’t have any cases that matched these conditions, which is great.
On the other hand, we might have a not_null test here, where we’re selecting everything from our table and specifically looking for not_null in this resolution description field. I’m selecting everything from my table again, where my resolution description field is null.
The way that a lot of dbt tests work is that there will be some select statement like this, where you’re getting the records that match that condition, and then on top of that, it’s applying a select count of everything from the results of whatever this query is.
If I come back over here to my metrics tab, and I’m going to go to my dbt tab, I’ve got this specific test selected. Over here, each test is represented as a task name in here. But now I can see we had 21 failures.
Now you can chart the number of failures over time.
dbt test alerts
Now I want to set up a dbt alert so that I know anytime I have test failures.
One way that I would do that is by creating a dbt test alert.
What this looks like is:
- Select one of my pipelines. In case we’re talking about the closed_request pipeline.
- Give it an alert severity, so this is how I determine how high of a priority it is to resolve this.
- Give this dbt alert a name and a description.
Once I’ve saved this, we will get notifications if any of our tests fail every time this pipeline runs, and Databand picks it up.
If I come over to my alerts and filter on dbt tests, I’ll see that I have quite a few examples of failed tests, and I can see each of these as individual incidents. Remember, as I mentioned earlier, these incidents are routed to potentially numerous locations, such as email, Slack, Teams, or PagerDuty.
I’ll pick one of my tests here. I have this not_null test on my closed_day field. A data engineer investigating this alert will see a few things.
What is the pipeline where this test failure occurred? What is the run ID where this occurred?
A data engineer investigating this alert will see a few things.
What is the pipeline where this test failure occurred? What is the run ID where this occurred?
I’ll get some helpful info over here on both this section here on the left, as well as the SQL over here on the right. This is a test that is performed on top of our closed_request table.
The two things I want to see here what is the SQL behind that table? Which, I can see that under my origin section here. This is the same thing we showed on the run page a bit ago. This is all the logic behind the closed_requests table.
Since I’ve also got the logic behind the test, I can see exactly what the details are behind this issue. I can see the SQL of my table, and the SQL of my test itself, and I can start to investigate this.
The next course of action at this point for a data engineer might be to go look at this table within the warehouse. I can check for those records where we have null values for our close_day field. Now I can follow my typical remediation process to resolve these issues faster.
Now you’ve seen the value that can be provided by tracking your Airflow jobs and your dbt jobs as part of those. The main takeaway is that we don’t want you ever to feel like you need to leave your Databand environment to investigate any of these issues.
From this single screen, I can see the logs for each of my tasks in Airflow. I’m able to see the status and duration of each of my Airflow tasks.
Then, I can see the logs, statuses, and durations of each of my tables in my dbt run command, as well as each of my tests within my dbt test command.
Implement dbt alerting and monitoring.
Increase your team’s visibility so they can catch dbt issues sooner.