End-to-end data observability goes beyond your warehouse

Databand
2021-09-24 16:10:16

*Shhh-pop-pop-pop* 

You hear the sound of that dreaded Slack notification. You’ve received an alert for low completeness in a critical table that lives in your warehouse. You’re responsible for your data platform and managing data quality across hundreds of pipelines. It’s up to you to fix this. The next few hours are about to be painful.

Now, you’re jumping between lineage tools, dashboards, and your orchestrator UI to try to figure out which pipelines feed into this table. All the while, the tickets from downstream consumers start to roll in. No pressure, right?

An hour later, you’ve narrowed your search from hundreds of pipelines to tens. Which pipeline is the culprit? Which log will give you the answer you need?

Looks like you only have one way of finding out: manually combing through each individual log to find that needle in the haystack. You prepare yourself for an arduous process.

Finally, after looking through dozens of logs, you find your clue: a Type Change error. Great… one of the most dreaded errors of all, and one that easily slips under the radar of most pipeline monitoring tools & orchestrators.

You still have a long road ahead of you. It’s been hours since the first alert fired, and you don’t know why the schema change happened or how to fix it.

If the data source in this example is internal, you might be asking yourself:

  • Is there an error in the user code?
  • Did someone drop a column by mistake?
  • Was it intentional?

If the data source is external, you open up a whole other can of worms:

  • Did they change the schema purposely?
  • Is this a bug in their user code?
  • Who can I contact to get more information?
  • How long will it take for them to respond and implement a fix?

In either scenario, you will be stuck twiddling your thumbs while you track the right people down, reconfigure your pipelines to adhere to the new logic, and backfill the data. At this point, bad data is already infecting your automated processes, and data SLAs may already have been missed.

This is the reality for most data engineers. You know there’s a problem in your warehouse, but you don’t know much else. Your data observability starts and ends in the warehouse.

It doesn’t have to be that way.

Real end-to-end data observability

Many data teams don’t know whether their pipelines are fetching all the rows they need in the right form & in the right amount of time. For teams that use Databand, that level of visibility is very straightforward.

Databand connects to pipeline orchestrators like Apache Airflow to give your data team this level of context into their system. By listening to your orchestrator, Databand collects metadata from your ingestion pipelines and lets your data team know:

  • whether your pipelines are reading records from your sources
  • how many records are being written into your data lake
  • if the datasets being read & written have the correct schema
  • whether the run duration for this data flow is anomalous
  • which pipelines are reading from your data sources & which pipelines are writing to your lake or warehouse

This level of visibility allows you to see whether data is being fetched from your sources, whether it’s being written correctly to your target location, whether it will arrive in the right form & completeness, and how it will flow downstream to your data warehouse.

That’s true end-to-end data observability, and it is invaluable for data-intensive organizations.

Time to see it in action

How would this work in practice? Let’s say you are a data engineer for a real estate platform. You’re responsible for serving a critical, customer-facing metric called “Comparative Price Analysis.”

This metric is generated from three external data sources and utilized on the product dashboard. Customers can evaluate how far a listing price is above or below its estimated market value with this metric.

Once the data is fetched from the three APIs, it needs to be unified, structured, and delivered to an S3 location by a certain time every day. From there, the data is pulled by other pipelines automatically into the warehouse for the business unit or the product dashboard.

This is important data for your downstream consumers and your paying customers. You need to know right away if there is a problem. What are your blind spots and how can you use Databand to cover them?

Setting the alerts

For external ingestion processes, the blind spots most likely to ruin your day are:

  • Schema changes: Did an API change? Was data ingested in the wrong format?
  • Volume anomalies: Are there missing rows? Duplicate rows?
  • Duration problems: Did the pipeline finish too early? Is it stalling?
  • Status failures: Did a task fail? Did an entire run fail?

Databand makes it very easy to set alerts on your critical pipelines. For this example, they’re called Ingest1, Ingest2, and Ingest3.

Run Status alert

First, let’s set up Run Status alerts. This will give you visibility into whether the whole Run failed.

You go to the pipeline tab and search for those three pipelines. Once you click on your pipelines, you click the “Add alert definition” button, select the Run Status metric, and set the definition to “Failed.” You’ll now know whether one of your critical ingestion points has failed.

Run Duration

Now, let’s cover the next blind spot. What if the Run doesn’t outright fail? What if it gets stuck retrying a task indefinitely? The Run Status alert wouldn’t fire in this scenario and you would miss your SLA.

You can cover that blind spot with a Duration alert. You have two options here: 1) you can manually set a duration range based on the historical run duration of each pipeline, or 2) you can use anomaly detection to dynamically set an alert threshold.

They both have their advantages and disadvantages. In this circumstance, a metric like Run duration can vary greatly over time. You want to know about large deviations from the norm that could affect your uptime, not every degree of variation. So, anomaly detection is probably best here.

Databand’s Anomaly Detection triggers alerts based on the predetermined Lookback and Sensitivity settings. Just like that, you have two of your bases covered.
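To make the idea concrete, here is a minimal sketch of how a lookback window and a sensitivity setting could drive an anomaly alert. It is not Databand's actual algorithm, which this post doesn't describe; it assumes a simple z-score test, and the function name `is_anomalous`, its parameters, and the sample durations are all hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, lookback=10, sensitivity=3.0):
    """Return True if `latest` deviates strongly from recent runs.

    lookback:    how many of the most recent runs form the baseline.
    sensitivity: how many standard deviations from the baseline count
                 as anomalous (lower values alert more readily).
    This is an illustrative z-score check, not a vendor algorithm.
    """
    window = history[-lookback:]
    if len(window) < 2:
        return False  # not enough history to judge
    baseline = mean(window)
    spread = stdev(window)
    if spread == 0:
        # Perfectly stable history: any change at all is a deviation.
        return latest != baseline
    return abs(latest - baseline) / spread > sensitivity


# Hypothetical run durations (seconds) for one ingestion pipeline.
durations = [610, 595, 630, 602, 588, 615, 621, 599, 607, 612]

print(is_anomalous(durations, 618))   # ordinary variation
print(is_anomalous(durations, 2400))  # a run stuck retrying a task
```

A longer lookback smooths out one-off slow runs, while a lower sensitivity catches smaller deviations at the cost of more noise. The same kind of check could be pointed at record counts instead of durations.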

Data volume

What’s another blind spot? Volume. This alert type can help you validate that the pipeline is fetching all the rows in a dataset. You’ll use an Anomaly alert definition here as well, since setting strict thresholds can be a little tricky. You’ll set the alert in the same way, but you’ll choose the Input Volume metric instead.

Data schema

Lastly, you need to know whether a breaking schema change occurs. Normally, this is a complex alert to set. But in Databand, you don’t have to lift a finger.

Databand uses your historical run metadata to identify trends in the dataset schema. Since something like a dataset schema is fairly deterministic, Databand will track the schema based on the implicit shape of the dataset on every run & fire an alert if there is a schema change.

All you have to do is choose the “Schema Change” alert definition. Just like that, Databand has covered your most complicated blind spot!
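If you're curious what tracking "the implicit shape of the dataset" might look like under the hood, here is a rough sketch. It assumes each run's records are available as plain Python dictionaries and compares the inferred schema of two consecutive runs. The helper names `infer_schema` and `diff_schemas`, the column names, and the sample payloads are hypothetical, and this is not how Databand is necessarily implemented.

```python
def infer_schema(rows):
    """Map each column to the type name of its first non-null value."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            if value is not None and column not in schema:
                schema[column] = type(value).__name__
    return schema


def diff_schemas(previous, current):
    """Return (change_kind, column, detail) tuples between two schemas."""
    changes = []
    for column in sorted(previous.keys() - current.keys()):
        changes.append(("column_removed", column, previous[column]))
    for column in sorted(current.keys() - previous.keys()):
        changes.append(("column_added", column, current[column]))
    for column in sorted(previous.keys() & current.keys()):
        if previous[column] != current[column]:
            detail = f"{previous[column]} -> {current[column]}"
            changes.append(("type_change", column, detail))
    return changes


# Hypothetical API payloads from two consecutive runs: the provider
# started sending `price` as a string instead of a number.
yesterday = [{"listing_id": 101, "price": 450000.0, "zip": "94107"}]
today = [{"listing_id": 101, "price": "450000", "zip": "94107"}]

changes = diff_schemas(infer_schema(yesterday), infer_schema(today))
print(changes)  # [('type_change', 'price', 'float -> str')]
```

Any non-empty result would be the trigger for a schema change alert, and the change kind tells you whether you're dealing with a dropped column, a new column, or a Type Change.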

You now have all of your major bases covered for your three critical ingestion pipelines. In the event that one of the conditions is met, an alert will fire to email, Slack, PagerDuty, or whichever alert receiver your organization uses in its operational flow.

An alert fires & diagnosis

You’re going about your day when you suddenly receive a critical severity alert in Slack! Looks like there was a schema change alert from one of your critical ingestion processes. You click the alert in Slack, which brings you right into the affected pipeline within Databand.

This view gives you a high-level context of how the pipeline is performing. Right off the bat, you can see three things: 

  1. There was a big drop in the total records read & written by pipeline Ingest2 for the latest run
  2. You can see a graph of the pipeline’s structure along with Run & Task Durations and Run & Task Status
  3. You can see that the root cause of the schema change was a Type Change

It seems one of your key data providers changed one of the columns in their API without informing you. This caused the ingestion process to break, and it puts your data health & SLA at risk. Good thing you found out about it right away. You ping your team and alert them to start working on a fix.

While the team works on a fix, how can you get ahead of this issue? How can you know which teams depend on this data so you can alert them before bad data starts infecting your business processes? Looking a little closer, you can see that data flowing through this pipeline lands in an S3 table called “Comparative_Price_Analysis_Lake.”

You click on the table to enter the Dataset view.

From this view, you can see the total number of records being written into this table and how many operators have worked with this table. Right away, you see a dip in the Total Records Added and a slight dip in Total Operations (one, to be exact). This confirms that the schema change was breaking and that Ingest2 wrote no data to the table. The dataset is left with only two-thirds of its regular completeness, which confirms your SLA is at risk.

By hovering over the affected period, you can see which pipelines wrote into this table. You can see that the Ingest1 and Ingest3 pipelines completed successfully. You can also see which pipelines typically read data from this table.

The pipelines that read from this table are the downstream pipelines that will be affected by this failure if it isn’t fixed in time. Thanks to good pipeline naming conventions, you can quickly identify which teams you need to alert. If your pipeline names aren’t so straightforward, you can click on a pipeline to see who implemented it.
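The impact analysis above boils down to a graph walk: start at the affected table, find every pipeline that reads it, then follow whatever those pipelines write. Here is a small sketch of that idea, assuming you have a log of which pipelines read and write which datasets. Only Ingest1, Ingest2, Ingest3, and the lake table name come from this example; the other pipeline and dataset names, and the `downstream_pipelines` helper, are made up for illustration.

```python
from collections import defaultdict, deque

# Hypothetical (pipeline, operation, dataset) records from run metadata.
operations = [
    ("Ingest1", "write", "Comparative_Price_Analysis_Lake"),
    ("Ingest2", "write", "Comparative_Price_Analysis_Lake"),
    ("Ingest3", "write", "Comparative_Price_Analysis_Lake"),
    ("product_dashboard_load", "read", "Comparative_Price_Analysis_Lake"),
    ("bi_unit_warehouse_load", "read", "Comparative_Price_Analysis_Lake"),
    ("bi_unit_warehouse_load", "write", "warehouse.price_analysis"),
    ("pricing_report", "read", "warehouse.price_analysis"),
]


def downstream_pipelines(operations, dataset):
    """List pipelines that directly or transitively read `dataset`."""
    readers = defaultdict(set)  # dataset -> pipelines that read it
    writes = defaultdict(set)   # pipeline -> datasets it writes
    for pipeline, op_type, target in operations:
        if op_type == "read":
            readers[target].add(pipeline)
        elif op_type == "write":
            writes[pipeline].add(target)

    affected = []
    seen_datasets = {dataset}
    queue = deque([dataset])
    while queue:  # breadth-first walk over datasets
        current = queue.popleft()
        for pipeline in sorted(readers[current]):
            if pipeline not in affected:
                affected.append(pipeline)
            for written in sorted(writes[pipeline]):
                if written not in seen_datasets:
                    seen_datasets.add(written)
                    queue.append(written)
    return affected


impacted = downstream_pipelines(operations, "Comparative_Price_Analysis_Lake")
print(impacted)
```

Direct readers come first in the result, followed by pipelines further downstream, which is roughly the order in which you'd want to notify the owning teams.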

Better data observability starts at your sources, not your warehouse

Removing the black box from your ingestion layer with Databand totally changed your operation flow for this kind of issue. When your data observability starts at your warehouse, you are forced into a position where you learn about ingestion issues from your downstream consumers, or you get alerts about data quality issues in your warehouse without any context on what caused the failure & the impact on your consumers.

When you have observability from the moment data enters your system, you can:

  • Begin to work with your team to either adapt the pipeline to the new API or find some other solution before SLAs are missed
  • Alert the downstream teams that there will be a problem with data they depend on so they can pause all automatic processes if possible
  • Keep a record of what types of schema changes are breaking your system, and how you can make your system more fault-tolerant

This is what true end-to-end data observability looks like. Want to see how Databand can revolutionize the way you solve data quality problems? Book a demo to see Databand in action!
