Anomaly detection can be a great tool when your data pipelines have strict delivery deadlines. The hard part is implementing a system that actually detects the leading indicators of failure.
Today, data engineers are dealing with a tangled web of inherited architectures, many of which lack up-to-date documentation. Those siloed pipelines make it difficult to track the customized metrics you need to measure data health, identify trends, and set anomaly thresholds.
The result of this siloed observability is messy. Pipelines stall, deliver inaccurate datasets, and leave no way to track where that data will be used. Even worse, downstream consumers aren’t aware there was a problem with their dataset. Without that visibility, it’s difficult to measure the true cost of bad data as it infects your business processes.
We created Databand to help data teams cut the Gordian knot and meet the data SLAs they set. There are a lot of ways Databand can help you do this, but in this article, you’ll see how Databand’s Anomaly Detection can identify pipeline issues that affect your data uptime or fidelity early.
Anomaly Detection for Data Engineers
First things first: what is anomaly detection, and why is it important for data engineers?
Anomaly detection identifies points, events, or values that deviate from a dataset’s normal behavior. This includes the operational performance of pipelines, infrastructural components, and data quality KPIs.
Anomalous data can signal critical incidents happening under the hood, such as an infrastructure failure or a breaking change from an upstream source, and it can also surface opportunities for architectural optimization. While people are great at identifying trends when given the right information, most data organizations don’t have a scalable way to deliver that information and watch for anomalies 24/7.
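To make the idea concrete, here is a minimal, generic sketch of statistical anomaly detection. This is an illustration of the general technique, not Databand’s actual model: a value is flagged when it deviates from recent history by more than a few standard deviations.

```python
from statistics import mean, stdev

def is_anomalous(history, value, z_threshold=3.0):
    """Flag `value` if it deviates from the history's mean
    by more than `z_threshold` standard deviations."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold

# Healthy run durations in minutes (illustrative numbers).
durations = [178, 181, 176, 183, 179, 180, 177]
print(is_anomalous(durations, 182))  # False: within the normal band
print(is_anomalous(durations, 260))  # True: far outside the band
```

Real systems are more sophisticated (seasonality, trend, multiple metrics), but the core question is the same: how far is this value from what history says is normal?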
We saw an opportunity to meet that need by centralizing your pipeline metadata and analyzing it with ML-powered Anomaly Detection.
Databand’s Anomaly Detection can help you highlight the metrics that matter the most for on-time data delivery so your consumers can have greater trust in your data. Anomaly Detection can help data engineers:
- Identify possible pipeline issues earlier
- Improve root cause analysis
- Fix data health issues before it’s too late
- Reduce threats to data architecture
By setting up anomaly detection in Databand, you can get alerts on custom or out-of-the-box metrics like run duration, task duration, input count, and output count when they cross a certain threshold.
Once alerted, you are brought straight to the affected pipeline so you can begin working on a resolution before the consequences of that anomaly reach your consumers.
Not only can you use anomaly detection on an in-progress run (e.g., one held up by a stalling task), but you can also use it on completed runs to retrospectively investigate why data health anomalies occurred. This puts data engineers in a position to proactively observe their end-to-end pipelines.
Though that description sums up the benefits, the feature’s true power shines once you bring it down to earth with a real example.
Protecting analytical integrity with Databand’s Anomaly Detection
Anomaly detection can save organizations money and build trust between internal and external consumers. The best way to highlight this is by showing the feature in action in the real world.
For our example, we’ll be focusing on an agricultural technology organization that helps large-scale farming operations optimize their crop resource consumption. This organization installs sensors in the soil near each crop and on tractor implements to give machinery operators the ability to more efficiently allocate water, herbicide, pesticide, and plant nutrients to the areas of the field that need them, rather than blanketing entire fields with those resources.
The farm operators receive a report every week so they can have a good idea of how much of their inventory will be needed for each area of their field.
Every Friday, an analyst runs this report and delivers it to the farm operators by 12 PM. Since the report takes an hour to run, the analyst kicks it off at 11 AM.
This process starts with the ingestion phase—which is owned by you, a data engineer. For our example, you are responsible for ensuring that complete, accurate, and fresh data is available for the analyst every Friday by 11 AM.
To make sure the data is as accurate and fresh as possible, you need to start moving it from the source to the warehouse early Friday morning. The entire process, from ingestion to being ready for query, typically takes around 3 hours. This means the absolute latest this process can be rerun, if needed, is 8 AM.
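The scheduling arithmetic is easy to sanity-check in code (the date below is arbitrary):

```python
from datetime import datetime, timedelta

# Data must be query-ready by 11 AM so the analyst can run the report.
query_ready_deadline = datetime(2022, 7, 1, 11, 0)
pipeline_duration = timedelta(hours=3)  # ingestion through query-readiness

# Latest possible (re)run start that still meets the deadline.
latest_rerun = query_ready_deadline - pipeline_duration
print(latest_rerun.strftime("%I %p"))  # 08 AM
```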
Setting up Anomaly Detection on Run durations
Due to how critical this data is for the organization’s consumers, you want to create as many safeguards as possible (and rightly so). The first thing you do is schedule the run to start at 4 AM so it can finish by 7 AM, giving you an extra hour for diagnosis and debugging when necessary.
Next, you elect to activate Anomaly Detection on this pipeline due to the regularity in its performance and the dependencies reliant on that regularity.
You already know that run duration is a great indicator of the pipeline’s health. To be sure, you go into the Databand Dashboard and confirm that the trend in run duration hasn’t changed over the past couple of months.
Sure enough, the run takes no more than three hours when it’s healthy.
You activate anomaly detection on the run duration, so you’ll receive an alert when the duration crosses the threshold from “normal” to “anomalous” behavior. In this case, you’ll be alerted if the run hasn’t completed by 7:15 AM, telling you there might be an issue that could cause a missed delivery.
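At its core, the alert condition described here is a deadline watchdog. A sketch in plain Python, where the function and its wiring are hypothetical (Databand handles this for you):

```python
from datetime import time

def duration_alert(run_finished, now, deadline=time(7, 15)):
    """Return an alert message if the run is still unfinished past the deadline.

    `run_finished` is True once the pipeline run completes;
    `now` is the current wall-clock time.
    """
    if not run_finished and now >= deadline:
        return f"Run not completed by {deadline:%H:%M}; delivery may be at risk"
    return None  # either finished, or still within the normal window

print(duration_alert(run_finished=False, now=time(7, 30)))  # alert fires
print(duration_alert(run_finished=True, now=time(7, 30)))   # None
```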
Are there any other blind spots you could be missing? What if there is a problem with the data ingestion and that affects the input size? What if the ingestion performs normally, but data is duplicated or corrupted as it transforms and makes its way to the warehouse? The alert would not trigger, but the report would be inaccurate.
Thankfully, you can use anomaly detection to account for that blind spot as well. Just as before, you go to the Databand Dashboard to get an idea of the input size. Since there are a fixed number of sensors collecting data for a set amount of time, there is some uniformity in the amount of data collected each week.
In the same fashion, you can set up anomaly detection on the ingestion tasks to detect an anomalous input size, and on the transformation and aggregation tasks to detect an anomalous output size.
At this point, you have four safeguards set up:
- The run is scheduled in a way that gives room for debugging
- Anomaly detection on the run time
- Anomaly detection on the ingestion task’s input size
- Anomaly detection on the run’s output size
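Taken together, these safeguards amount to checking each run’s metadata against a normal band per metric. A minimal sketch, where the metric names and bounds are made up for this example (Databand learns thresholds from history rather than hardcoding them):

```python
# Illustrative normal bands, as if derived from the dashboards above.
NORMAL_BANDS = {
    "run_duration_minutes": (150, 195),
    "ingestion_input_rows": (4_800_000, 5_200_000),
    "run_output_rows": (950_000, 1_050_000),
}

def tripped_safeguards(metrics):
    """Return the names of metrics that fall outside their normal band."""
    alerts = []
    for name, (low, high) in NORMAL_BANDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported for this run
        if not (low <= value <= high):
            alerts.append(name)
    return alerts

run = {"run_duration_minutes": 178,
       "ingestion_input_rows": 2_100_000,
       "run_output_rows": 1_000_000}
print(tripped_safeguards(run))  # ['ingestion_input_rows']
```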
This gives you better coverage of data health issues related to runtime, and the chance to fix those issues before bad data reaches downstream consumers.
Now, let’s see how this plays out.
Using Databand’s Anomaly Detection to fix data delivery issues
The next morning, an alert is pushed to you through Slack.
“Anomalous Run Duration on service311_model_evaluation, Data115 may be delivered late!”
Not good. That’s the data you need for the report! If it isn’t delivered on time, the farm operators will get a report based on last week’s data. You need to fix this fast.
You click the alert and Databand brings you directly to the affected pipeline. Right away, you can tell one of the tasks within the run seems to be stalling. Good thing you gave yourself that buffer window for debugging.
You click on the stalling task to view the logs. It turns out the cause of the stall was a cluster provisioning failure. Given the reason for the stall, and assuming the other tasks complete as expected, you should be able to rerun the pipeline now and have the data ready for query by 10:30 AM.
In just a few minutes, Databand’s Anomaly Detection enabled you to:
- Get advance notice of a probable SLA miss
- Identify the proximate cause (task stalling) and root cause (cluster provisioning failure) for the probable SLA miss
- Fix the problem and rerun the pipeline before the delivery deadline
Looks like you saved the day with Databand’s help. Although, as with most great data engineering, your consumers won’t be aware of the awesome work you just did. It’s a thankless job, after all.
Retrospectively improving pipelines with Anomaly Detection
Great engineers don’t stop when they find a solution. They want to know why it happened, and how to prevent it from happening again.
Databand allows engineers to retrospectively investigate anomalies so they can optimize their pipeline performance for future jobs. You can use the dashboarding function of Anomaly Detection to get a good idea of trends in your pipeline’s historical metadata.
For example, what do normal conditions look like for the tasks preceding the problematic one? What do those conditions look like when the task is stalling? Is there any correlation between the two? It’s easy to compare the two scenarios using the anomaly detection view in the Databand Dashboard.
Also, in our example, we have an SLA with this farm stating that data must be at least 95% accurate. Databand might be measuring data accuracy at 96%, but we want to raise that figure to 99%. In the best case, this gives us an edge on the competition and increases the ROI of our data product; in the worst case, it provides some buffer on data fidelity if a sporadic error occurs.
We can achieve this in Databand by comparing trends in data volume over time with our figure of data accuracy. Fluctuations in data volume can mean a lot of different things. In this case, low volume might mean data is missing, while high data volume may indicate duplicate data.
Manually setting alerting thresholds on these would be tedious, time-consuming, and likely inaccurate. That’s because data volume may fluctuate due to a variety of external factors, like crops planted, tractor implements in service, and harvests and seasonality.
Databand will automatically adjust thresholds for anomaly detection based on these historical fluctuations, and make it easier to spot true anomalies in the values.
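One generic way to get that behavior (a sketch of the idea, not Databand’s implementation) is to recompute the “normal” band from a rolling window of recent observations, so the threshold drifts with seasonal fluctuations while a sharp break still stands out:

```python
from collections import deque
from statistics import mean, stdev

class RollingThreshold:
    """Adaptive bounds: mean +/- k standard deviations over a rolling window."""

    def __init__(self, window=12, k=3.0):
        self.history = deque(maxlen=window)
        self.k = k

    def observe(self, value):
        """Record a new observation; return True if it was anomalous."""
        anomalous = self.is_anomalous(value)
        self.history.append(value)
        return anomalous

    def is_anomalous(self, value):
        if len(self.history) < 3:
            return False  # not enough history to set a band yet
        mu = mean(self.history)
        sigma = stdev(self.history) or 1e-9  # guard against zero variance
        return abs(value - mu) > self.k * sigma

detector = RollingThreshold(window=6, k=3.0)
# Weekly data volumes drift upward as more sensors come online
# (illustrative numbers); the band drifts with them.
volumes = [100, 102, 105, 107, 110, 112, 115, 60]
flags = [detector.observe(v) for v in volumes]
print(flags)  # only the final drop to 60 is flagged
```

The gradual climb from 100 to 115 never trips the detector because the band moves with the trend; the sudden drop to 60 does, which is exactly the “true anomaly” behavior you want for metrics with natural seasonality.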
Fix upstream issues before they become downstream problems
Databand’s Anomaly Detection is a powerful tool that can help you catch and fix data uptime and data fidelity issues. With Databand’s Anomaly Detection you can:
- Spot possible data delivery problems early
- Improve root cause analysis
- Expedite remediation
- Identify trends in conditions that affect data quality
- Investigate pipelines’ historical performance
- Discover opportunities to optimize pipeline performance
Are you interested in seeing how Databand’s Anomaly Detection can guarantee your data SLAs? Book a demo to see Databand.ai in action!