Avoid data SLA misses with Databand Dashboard
Business leaders want higher data quality and on-time data delivery. While your organization might not yet have an explicit data SLA, at some level data engineers will be responsible for making sure good data is delivered on time. At Databand, we want to help data teams meet the data SLAs they set for themselves, and create trust in their data products. We consider four main areas as critical to a data SLA:
- Uptime — Is expected data being delivered on time?
- Completeness — Is all the expected data arriving in the right form?
- Fidelity — Is accurate, “true” data being delivered?
- Remediation — How quickly are any of the above data SLA issues detected and resolved?
Databand.ai can help data-driven organizations improve in all of these areas. In this first article of the series, we’ll explore how data pipeline failures affect your uptime and data SLAs.
Data pipeline health isn’t a binary question of job success or failure
Organizations can know they have data health problems, without knowing how those problems actually map to events in their pipelines or attributes of the data itself. This puts organizations in a reactive position in relation to their data SLAs.
This is an observability problem, and it stems from a fractured, incomplete view of data delivery that hides the context behind pipeline performance. If you are only looking at success/failure counts to understand pipeline health, you may miss critical problems that affect your data SLAs (like uptime): for example, a late-running task that causes a missed data delivery and cascades into broader issues.
At Databand, we believe data observability goes deeper than monitoring by adding more context to system metrics, providing a deeper view of system operations, and indicating whether engineers need to step in and apply a fix.
Observability for production data pipelines is hard, and it’s only getting harder. As companies become more data-focused, the data infrastructure they use becomes more sophisticated. This increased complexity has caused pipeline failures to become more common and more expensive.
Data observability within organizations is fractured for a variety of reasons. Pipelines interact with multiple systems and environments, each with its own monitoring in place. On top of that, different data teams in your organization might own different parts of your stack.
Databand Dashboard: a unified solution for guaranteeing data SLAs
We developed Databand Dashboard to help data engineers gain full observability over their data and monitor data quality across its entire journey. It’s easier than ever to find leading indicators and root causes of pipeline failures that can prevent on-time delivery. Whether your data flows pass through Spark, Snowflake, Airflow, Kubernetes, or other tools, you can monitor them all in one place.
- Alerts on performance and efficiency bottlenecks before they affect data delivery
- A unified view of your pipeline health, including logs, errors, and data quality metrics
- Seamless connection to your data stack
- Customizable metrics and dashboards
- Fast root cause analysis to resolve issues when they are found
A single entry point for all pipeline-related issues
A pipeline can fail in multiple ways, and the goal of Databand’s dashboard is to help engineers quickly categorize and prioritize issues so that you meet your data delivery SLAs. Here are some of the issues that might appear:
- Bad data causing pipelines to fail
Common example: a wrong schema or unexpected value prevents the data from being read, resulting in a total task failure
- Failure related to pipeline logic
Common example: a new version of the pipeline has a bug in the task source code, causing failures in the production run
- Resource related issues
Common example: failure to provision an Apache Spark cluster or lack of available memory
- Orchestration system failure or issues related to cluster health
Common example: the scheduler failing to schedule a job
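The categories above can be triaged automatically by matching error messages against known patterns. Here is a minimal sketch of that idea in Python; the category names and log patterns are illustrative assumptions, not Databand’s actual classification rules:

```python
import re

# Hypothetical patterns per failure category; real log formats will differ.
CATEGORIES = {
    "bad_data": [r"schema mismatch", r"cannot parse", r"unexpected value"],
    "pipeline_logic": [r"Traceback", r"NameError", r"AssertionError"],
    "resource": [r"OutOfMemoryError", r"failed to provision", r"executor lost"],
    "orchestration": [r"scheduler .*unreachable", r"failed to schedule"],
}

def categorize_error(message: str) -> str:
    """Map a raw error message to one of the failure categories above."""
    for category, patterns in CATEGORIES.items():
        if any(re.search(p, message, re.IGNORECASE) for p in patterns):
            return category
    return "unknown"
```

A classifier like this is only as good as its pattern list, which is exactly why an observability tool that aggregates and groups errors for you saves so much manual effort.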
Triage failure and highlight what matters most to your data delivery
It’s difficult to prioritize which data pipeline failures to focus on, especially when many things are happening across your entire system at once.
The Dashboard can plot all pipelines and runs together over a configurable time window with statuses and metrics, allowing you to visualize the urgency and dependencies of each issue and tackle them accordingly. After detection, you can dive into specific runs within your pipelines and observe statuses, errors, user metrics, and logs. You’ll see exactly what is causing a failure, whether it’s an application code error, a data expectation problem, or slow performance. This way, your DataOps team can begin remediation as quickly as possible.
Let’s explore an example:
Jumping into our Dashboard, Databand tells us that there was a spike of failed jobs starting around 3:00 a.m. the previous night.
In our aggregate view, we see it’s not one pipeline failing but rather multiple pipelines failing, and a visualization of a spike in failed jobs at specific points in time makes this clear. This is a sign of a system failure, and we need to analyze the errors happening at this point in time to get to a root cause.
This is a big deal because, with all these failures, we know we’ll have critical data delivery misses. Luckily, Databand can show us the impact of these failures on those missed data deliveries (which tables or files will not be created).
Now you know you have an issue! How can you fix it? Can you remediate quickly?
To get to the root cause of the problem, you filter the dashboard to a relevant time frame and check for the most common errors using an Error widget.
The most common error across your pipelines is a Spark “out of memory” error.
This tells you that the root cause of the system failure is an under-provisioned Spark cluster.
Rather than spending hours manually reading logs, and possibly days chasing a root cause, Databand helped you group multiple co-occurring errors, diagnose the issue, and identify the root cause. Most importantly, you had the context for a possible solution in just a few minutes.
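Conceptually, this error grouping boils down to counting which errors co-occur inside the incident’s time window. A minimal sketch, with made-up run records standing in for real pipeline logs:

```python
from collections import Counter
from datetime import datetime

# Hypothetical error records; in practice these come from your pipeline logs.
errors = [
    {"time": datetime(2022, 5, 10, 3, 2), "error": "OutOfMemoryError"},
    {"time": datetime(2022, 5, 10, 3, 5), "error": "OutOfMemoryError"},
    {"time": datetime(2022, 5, 10, 3, 7), "error": "ConnectionTimeout"},
    {"time": datetime(2022, 5, 10, 9, 0), "error": "SchemaMismatch"},
]

def most_common_error(errors, start, end):
    """Count errors inside a time window and return the most frequent one."""
    window = Counter(e["error"] for e in errors if start <= e["time"] <= end)
    return window.most_common(1)[0] if window else None

top = most_common_error(
    errors,
    start=datetime(2022, 5, 10, 3, 0),
    end=datetime(2022, 5, 10, 4, 0),
)
# top is ("OutOfMemoryError", 2): the dominant error in the 3:00-4:00 window
```

The Error widget does this aggregation for you, across every pipeline at once, without you having to pull and parse the logs yourself.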
Databand saved you precious time so that you could expedite the remediation process without breaching your Data SLA.
Debug problems and get to resolution fast
When a pipeline fails, engineers need to identify resolutions fast if they want to prevent late data delivery and an SLA breach. The Dashboard gives a data engineer the proper context to the problem and focuses their debugging efforts.
As soon as a run failure happens, Databand sends you an alert, bringing you to the proper dashboard, where you can see how the failure will impact deliveries, what errors the failure relates to, and whether the error correlates with issues occurring across pipelines, indicating a contained or system-wide problem.
Using “Runs by Start Time”, our dashboard will enable us to understand if errors are specific to runs, or spread across pipelines (like a system-wide network issue). For errors detected across any runs, we can open the logs to understand the source, whether it’s our application code, underlying execution system (like Apache Spark), or an issue related to the data.
By tracing our failures to the specific cause, we can quickly resolve problems and get the pipeline back on track, so that we can recover quickly from a missed data delivery or avoid it altogether.
Make sure changes fix your problem, without creating new ones
As with any software process, when we make a change to our code (in this case pipeline code) we need to make sure the change is tested before we push it to our production system.
Unlike software processes, it’s a big pain to test data pipeline changes. There are simply too many factors to take into account – the code changes, the different stages of the pipeline, and the data flow to name a few.
When testing changes, one problem teams often face is the difficulty of comparing results between test and production environments.
With a consolidated view of all pipelines across any environment, Databand makes this easy. It offers a better way to perform quality control on pipelines just before they are pushed to production, so you can decrease the risk of yet another failure and a worse SLA miss.
By selecting across multiple source systems, Databand enables you to compare metrics that are critical to your data delivery such as run durations, data quality measures, and possible errors.
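The comparison itself can be thought of as diffing a small dictionary of run metrics and flagging anything that drifts past a tolerance. A hypothetical sketch; the metric names and values below are invented for illustration:

```python
# Hypothetical metric snapshots from a staging and a production run.
staging = {"duration_s": 540, "rows_written": 1_000_000, "null_ratio": 0.01}
production = {"duration_s": 480, "rows_written": 990_000, "null_ratio": 0.01}

def compare_runs(candidate, baseline, tolerance=0.10):
    """Flag metrics that deviate from the baseline by more than `tolerance`."""
    drifted = {}
    for metric, base_value in baseline.items():
        cand_value = candidate.get(metric)
        if cand_value is None or base_value == 0:
            continue
        change = (cand_value - base_value) / base_value
        if abs(change) > tolerance:
            drifted[metric] = round(change, 3)
    return drifted

# duration_s changed by (540 - 480) / 480 = 12.5%, so it gets flagged
print(compare_runs(staging, production))  # {'duration_s': 0.125}
```

A side-by-side view in the Dashboard surfaces the same kind of drift visually, without you having to export and join the metrics yourself.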
Get a bird’s eye view of your data infrastructure
Detect failure trends and prevent future ones
The Databand Dashboard is a powerful tool that will help DataOps teams guarantee their data SLA. With Databand Dashboard you can:
- Fix pipeline issues proactively and ensure on-time data delivery
- Unify the monitoring of all your pipelines across their entire journey from dev to production
- Determine the root cause of pipeline issues fast
- Compare runs from staging and production environments with ease
- Ensure the health of your computation clusters
We’ve just scratched the surface of what Dashboard is capable of.
In the next post, we’ll talk about how Dashboard is used to run retros and how favorited metrics can track the status of important data assets.
Understanding the what and why behind pipeline failure is important. However, our ultimate goal is to catch problems before they happen so that engineering teams can focus on making their infrastructure more efficient — rather than be stuck in that state of costly damage control.
The Databand Dashboard helps you understand what should be considered an anomalous duration, such as an abnormally long run or long-running tasks that keep the rest of the pipeline waiting in a queue for resources. While pipeline stats show the average run time of past runs, Dashboard’s charts can show the duration of a currently running job.
Situations like these can normally only be caught with tedious, manual monitoring. Databand automatically tracks these metrics and sends you an alert so you can fix the issue before your delivery runs late. You can set an alert on the duration of a specific task or run; when the duration exceeds the alert threshold, an alert is sent to Slack, email, or incident management systems like PagerDuty.
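The duration check behind such an alert can be as simple as comparing the current run against a statistical threshold over past runs. A sketch, assuming historical durations are already available; wiring the result to Slack or PagerDuty is left to your alerting integration:

```python
import statistics

def duration_alert(history_s, current_s, sigmas=3.0):
    """Fire when the current duration exceeds mean + sigmas * stdev of past runs."""
    threshold = statistics.mean(history_s) + sigmas * statistics.stdev(history_s)
    return current_s > threshold

# Past run durations in seconds (illustrative values).
history = [600, 620, 590, 610, 605]

duration_alert(history, 630)   # within the normal range: no alert
duration_alert(history, 1200)  # roughly twice as long: alert, notify on-call
```

The useful part is that this check can run against a pipeline that is still in flight, so the alert fires before the delivery is actually late rather than after.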
Get your first month of unified data observability for free
Get a free month of end-to-end data pipeline observability when you schedule a demo and start implementing your first health check.