An insider’s guide to data pipeline incident management

Your objective is to never let pipelines break. Of course, they will break anyway, so your secondary objective is to never let them break the same way twice. Your incident management process should lead you to create durable fixes. You want to build a human-error-resistant, fault-tolerant, antifragile system that keeps delivering data within your data SLA.

In this article, we explore how to use incident management to do more than repair your pipelines and restore service: how to turn it into a tool that continuously improves your service.

First order of business in data pipeline incident management: Pick your response style

This is an important point. Do you know the difference between a hero and a leader? The hero tries to be the fix. They learn a lot and make themselves irreplaceable but when they aren’t there, nobody knows what to do. Heroes are well-meaning, but create fragile systems that don’t survive without them.

Leaders do the opposite. They try to fix the root cause and make themselves irrelevant.

You can tell the difference between heroes and leaders in the workplace by the fact that heroes are first-in when something breaks—but they are also forever managing incidents. Leaders, on the other hand, like to ask a lot of questions and are less visible. But they also handle fewer and fewer incidents over time.

As you can imagine, the role you choose has a big impact on how much your data pipeline incident management leads to lasting fixes.

Incident management hero | Incident management leader
--- | ---
Responds quickly | Responds thoughtfully
Wants to know about every incident | Wants to understand every incident
Finds crises exciting | Publicly admits crises are embarrassing
Interested in immediate fixes | Interested in long-term fixes
Satisfied with proximal causes | Only satisfied with root causes
Savior aura | Mostly invisible
Worries about credit | Worries about being correct

Borrow our data pipeline incident management framework

Now on to the solution. Your cloud incident response plan may look different than ours, but most follow the same order and operate on the same principles. That’s because, as we’ll explore, incident management looks surprisingly similar across industries, and there’s a lot to learn from people who do it a lot—like firefighters and search and rescue personnel.

1. First, close your feedback loop

Aim to turn everything that occurs into knowledge and rules. (Also known as creating a knowledge-centered service.) To do that, you either experiment endlessly until you stumble upon fixes, or you implement an observability tool and become obsessed with root cause analysis (RCA).

Our platform, Databand, is a tool that can help you observe and track all your pipelines to identify causes. It integrates with pipeline management and data storage systems like Spark, Airflow, Snowflake, and more to provide metrics and insights those systems don’t natively offer.

For instance, you can:

  • Monitor data quality
  • Trace issue lineage
  • Identify root causes
  • Set alerts for anomalies (in Slack or PagerDuty)
  • Create unified logging
  • Conduct retrospective analyses

The key with an observability tool is, of course, to identify the leading indicators of pipeline error and set alerts for them. You can’t catch every issue before it happens, but if you fully understand each issue, you can set alerts for progressively earlier warning signs so bad data never even makes it into your data warehouse.
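As a minimal sketch of what a leading-indicator check might look like before a batch is loaded, here’s one in Python. The thresholds, column names, and send_alert helper are hypothetical placeholders; in practice you’d route alerts to Slack, PagerDuty, or your observability tool:

```python
import pandas as pd

# Hypothetical alert hook; in practice, route this to Slack, PagerDuty,
# or your observability tool.
def send_alert(message: str) -> None:
    print(f"[ALERT] {message}")

def check_leading_indicators(df: pd.DataFrame, expected_columns: set, min_rows: int) -> bool:
    """Return True if the batch looks healthy enough to load into the warehouse."""
    healthy = True
    missing = expected_columns - set(df.columns)
    if missing:
        send_alert(f"Schema drift: missing columns {sorted(missing)}")
        healthy = False
    if len(df) < min_rows:
        send_alert(f"Row count {len(df)} is below the expected minimum of {min_rows}")
        healthy = False
    max_null_rate = df.isna().mean().max() if not df.empty else 1.0
    if max_null_rate > 0.2:
        send_alert(f"Null rate {max_null_rate:.0%} exceeds the 20% threshold")
        healthy = False
    return healthy

# An example batch with obvious problems: a missing column, too few rows, and nulls.
batch = pd.DataFrame({"order_id": [1, 2, None], "amount": [10.0, 12.5, 9.9]})
if check_leading_indicators(batch, {"order_id", "amount", "customer_id"}, min_rows=100):
    print("Load to warehouse")
else:
    print("Hold the load and open an incident")
```

The point isn’t these particular checks; it’s that every issue you fully understand should add or tighten a check like this so the same failure gets caught earlier next time.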

2. Document your cloud incident response plan

Write down what precisely happens when a data pipeline incident occurs, and who is responsible for what, in an internal wiki or Google Doc.

You can look to real-life emergency workers for guidance. First responders in the U.S. use a framework known as the Incident Command System (ICS), developed in response to a series of devastating wildfires in California in the 1970s. It’s designed to be flexible and allow different groups and agencies to interoperate during a crisis.

In ICS, the first person to arrive at the incident scene is deemed the incident commander (IC). They inherit every possible hat, like communications, logistics, and personnel. As others show up, they hand out hats. If someone new outranks them, they hand off the commander’s hat. The framework tells everyone exactly what their role is as they work toward a solution.

In your version of this, define your response roles and responsibilities, like an incident commander, head of comms, and so on. And, here is the key: Be sure people understand them before an emergency occurs. If you’re serious about your data SLA, run test drills.
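As an illustration, role definitions can live in version control right next to your runbook so there’s no ambiguity during a drill or a real incident. Everything below (role names, channels, responsibilities) is a hypothetical sketch, not a prescription:

```python
# Hypothetical ICS-style role card; in practice this might live in a YAML file
# next to your runbook, or in your paging tool's on-call schedule.
INCIDENT_ROLES = {
    "incident_commander": "first responder by default; hands off if someone more senior arrives",
    "communications_lead": "posts status updates to the #data-status channel every 30 minutes",
    "operations_lead": "owns the investigation and the fix",
    "scribe": "keeps the incident state document up to date as events unfold",
}

def print_role_card() -> None:
    """Print the role card read aloud at the start of every drill and incident."""
    for role, responsibility in INCIDENT_ROLES.items():
        print(f"{role.replace('_', ' ').title()}: {responsibility}")

if __name__ == "__main__":
    print_role_card()
```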

Here’s an example of a data pipeline incident management command process:

  1. Detect the error early—an observability tool can help you identify leading rather than lagging indicators of error. (The worst lagging indicator is a user complaint.)
  2. Delegate roles, if relevant—emergency first responders are trained to first point to a bystander and tell them to call 911. If you’re first on the scene, assign the second responder to communicate the impact while you work on resolving the issue.
  3. Communicate the impact—let data consumers know to stop using the data until you apply a fix. Define what channel you’ll provide these updates on ahead of time, and make sure people know to look there.
  4. Resolve the issue—investigate the error; balance the need for a quick fix with understanding the root cause.
  5. Conduct a post-mortem—don’t stop digging until you identify the root cause. Otherwise, it will reoccur, and one crisis becomes a dozen. (This is much easier to do if you have an observability tool that records past states so you can conduct retrospective post-mortems without fear of losing anything.)
  6. Document things in an incident state document—record everything that occurred in a searchable log or ticket response tool like Jira or ServiceNow.
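For the incident state document in step 6, a structured, searchable record beats free-form notes. Here’s a minimal sketch; the field names and example values are illustrative, not a real incident:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class IncidentRecord:
    """Fields for a searchable incident state document (names are illustrative)."""
    title: str
    detected_at: datetime
    detected_by: str                 # alert name or person
    impact: str                      # which datasets or dashboards were affected
    proximate_cause: str
    root_cause: str
    resolution: str
    follow_up_actions: list = field(default_factory=list)

record = IncidentRecord(
    title="Orders table loaded with duplicate rows",
    detected_at=datetime(2023, 4, 12, 6, 30, tzinfo=timezone.utc),
    detected_by="row-count anomaly alert",
    impact="Daily revenue dashboard overstated sales",
    proximate_cause="Upstream retry re-sent the previous day's file",
    root_cause="Loader did not deduplicate on order_id",
    resolution="Deduplicated and backfilled the affected partitions",
    follow_up_actions=[
        "Add an order_id uniqueness check before load",
        "Alert when the duplicate rate exceeds 0.1%",
    ],
)

# Serialize for a ticket tool like Jira or ServiceNow, or a searchable log.
print(json.dumps(asdict(record), default=str, indent=2))
```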

During the post-mortem, it’s very important you avoid assigning blame. Blame does nothing but teach people to bury issues and deflect questions. To fire someone because of a mistake is the craziest thing of all—you’re getting rid of the one person who knows what not to do. Create a culture that recognizes that everyone who causes an error has discovered a bug and done the team a service. Your goal is to build a durable system, not to never make a mistake.


During a post-mortem, Amazon’s data engineering team asks five questions to understand precisely what happened and what to do about it (edited for brevity):

  • Why was this incident possible?
  • Is the current reason the root cause?
  • Could it have been detected earlier?
  • Could it have been prevented?
  • If the cause was human error, why was it possible?

And of course, if something can be prevented, it should be.

(The Google incident response whitepaper is also worth a read.)

3. Communicate decisively to data consumers and stakeholders

Data doesn’t just have to be accurate—it has to be trusted. One real estate AI firm learned this the hard way.

They produced a truly impressive machine learning algorithm that helped retail officers at big, casual fast-food chains pick their next store location. The product drew data from traffic cameras, cell towers, weather providers, and more. But when they showed it to retail officers, the officers snubbed it. “I know better than the machine,” they’d say. “I’ve been doing this for 40 years, and that’s not a good location.” All the data in the world didn’t matter. Users didn’t trust it.

Data engineering teams face a similar challenge with incident management. For people to believe that the dashboards and data are true, and to use them to make decisions, they have to trust you. A massive part of building that trust is conducting data pipeline incident management like a pro.

Some communication principles that signal to teammates, “You can trust us”:

  • Communicate early—if someone has to ask you, they won’t think you’re on top of it.
  • Communicate clearly—leave no room for ambiguity. Instead of saying “in a while,” estimate a time frame. Instead of saying, “Please help,” explain precisely what you want others to do, and why. (If you explain why, people are more likely to comply.)
  • Don’t sugarcoat it—if things are ever worse than you said, they’ll always expect the worst.
  • Fall on your sword when needed—accepting blame makes you human. Avoiding blame only makes people suspicious.
  • Respond to questions quickly—even if it’s to say it’s too early to tell.

One great way to get out ahead of many questions is to publish a data SLA, or a public promise to deliver data within certain quality or time parameters.

Another great way is to make your incident management communications clear and perhaps automatic. Can you use errors in your observability platform to automatically create a notification on users’ dashboards?
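Here’s a minimal, tool-agnostic sketch of that idea: a small HTTP endpoint that receives an alert webhook from your observability platform and writes a banner message to a table your dashboards read on load. The endpoint path, payload shape, and table are assumptions, not any specific vendor’s API:

```python
# Minimal sketch: turn observability alerts into dashboard banner messages.
# Assumes your observability tool can POST a JSON webhook and your BI layer
# can display rows from a "banners" table; both are assumptions.
import sqlite3
from datetime import datetime, timezone

from flask import Flask, request

app = Flask(__name__)
DB_PATH = "dashboard_banners.db"

def init_db() -> None:
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS banners (created_at TEXT, dataset TEXT, message TEXT)"
        )

@app.post("/alerts")
def receive_alert():
    payload = request.get_json(force=True)
    message = (
        f"Known issue with {payload.get('dataset', 'an upstream dataset')}: "
        f"{payload.get('summary', 'investigation in progress')}"
    )
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "INSERT INTO banners VALUES (?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), payload.get("dataset"), message),
        )
    return {"status": "banner published"}, 200

if __name__ == "__main__":
    init_db()
    app.run(port=8080)
```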

4. Implement what you learn

Update your data pipeline incident plan document frequently. It’s a living document, and you’re forever trying to improve it, and your data pipelines, so incidents become less frequent and less severe.

Then, create or adjust alerts in your observability tool. In Databand, for example, you can set alerts for anomalies and route them to Slack or PagerDuty.
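As a tool-agnostic sketch of the kind of rule such an alert encodes (the metric, history window, and three-sigma threshold are illustrative assumptions, not Databand’s actual configuration):

```python
import statistics

# Illustrative alert rule: flag a pipeline run whose duration deviates sharply
# from recent history. The threshold and minimum history size are assumptions.
def duration_is_anomalous(recent_durations_s: list, latest_s: float, z_threshold: float = 3.0) -> bool:
    if len(recent_durations_s) < 5:
        return False  # not enough history to judge
    mean = statistics.mean(recent_durations_s)
    stdev = statistics.stdev(recent_durations_s)
    if stdev == 0:
        return latest_s != mean
    return abs(latest_s - mean) / stdev > z_threshold

history = [310.0, 295.0, 320.0, 305.0, 300.0, 315.0]  # seconds, recent healthy runs
latest = 980.0
if duration_is_anomalous(history, latest):
    print("Alert: run duration looks anomalous; page the on-call engineer")
```

Each post-mortem should leave at least one rule like this behind, tuned to catch the failure you just lived through, but earlier.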

On an ongoing basis, it’s not a bad idea to bake periods of downtime and maintenance into your data pipeline management process. During those periods, you can stress test your pipelines, or pass unusual data through them to see if you can break them. Everything you learn can save you from having to respond to actual incidents.
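Here’s a minimal sketch of that kind of break-it-on-purpose exercise, with a placeholder transform and deliberately unusual batches; the goal is to confirm the pipeline fails loudly instead of silently producing bad output:

```python
import pandas as pd

# Placeholder transform standing in for a real pipeline step.
def transform(df: pd.DataFrame) -> pd.DataFrame:
    if df["amount"].isna().any():
        raise ValueError("Null amounts detected; refusing to load")
    out = df.copy()
    out["amount_usd"] = out["amount"].astype(float).round(2)
    return out

# Deliberately unusual inputs to push through during a maintenance window.
unusual_batches = {
    "null_amounts": pd.DataFrame({"order_id": [1], "amount": [None]}),
    "string_amounts": pd.DataFrame({"order_id": [2], "amount": ["12.5"]}),
    "empty_batch": pd.DataFrame({"order_id": [], "amount": []}),
}

for name, batch in unusual_batches.items():
    try:
        transform(batch)
        print(f"{name}: passed through silently (decide whether that is acceptable)")
    except Exception as exc:
        print(f"{name}: failed loudly ({exc})")
```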

And as you build new systems, keep them modular. It’s always a good idea to separate responsibilities so you can test individual components and isolate issues more quickly.

The key to incident management: Monitor, monitor, monitor

If there’s one thing we’d like to stress about this process, it’s the absolute necessity of observability. Nothing else works if you can’t quickly and reliably detect errors, isolate them, and understand their root causes. The same way you unit test software, you should unit test your pipelines, continuously. It’s the best way to conduct data pipeline incident management: one that builds a system that experiences fewer and fewer errors.
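A minimal sketch of what that looks like in practice, using pytest and a placeholder transform (the function and its expectations are assumptions, not a prescribed test suite):

```python
# test_orders_transform.py -- run with `pytest`, continuously in CI,
# just as you would unit test application code.
import pandas as pd
import pytest

def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Placeholder transform: drop duplicate orders and reject non-positive amounts."""
    out = df.drop_duplicates(subset=["order_id"]).copy()
    if (out["amount"] <= 0).any():
        raise ValueError("Non-positive order amounts")
    return out

def test_deduplicates_on_order_id():
    raw = pd.DataFrame({"order_id": [1, 1, 2], "amount": [10.0, 10.0, 5.0]})
    assert len(clean_orders(raw)) == 2

def test_rejects_non_positive_amounts():
    raw = pd.DataFrame({"order_id": [1], "amount": [-3.0]})
    with pytest.raises(ValueError):
        clean_orders(raw)
```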

And if you do a really good job, you’ll respond to fewer and fewer incidents. Which, if you’ve been acting like a leader and not a hero, is perfectly alright with you.