An 11-point checklist for setting and hitting data SLAs (with an SLA template)

Databand
2021-08-11 11:22:51

We’d venture to say that no team is too small to come up with and commit to a data service level agreement, or data SLA. What is a data SLA? It’s a public promise to deliver a quantifiable level of service. Just as your infrastructure-as-a-service (IaaS) providers commit to 99.99% uptime, you commit to providing data of a certain quality, within certain parameters.

It’s important that the commitment is public. (Within the company, at least.) Publicity creates better accountability, helps you get all teams aligned around what’s most important, and allows you to build a structure that supports the quality. 

In this guide, we explore how to establish your own data SLA.

Data SLAs reduce disagreement and create clarity

Formalized, written data SLAs make your informal commitments concrete and mutually agreeable. Every data relationship involves informal commitments, whether you state them or not, and very often, two parties can agree to something without realizing they’re talking about different things. 

For example, “within a reasonable time frame” means very different things to each department, or even to each individual. For some, it means a week. For others, it’s a quarter. For salespeople, it’s before their next client meeting.

Informal commitments tend to only be as strong as each person’s memory. It’s not uncommon for a data engineering team to informally commit to delivering data within a few weeks, and for the downstream internal “consumers” to simply say, “Thanks.” But then, a week later, those consumers demand to know where the data is, given they’re about to walk into an executive meeting. It’s in those moments you realize they had unvoiced expectations which would have been useful to document.

And if the agreements are merely verbal, they can twist and transform when something goes wrong. If an executive demands something of one of your data consumers, their emergency becomes your emergency. They need it now. Or if a prospect demands to see a sample data set, suddenly salespeople will believe you should be responding to requests same-day.

Formal data SLAs can help with all that. They help you explain to others how you work to achieve your ultimate purpose: data trust. You want everyone in the organization to trust you, and by extension, the data.

You can borrow this data service level agreement template

So what exactly is the data SLA? It’s a simple written document, usually of 250-500 words, posted in a shared space like a company wiki or Google Doc. It should include six elements:

  • Purpose: Why does this data SLA exist? What issues do you expect it to solve, and how do you hope it is used?
  • Promise: What are you promising to other teams? 
  • Measurement: How will you measure the data SLA, who will measure it, and what’s the SLA time frame?
  • Ramifications: What happens when you miss your data SLA? Who is responsible and what sort of remediations are available, if any?
  • Requirements: What do you expect in return? How are your promises conditional? 
  • Signatures: Who is committing to the data SLA?

When writing your data SLA, convey it in as few words as possible without changing the meaning. This requires lots of editing, but we recommend writing it all in one messy pass and returning to edit later. The reason is, if you stare at the page too long, you may develop what writers call “blank page anxiety” and keep putting it off. Punch out a poor-quality draft now—do not wait.

Here is a data service level agreement example:

Company Data Engineering SLA

The purpose of this document is to establish a public promise from our team to others to maintain high data quality within precise parameters. Our hope is it will create understanding, help us all work together, and keep our teams mutually accountable.

Our promise: We’ll deliver sales data with a data quality score of at least 95% by 5:00 am ET every day so the team can answer questions like “What were sales yesterday?” We’ll acknowledge all requests within one business day and sort them by simple and complex tickets. We’ll resolve simple requests within three business days and complex requests within two weeks.

We’ll measure data quality by comparing data delivery KPIs like Run Start Time and Run Complete Time, Record Count and ratio of Null to Record Count, and distribution and drift scores with the predefined standards for data freshness, data completeness, and data fidelity. 

If we miss a data SLA, our team will, within three business days, post a public apology that takes responsibility, explains why it happened, and lists the precise measures we’re putting in place to fix it.

In order to fulfill this promise, we need your help. Our team needs timely direction, input, and clear feedback on how the data is being used, as well as at least four weeks’ notice of any complex requested changes.

Please direct all questions, comments, and concerns to [email protected]. (But you can direct all praise and flattery to [email protected] 😉.)

With resolve,

– Your Data Engineering Team
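The measurable parts of a promise like the one above can be checked mechanically. Here is a minimal sketch in Python, assuming each pipeline run reports a completion time, a record count, and a null count. The `RunStats` fields, the threshold values, and the function names are hypothetical, chosen only for illustration; they are not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import date, datetime, time


@dataclass
class RunStats:
    """Delivery KPIs for one pipeline run (field names are illustrative)."""
    run_complete: datetime  # assumed to be expressed in the SLA's time zone (ET)
    record_count: int
    null_count: int


def check_run(stats, delivery_date, deadline=time(5, 0),
              min_records=1000, max_null_ratio=0.05):
    """Return pass/fail results for freshness and completeness.

    The thresholds are placeholder assumptions; agree on real ones with
    your data consumers.
    """
    # Treat an empty run as fully null so it fails the null-ratio check.
    null_ratio = stats.null_count / stats.record_count if stats.record_count else 1.0
    return {
        "fresh": stats.run_complete <= datetime.combine(delivery_date, deadline),
        "complete": stats.record_count >= min_records,
        "nulls_ok": null_ratio <= max_null_ratio,
    }


def meets_sla(stats, delivery_date, **thresholds):
    """True only if every individual check passes."""
    return all(check_run(stats, delivery_date, **thresholds).values())


# Example: one run that lands before 5:00 am and one that lands late.
day = date(2021, 8, 11)
on_time = RunStats(datetime(2021, 8, 11, 4, 42), record_count=12000, null_count=240)
late = RunStats(datetime(2021, 8, 11, 6, 15), record_count=12000, null_count=240)
print(check_run(on_time, day))
print(meets_sla(late, day))
```

Returning each check separately, rather than a single yes or no, makes it easier to say which part of the promise was missed when you write the follow-up.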

11 strategies for hitting your data SLA

With your SLA in place (or perhaps while you’re editing it), start thinking about all the things you need to put into place before you can hold yourself to it. 

For example:

1. Define what “good data” means

Try to wring as much ambiguity out of this phrase as possible. Define it in concrete and unmistakable terms. As we see it, there are four characteristics you can use to define high-quality data. Once defined, secure other teams’ agreement on that definition. 

Ask yourself:

  • What is the outcome of good data for the business?
  • What unique characteristics define good data? 
  • What characteristics define bad data?

2. Track whether the data is available

For tracking, you’ll need an observability tool so you actually know if parts of your pipeline are down. Without one, it’s pretty tough to measure whether you’re missing an SLA, much less diagnose the root cause. It’ll also help you understand errors so you can fix things far faster. 

You can treat your data SLA like a North Star metric—one focal point to guide everyone. But within it, there’s of course a lot of concealed complexity, and you’ll need to track a basket of KPIs to help you know what’s happening upstream and downstream. 

Here are a few specific recommendations: 

  1. Set automatic tests to monitor data quality on its four dimensions
    • Test data pre-production
    • Test at each stage: completeness, anomalies
  2. Measure how well you discover, respond to, and address issues
    • Time to discovery
    • Time to resolution
    • Incidents per asset
  3. Document the proximate and root causes of every issue
    • Data partner missed a delivery
    • Time out
    • Job stuck in a queue
    • Unexpected transformation
    • Permission issue
    • Runtime error
    • Schedule changes
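The response metrics in the second recommendation can be computed from a simple incident log. The sketch below assumes you record when each incident occurred, was discovered, and was resolved. The `Incident` fields and the `summarize` function are hypothetical names, and time to resolution is measured here from discovery, which is one of several reasonable conventions.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    asset: str            # table or pipeline affected
    cause: str            # root cause label, e.g. "Permission issue"
    occurred: datetime    # when the problem actually started
    discovered: datetime  # when a person or an alert noticed it
    resolved: datetime    # when it was fixed


def summarize(incidents):
    """Mean time to discovery and resolution (in hours), plus counts per asset and cause.

    Expects at least one incident; statistics.mean raises on empty input.
    """
    def hours(delta):
        return delta.total_seconds() / 3600

    return {
        "mean_hours_to_discovery": mean(hours(i.discovered - i.occurred) for i in incidents),
        "mean_hours_to_resolution": mean(hours(i.resolved - i.discovered) for i in incidents),
        "incidents_per_asset": Counter(i.asset for i in incidents),
        "incidents_per_cause": Counter(i.cause for i in incidents),
    }


# Example log with two made-up incidents on the same asset.
log = [
    Incident("sales_daily", "Job stuck in a queue",
             datetime(2021, 8, 2, 1, 0), datetime(2021, 8, 2, 3, 0), datetime(2021, 8, 2, 7, 0)),
    Incident("sales_daily", "Permission issue",
             datetime(2021, 8, 5, 2, 0), datetime(2021, 8, 5, 6, 0), datetime(2021, 8, 5, 8, 0)),
]
summary = summarize(log)
print(summary)
```

Counting incidents by cause as well as by asset shows which of the root causes listed above recur most often, which is where long-term fixes tend to pay off.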

3. Identify the infrastructure you’ll need to add

Be cautious about what you commit to. You can’t be everywhere and prepare for everything, and an SLA of 99.999% uptime leaves you only about five minutes of downtime each year. To deliver on that, you’d probably need more headcount, more visibility, more redundancies, and people working around the clock.
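To see what an uptime target really commits you to, it helps to convert it into a downtime budget. This short sketch does the arithmetic, assuming a 365-day year; the function name is ours, chosen for illustration.

```python
from decimal import Decimal

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; ignores leap years


def downtime_minutes_per_year(uptime_percent: str) -> Decimal:
    """Allowed downtime per year, in minutes, for a target such as "99.999".

    Takes the target as a string so Decimal keeps it exact.
    """
    allowed_fraction = (Decimal(100) - Decimal(uptime_percent)) / Decimal(100)
    return allowed_fraction * MINUTES_PER_YEAR


for target in ("99.9", "99.99", "99.999"):
    print(f"{target}% uptime -> {downtime_minutes_per_year(target):.2f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten: roughly 525.6 minutes a year at 99.9%, 52.56 minutes at 99.99%, and 5.26 minutes at 99.999%.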

4. Implement issue tracking and reporting

You’ll probably need a ticketing tool like Jira or ServiceNow. This allows data users to create tickets, your team to track them, and you to understand the nature of those tickets so you can come up with long-term fixes and identify trouble areas. 

5. Define data owners

You may not want to specify it in your public data SLA document, but define data source and pipeline owners. They’re the ones ultimately responsible if something goes wrong. Also specify what happens if they go on vacation or leave the company.
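One lightweight way to record ownership is a small registry that names a primary and a backup owner for each asset. The sketch below is only an illustration: the asset names, people, dates, and the `on_point` function are all made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Ownership:
    primary: str
    backup: str
    # Optional window during which the primary owner is unavailable.
    away_from: Optional[date] = None
    away_until: Optional[date] = None


# Hypothetical registry; in practice this might live in a config file or wiki.
OWNERS = {
    "sales_daily": Ownership("alice", "bob", date(2021, 8, 16), date(2021, 8, 20)),
    "crm_export": Ownership("carol", "dave"),
}


def on_point(asset: str, today: date) -> str:
    """Return who is responsible for an asset on a given day."""
    entry = OWNERS[asset]
    away = (
        entry.away_from is not None
        and entry.away_until is not None
        and entry.away_from <= today <= entry.away_until
    )
    return entry.backup if away else entry.primary


print(on_point("sales_daily", date(2021, 8, 18)))
print(on_point("sales_daily", date(2021, 8, 23)))
```

If someone leaves the company, the registry is the one place to update, and the backup covers the gap in the meantime.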

6. Set up alerts

Set up alerts to post in your team messaging app such as Slack or an incident management system like PagerDuty. The more incident detail you can pack into that alert, the faster you can diagnose. These alerts will tell you early who else you’ll need to bring in, or where to begin your analysis. (Databand can send these alerts, and appends useful insights and context.)
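As an illustration of packing detail into an alert, here is a minimal sketch that builds a message body in the shape Slack incoming webhooks accept (a JSON object with a `text` field). The `build_alert` function and its parameters are hypothetical, and this is not how Databand formats its alerts. Sending the payload, which would be an HTTP POST to your webhook URL, is left out.

```python
import json


def build_alert(pipeline, check, observed, expected, owner, run_url=None):
    """Build a Slack-style webhook payload describing a failed SLA check."""
    lines = [
        f"Data SLA check failed: {check}",
        f"Pipeline: {pipeline}",
        f"Observed: {observed} (expected {expected})",
        f"Owner: {owner}",
    ]
    if run_url:
        # Link to the run's logs, if your orchestrator exposes one.
        lines.append(f"Run: {run_url}")
    return {"text": "\n".join(lines)}


# Example with made-up values.
payload = build_alert(
    pipeline="sales_daily",
    check="null ratio",
    observed="12%",
    expected="at most 5%",
    owner="alice",
)
print(json.dumps(payload, indent=2))
```

Including the owner and the observed versus expected values in the message itself saves the first few minutes of triage.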

7. Publish a team incident response plan

Let’s say a data consumer tells you a table is broken on their dashboard. How do you confirm and respond? Write it out so when an incident occurs, you don’t run into the bystander problem, where everyone assumes someone else will handle it, and then nobody acts. 

Depending on the size of your team, and how you’re distributed around the world, you may want to take this very seriously, and appoint what emergency responders call an incident commander. That person becomes the CEO of the incident and directs all others. (This ensures a coordinated response and helps you avoid multiple people tackling the same issue.)

8. Communicate issues with in-app alerts

If you’re able, create alert panels on people’s dashboards so you can communicate the status of the system. If something goes wrong, you can write, “We’re having an outage. Here’s our estimated time to resolution.” This will head off repeated reports from all your data consumers and free you to actually respond.

If you can’t create alert panels, at the very least, designate a point person on each team whom you can notify and who will pass the message along to the others.

9. Monitor and update

Monitor how your data consumers are using the data (and whether they’re using it at all). Conduct occasional surveys, formal or informal, to gauge their trust in that data, and invite suggestions. For consumers who are interested, communicate what’s on your roadmap.

10. Conduct periodic maintenance

Schedule regular maintenance periods where your team reviews why things broke and brainstorms fixes. Ask why those issues were possible, conduct a no-fault post-mortem, document your findings, assign those fixes, and monitor how they worked.

11. Publish your data SLA

With all that figured out, you’re ready to edit and revise your data SLA. Publish it in your company wiki or another shared space, secure everybody’s commitment, and hold yourself to it.

Hitting your data SLAs

Data SLAs help you keep yourself and your team honest. While they’re phrased as a public promise to others, they’re really a bilateral agreement—you agree to provide data within specific parameters, but in return, you need people’s participation and their understanding. 

A lot can go wrong in data engineering, and much of it comes down to miscommunication. Documenting your SLA goes a long way toward clearing it all up, so you can achieve your ultimate goal: instilling greater data trust within your organization.
