Data SLAs & data products: how they’re related & why it matters
The data industry is going through another transformative period as industry leaders unearth a massively underutilized asset: their own data. No longer a simple means to an end, data is being batched, streamed, transformed, and collected by practically every interface we interact with on a daily basis. In turn, organizations are adopting data-driven cultures and rapidly building new data teams to mine this new resource. The ultimate goal? To go where no data has gone before: advanced analytics.
Yet, how can these organizations measure their capability to deliver value from these nascent data products? Enter the data SLA. Though a more obscure concept than the data product itself, the data SLA is closely related to it: data SLAs are the mechanisms that bring tangibility to data ideas across this new frontier. They are the practical means of achieving DataOps efficiency within the Data-as-a-Service framework.
In this article, we’ll explore the relationship between data products and data SLAs and discuss some simple steps you can take towards implementing realistic Data SLAs in your organization.
What is a data product?
As with most epic journeys, our data adventure begins with an innocent question: How do we create value from all of this data we are storing? For many organizations, there isn’t a straightforward answer. Most data that is collected ends up being inaccessible or forgotten, let alone cataloged in a manner conducive to analysis.
More often than not, greenfield data projects are an expensive proposition. Between the projected engineering, infrastructure, and maintenance costs associated with collecting, processing, and storing data, and a perceived high failure rate, data initiatives tend to not pass ‘Go’, nor ‘Collect $200’.
Most executives cite “people challenges”, not “technological barriers”, as the foremost obstacle to addressing this problem. While a shortage of specialized talent contributes to this “people” problem, it’s more of a philosophical (see: mindset) issue. Rather than continually crunching the numbers, the savviest of these executives weigh those future costs against the current opportunity cost of the petabytes of data sitting idle with seemingly endless untapped potential.
This shift in mindset is key to driving innovation within the industry. The last industry shift was a technological one, which supports those executives’ view that the problem goes deeper. That technological shift was the separation of the storage layer: from data warehouses that handled both storage and computation to a domain-agnostic data lake.
To give some context, organizations were collecting heaps of data for all different types of teams and use cases. With computation and storage happening in the same place, data schemas became siloed and there was a mass duplication of efforts due to a lack of proper governance. With no scalable way to organize and make use of that data, data warehouses became an overcrowded mess.
Leaders in the data industry began to make the switch from a warehouse-first approach to a dedicated mass storage layer, the data lake. The original intention was for the data lake to store massive amounts of “raw”, domain-agnostic data. Then, data engineers could assist teams and departments in transforming that data into a workable form within the data warehouse. Eventually, data lakes suffered from the same problem. While computation might not happen in this layer, there was still no clear “operating procedure” between data providers and consumers.
Who would pipeline data from the lake to the warehouse? Who would pipeline data from the warehouse to the consumer? Who would be responsible for cleaning the data at the different stages? Who is responsible for debugging different stages of the pipeline? What does “better data quality” mean to the consumer? What is realistically achievable according to the providers? If you want to break down organizational silos, you need to answer those questions, and technology can’t answer them for you.
By definition, data products need a data SLA
The data-as-a-product model intends to mend the gap that the data lake left open. In this philosophy, company data is viewed as a product that will be consumed by internal and external stakeholders. The data team’s role is to provide that data to the company in ways that promote efficiency, good user experience, and good decision making.
As such, the data providers and data consumers need to work together to answer the questions put forward above. Coming to an agreement on those terms and spelling it out is called a data SLA.
SLA stands for service-level agreement: a contract between two parties that defines and measures the level of service a given vendor or product will deliver, as well as remedies if they fail to deliver. Essentially, SLAs are an attempt to define expectations around the level of service and quality between providers and consumers.
They’re very common when an organization is offering a product or service to an external customer or stakeholder, but they can also be used between internal teams within an organization.
In a similar fashion, a data SLA refers to the quality and accessibility of your organization’s data. It acts as a contract between the data team (not just data engineering) and the consumer (internal or external). This SLA doesn’t need to be written down (though we do provide a template for one here), but putting it in writing helps. At the very least, an unwritten SLA should take the form of a conversation between both parties where they outline clear and measurable expectations for the data products they deliver or consume.
SLAs shouldn’t leave room for interpretation. For example, instead of “We need to understand how users are interacting with our system,” they need to provide a specific, actionable requirement, such as: “We need a report showing which pages of the application a user views, how long the user remains on that page, the actions the user takes on that page, and how often the user performs these actions or views on a per-day basis.”
This level of detail provides data teams with a goal by which they can measure progress, begin breaking down organizational silos, and usher in a culture of data governance.
Defining and setting your data SLA
Creating an SLA is going to be different for every organization. That said, a data SLA is typically made up of four elements:
- Uptime–Is my data up to date and accessible?
- Example: You have a table in Snowflake or a file in S3 that an analyst depends on daily. Is that data arriving on time and in the right place?
- Completeness–Is all the data arriving in the right format?
- Example: The data schema, record count, and null count within a dataset need to be within an expected range for the data to be considered workable.
- Fidelity–Is the data accurate?
- Example: Does this dashboard represent reality? The result of the computation may be correct, but is the data true, or did something go wrong (like pipelining data from the incorrect source)?
- Remediation–If we miss any of these standards, how quickly do we recover?
- Example: How long does it take engineering to identify a problem, find the root cause, and fix it? How long a window is acceptable before there is some kind of penalty?
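As a rough illustration, the first, second, and fourth elements can be expressed as simple programmatic checks. This is a minimal sketch, not a definitive implementation; the function names and thresholds are hypothetical, and real checks would come from your observability tooling:

```python
from datetime import datetime, timedelta

def check_uptime(last_delivery: datetime, max_delay: timedelta) -> bool:
    """Uptime: did the latest delivery land within the agreed window?"""
    return datetime.utcnow() - last_delivery <= max_delay

def check_completeness(record_count: int, null_count: int,
                       expected_min_records: int, max_null_ratio: float) -> bool:
    """Completeness: are record counts and null ratios within the expected range?"""
    if record_count < expected_min_records:
        return False
    return (null_count / record_count) <= max_null_ratio

def check_remediation(detected_at: datetime, resolved_at: datetime,
                      max_recovery: timedelta) -> bool:
    """Remediation: was the issue fixed within the agreed recovery window?"""
    return resolved_at - detected_at <= max_recovery
```

Fidelity is deliberately absent here: whether a dashboard “represents reality” usually requires cross-checking against a trusted source, which is hard to reduce to a one-line predicate.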
Defining your data SLA is all about compromise
Which elements you need to optimize your data SLA for will vary based on the overall business objectives of the organization and the needs of the individual consumer. In an ideal world, you would be able to guarantee 99.9999% uptime, fidelity, and completeness, and if things did go haywire, you would be able to dedicate your entire engineering team to fixing it.
Unfortunately, we’re bound by the Data Quality Iron Triangle. What do I mean by that? When you begin optimizing for one aspect of your SLA, it will have an effect on your ability to maintain the others.
So, when you optimize your data product in one area, it pulls your data product away from the center of the Iron Triangle. This changes the mutually agreeable definition of “good” data quality for your SLA.
For example, the consumer may accept a more lenient data uptime of 90% each month as long as they are guaranteed more generous remediation terms (i.e. dedicated support until the issue is resolved, or budget reallocation) if downtime exceeds the acceptable threshold.
How to find the right balance
There is some balance between these three constraints that works for you and your consumers. Finding that balance means having a very candid conversation with your consumers and figuring out how to balance their needs with what you can realistically provide. Here are some open-ended questions to guide SLA conversations between you and your consumers:
What’s the granularity of our SLA? Company-wide SLA or a dataset-specific SLA?
The goal of this article is to encourage every organization to set an SLA between providers and consumers. That said, not every dataset requires a custom-tailored data SLA. Just like in alerting, setting a unique SLA on every dataset being delivered will spread data engineering thin, create alerting fatigue, and risk the entire exercise being viewed as idealistic (read as: unrealistic), at which point stakeholders will go back to business as usual.
The best place to start is determining which data deliveries require a specialized, more stringent data SLA. Every pipeline has a different level of importance based on the data it delivers. That level of importance could help you narrow down which of your datasets and consumers will need their own data SLA.
What are some KPIs of our data product?
While every one of your consumers would like “better data quality”, they need to be able to quantify what that means to them. Most teams will have more than one KPI for their dataset, but for simplicity’s sake, consider a team whose KPI is data completeness.
Once you know that, you can begin to map those data KPIs with your architecture’s performance and data health metrics. For our example, the performance indicator for that objective could be data health metrics like Null Count or Record Count.
You’ve just identified your first SLI (service-level indicator) and a rough SLO (service-level objective), the two ingredients of your data SLA. SLOs are your performance targets; in this case, that was data completeness. SLIs are simply what you would use to measure that objective; for our example, that was null count or record count.
To make that SLO useful, we need to further refine it. What does “good” data completeness look like? After a conversation, you might discover that having 99.9999% data completeness would be ideal for them. Now, with a concrete SLO, you’ll be able to calculate exactly what your error budget in Null Count and Record Count will be to hit your objective.
This exercise helps you transform a vague request like “better data quality” into something quantifiable, measurable, and attainable. And these SLOs and SLIs become the building blocks of your larger SLA.
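The error-budget arithmetic above can be sketched in a few lines. This is an illustrative example, not a prescribed method; the function names and record counts are made up for the purpose of the demonstration:

```python
def error_budget(slo: float, total_records: int) -> int:
    """Number of records allowed to be missing or null while still meeting the SLO."""
    return int(total_records * (1 - slo))

def slo_met(bad_records: int, total_records: int, slo: float) -> bool:
    """SLI check: does the observed bad-record count stay within the budget?"""
    return bad_records <= error_budget(slo, total_records)
```

At a 99.9% completeness SLO over a 10,000,000-record monthly delivery, the budget works out to 10,000 bad records; tightening the SLO to 99.9999% shrinks it to a handful, which is exactly the kind of cost difference the next section’s negotiation is about.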
How realistic are the expectations of your data consumers?
The goal of creating an SLA isn’t just to make the consumers happy. It’s to get data providers and consumers on the same page so they can work together to create a better data product.
There are a lot of moving pieces that go into delivering data at a certain standard of quality. As mentioned before, while everyone would like perfect data, there are limitations to just how fast and how clean datasets can be. Factors outside of engineering’s control like budget constraints, technological limitations, and architectural structure impact a data product’s freshness, fidelity, and remediation speed.
At this stage, it’s important to communicate those constraints. While 99.9999% data completeness might be ideal for our example consumers, you might put forth a more realistic goal of 99.9% data completeness based on certain limiting factors. You could also lay out a situation where 99.9999% might become achievable. Then, it would be up to the consumers to build and present the business case to your organization’s leadership to justify the costs of those changes. More often than not, they’ll accept your compromise and you can review whether that option is still necessary in the future.
How will the data SLA be enforced?
Setting up a data SLA can feel a bit idealistic. Sure, this would be nice and all, but how will this agreement actually play out in real life? Figuring out processes for performance reports, reviews, and remediation is the difference between successful implementation and puffery.
First things first, you need a way to track your SLIs. Without that, you won’t have a verifiable way to show you’re delivering the data at an acceptable level of quality. So you’ll need a data observability tool that lets you measure end-to-end data health and pipeline metadata. This will give you total coverage over data-at-rest and data-in-motion so you can track KPIs that influence your data product.
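As a minimal sketch of what SLA performance reporting might look like, assuming your observability tool exposes per-delivery pass/fail results (the names and report format here are hypothetical):

```python
def uptime_percentage(delivery_results: list) -> float:
    """Share of deliveries (e.g. daily loads) that passed the freshness check."""
    if not delivery_results:
        return 0.0
    return 100 * sum(delivery_results) / len(delivery_results)

def sla_report(delivery_results: list, target_pct: float) -> str:
    """One-line performance summary to share between providers and consumers."""
    pct = uptime_percentage(delivery_results)
    status = "MET" if pct >= target_pct else "MISSED"
    return f"uptime {pct:.1f}% vs target {target_pct}% -> {status}"
```

Even a summary this simple gives both sides a shared, verifiable artifact to review, which is what the questions below depend on.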
Once you have end-to-end data observability, who will report SLA performance to whom? And what happens if an SLA is missed?
There is no right or wrong answer to those questions. For a data SLA between your organization and an external consumer, there is a clear course for performance reviews and remediation: the customer lets you know if your data product isn’t meeting standards and you give them a credit or refund as a result.
Things are a little more complicated when the SLA is between internal stakeholders. The goal here isn’t to be punitive, but to shed light on what exactly is causing data health to degrade within the organization.
An SLA miss can lead to very productive outcomes like reallocating budgets to create dedicated support teams, increasing headcount, or adopting new technologies. You’ll have a documented scenario showing which areas of your business are impacted by inefficiencies in your data organization, which makes building a business case for new projects much easier.
Setting SLAs can take a lot of work. Sometimes you’ll need to change your organizational structure, your architectural structure, or your company culture. These questions can act as the underlying framework for creating alignment between your data providers and data consumers.
Once you know how to define your SLA, the next best step is setting up a system for measuring and enforcing that SLA so you can deliver better data products. Setting up a system for end-to-end data observability is the best starting point for that. Once you are able to measure your KPIs, you’ll have the insights you need to make the cultural or technological shifts necessary to better support your data consumers.