The data supply chain: first-mile reliability

Walmart revolutionized supply chain management. Now more than ever, their mantra of “the best supply chain wins” rings true as they duke it out with Amazon for retail supremacy. It’s a battle for efficiency. Whoever delivers more products, more efficiently, with fewer errors, wins. 

Data management is a lot like supply chain management in that respect. It strives for the same goals, suffers from the same challenges, and faces the same hard-to-solve problem: first-mile reliability.

What is the data supply chain?

Supply chain management is the practice of overseeing the flow of goods, materials, services, and processes that transform raw materials into final products. When you think of your data management and your data platform, ideally, it would function the same way. You seek out sources of raw data from internal and external providers. You ship that data to your lake where it is processed and structured. From there, it is sent to a data warehouse to be aggregated. Finally, it is queried from the warehouse and consumed.

Figure: the data supply chain

Viewing your data platform as a data supply chain allows you to connect the dots between different stages of the lifecycle and their effect on the end-product. This shift in thinking coincided with the adoption of the data-as-a-product model (DaaP).

The data supply chain and DaaP

Data teams were collecting more data than ever before, but they weren’t doing a great job of making it usable. With only 3% of organizations able to meet data quality standards, data teams needed a new strategy. 

To stop bleeding storage costs, data teams adopted the data-as-a-product model (DaaP). This philosophy dictates that data teams should treat the data they provide as a product and the people who use it as their customers. Essentially, the main KPI for every member of the data team is now based on the value they provide for end users.

Establishing a data supply chain requires data teams to break down the technical and operational silos in their organizations because everyone in the data organization is held accountable for the end-to-end process, rather than just the section they have ownership over.

The hard part of your data supply chain: the first mile

Data supply chains allow you to optimize your data life cycle, deliver more value to your customers, and cut excess costs. While that all sounds great, there’s a big, hairy problem that stands in your way.

In supply chain management, there’s something called the first-mile/last-mile problem. The first-mile problem refers to the difficulty in getting the raw materials for your supply chain from wherever they are extracted to wherever they go to get processed. Conversely, the last-mile problem refers to the difficulty in getting the finished product from the shipping depot to the customer. These very same problems also plague your data supply chain.

That last mile refers to making data in your warehouse accessible and usable to your analysts, scientists, and end-users. A lot of thought, effort, and money has gone into alleviating the last-mile problem. Managed transformation tools, dataset monitoring tools, and hybrid cloud infrastructure have proliferated to help analysts and data scientists query the data they need faster.

The first mile, on the other hand, is a problem that hasn’t gotten as much attention from the industry. For many data-intensive organizations, the data product is fueled by tens to hundreds of external data sources. Schema changes, volume anomalies, and late deliveries plague this first mile and then go on to infect your downstream warehouse tables and business processes. The reliability of all those data sources is questionable, and they represent points of failure that are outside a data team’s control and awareness. These types of organizations need to dedicate more time and energy to this problem if they want to successfully deliver their data products.

Steps to solving the first-mile problem

Facebook, TikTok, and Snowflake have all proven you can create a lot of value for customers by collecting user data. That said, you can only take it so far. Consumers are demanding more comprehensive insights, and for tooling, more consolidated tech stacks. Users aren’t able to provide all of the data they find useful. If the future of data products lies in external data, there needs to be first-mile reliability.

What steps can you take to solve the first-mile problem? 

  1. Identify the common problems
  2. Start collecting and monitoring metadata
  3. Establish SLAs

The goal of this list is to set you up with the building blocks for success that you can apply to your data organization. Let’s dive into each step in more detail.

Identify common problems

You might know that the first mile of your data supply chain is painful, but you don’t have the monitoring and tracking capabilities to pinpoint performance trends and root causes quickly. That’s okay; we’ll tackle that in the next step. That said, every data system is highly contextual. If you’ve been at your company for any amount of time, you probably have a gut feeling about the most painful issues.

Some examples could include:

  • Schema changes from API calls
  • Data format issues from sources without an API
  • Pipeline failures during structuring
  • Late deliveries from sources
  • Anomalous data volumes & run durations

This will help you narrow down which kinds of data sources, and which of their associated data pipelines, need better reliability.
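To make the first item on that list concrete, here is a minimal sketch of a schema check for records arriving from an external source. The `EXPECTED_SCHEMA` mapping, the field names, and the `diff_schema` helper are all hypothetical; a real source would need its own expected schema and probably richer type handling.

```python
from typing import Any

# Hypothetical expected schema for one external source: field name -> Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "created_at": str}


def diff_schema(record: dict[str, Any], expected: dict[str, type]) -> dict[str, list[str]]:
    """Compare one incoming record against the expected schema."""
    missing = sorted(set(expected) - set(record))
    unexpected = sorted(set(record) - set(expected))
    wrong_type = sorted(
        name
        for name, expected_type in expected.items()
        if name in record and not isinstance(record[name], expected_type)
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


# Illustrative record: order_id changed type, created_at vanished, currency appeared.
record = {"order_id": "A-17", "amount": 19.99, "currency": "USD"}
report = diff_schema(record, EXPECTED_SCHEMA)
print(report)
```

Running a check like this at ingestion time, before data lands in the lake, is one way to notice a schema change at the source instead of in a broken warehouse table.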

Start collecting and monitoring metadata

Once you know which data sources and ingestion pipelines require the most TLC, you need to start collecting metadata from those processes. This step lays the groundwork for a data observability framework.

Figure: hierarchy of data observability

It allows you to gain insights into how your system is performing. You can visualize this metadata in a time series and start to identify trends, like how your system behaves when it’s healthy and when it’s not. With that level of context, you can start setting up advanced alerting so you can reduce your time-to-detection and time-to-resolution on these first-mile issues.
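As a rough illustration of alerting on a metadata time series, the sketch below flags a daily row count that sits far from its recent history. The z-score rule, the threshold of 3, and the sample numbers are assumptions chosen for the example, not a recommended detection method.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it is more than `threshold` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to judge what "healthy" looks like
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # perfectly flat history: any change is a deviation
    return abs(latest - mu) / sigma > threshold


# Illustrative row counts from the last seven deliveries of one source.
daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_890, 10_075, 10_140]

print(is_anomalous(daily_row_counts, 10_100))  # a typical day
print(is_anomalous(daily_row_counts, 4_300))   # a sharp drop in volume
```

The same function could be pointed at run durations or delivery times, since those are also just numbers collected once per run.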

There are three levels of context you can gain from collecting metadata: Operational Health & Dataset Monitoring, Column-level Profiling, and Row-level Validation. The most important is the foundational layer. It allows you to monitor data in motion and data at rest, and it lays the contextual foundation you will need to quickly detect and resolve issues.

Data at rest

Monitoring “data at rest” refers to monitoring a dataset as a whole. You are getting visibility into the state of your data while it’s in a static location like a table in your data lake, data warehouse, or source database. 
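A minimal sketch of what dataset-level metadata might look like, assuming the table is small enough to read as a list of dictionaries. The `profile_dataset` helper, the column names, and the choice of metrics (row count, null fraction, freshness) are illustrative assumptions; in practice these numbers would usually come from queries against the lake or warehouse.

```python
from datetime import datetime, timezone


def profile_dataset(rows, timestamp_column, now):
    """Summarize a static dataset: size, per-column null rate, and freshness."""
    row_count = len(rows)
    columns = sorted({col for row in rows for col in row})
    null_fraction = {}
    if row_count:
        for col in columns:
            nulls = sum(1 for row in rows if row.get(col) is None)
            null_fraction[col] = nulls / row_count
    timestamps = [
        row[timestamp_column] for row in rows if row.get(timestamp_column) is not None
    ]
    newest = max(timestamps) if timestamps else None
    # Freshness: hours between the newest loaded row and "now".
    freshness_hours = (now - newest).total_seconds() / 3600 if newest is not None else None
    return {
        "row_count": row_count,
        "null_fraction": null_fraction,
        "freshness_hours": freshness_hours,
    }


# Illustrative table contents.
loaded = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
rows = [
    {"id": 1, "email": "a@example.com", "loaded_at": loaded},
    {"id": 2, "email": None, "loaded_at": loaded},
    {"id": 3, "email": "c@example.com", "loaded_at": loaded},
    {"id": 4, "email": "d@example.com", "loaded_at": loaded},
]
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
profile = profile_dataset(rows, "loaded_at", now)
print(profile)
```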

Data in motion

Monitoring “data in motion” refers to monitoring data pipeline performance. This gives you awareness of the state of your data while it’s transforming and moving through your pipelines.
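One possible way to capture pipeline metadata is to wrap each run so that its status and duration are always recorded, even when the run fails. The `track_run` context manager, the in-memory `RUN_LOG`, and the pipeline names below are hypothetical; a real setup would send this metadata to whatever store your observability tooling reads from.

```python
import time
from contextlib import contextmanager

RUN_LOG: list[dict] = []  # hypothetical in-memory stand-in for a metadata store


@contextmanager
def track_run(pipeline_name: str):
    """Record status, duration, and any error for a single pipeline run."""
    run = {"pipeline": pipeline_name, "status": "running", "rows_written": 0}
    start = time.monotonic()
    try:
        yield run
        run["status"] = "success"
    except Exception as exc:
        run["status"] = "failed"
        run["error"] = repr(exc)
        raise  # the failure still propagates; we only record it
    finally:
        run["duration_seconds"] = time.monotonic() - start
        RUN_LOG.append(run)


# A healthy run reports how many rows it wrote.
with track_run("ingest_orders") as run:
    run["rows_written"] = 10_050

# A failing run is still logged before the exception reaches the caller.
try:
    with track_run("ingest_refunds") as run:
        raise ValueError("unexpected column 'currency'")
except ValueError:
    pass

for entry in RUN_LOG:
    print(entry["pipeline"], entry["status"], entry["rows_written"])
```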

While dataset and data pipeline monitoring are usually separated into two distinct activities, it’s important to keep them coupled together to achieve a solid foundation of observability. If you’re only looking at your datasets, you can know that an unexpected change occurred, but you won’t know why. Vice versa, if you’re only looking at your pipelines, you’ll know about pipeline failures and late deliveries, but you won’t know how transformations are affecting the data itself. By siloing these two activities, data engineers and platform engineers won’t have the context they need to find and fix issues fast.

Establish SLAs with your consumers

The ability to monitor the first mile of your data supply chain allows you to set and track KPIs for it and connect these issues to downstream dependencies in your warehouse. A great way of defining success with your internal and external users is by establishing a data SLA with them. 

A data SLA is a formal agreement between data providers and data consumers that defines an acceptable level of data quality. This acts as a contract between the data team (not just data engineering) and the consumer (internal or external). This SLA doesn’t need to be written down, but writing it down can help. At the very least, an unwritten SLA should take the form of a conversation between both parties where they outline clear and measurable expectations for the data products they deliver or consume.

It should answer questions like:

  • Which data tables/data pipelines require an SLA?
  • Are data consumer expectations realistic?
  • What metrics are used to measure success?
  • How will this data SLA be enforced?

With clarity on those questions, data teams can more effectively hold themselves accountable for the goals they set. With visibility into the first mile and quality standards set, data teams can begin to enter the iterative process that the DataOps Cycle enables.
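To show what “clear and measurable expectations” could look like once they are written down, here is a small sketch that encodes an SLA as data and checks one delivery against it. The `DataSLA` fields, the `check_sla` helper, the table name, and the thresholds are all assumptions made for illustration; your own SLA would use whichever metrics you and your consumers agree on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class DataSLA:
    table: str
    deliver_within: timedelta  # agreed window after the scheduled start
    min_rows: int              # smallest delivery the consumer considers complete


def check_sla(sla: DataSLA, scheduled_at: datetime,
              delivered_at: Optional[datetime], row_count: int) -> list[str]:
    """Return a list of human-readable breaches; an empty list means the SLA was met."""
    breaches = []
    if delivered_at is None:
        breaches.append("not delivered")
    else:
        overrun = (delivered_at - scheduled_at) - sla.deliver_within
        if overrun > timedelta(0):
            minutes = int(overrun.total_seconds() // 60)
            breaches.append(f"late delivery: {minutes} min over the agreed window")
    if row_count < sla.min_rows:
        breaches.append(f"low volume: {row_count} rows, expected at least {sla.min_rows}")
    return breaches


# Hypothetical SLA and one day's delivery.
sla = DataSLA(table="orders_daily", deliver_within=timedelta(hours=1), min_rows=8_000)
breaches = check_sla(
    sla,
    scheduled_at=datetime(2024, 1, 2, 6, 0),
    delivered_at=datetime(2024, 1, 2, 7, 45),
    row_count=9_000,
)
print(breaches)
```

Keeping the SLA as a plain data object means the same definition can be shown to consumers and evaluated automatically after every run.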

Achieving first-mile reliability requires visibility

Organizations are placing a much higher priority on collecting and utilizing more and more data. While there is operational and technological debt creating obstacles, there’s a bigger problem that these organizations face when confronting the first-mile problem. They’re collecting lots of data and processing it more efficiently, but they don’t have a lot of data about their data.

Metadata is the key to unlocking the first mile. Without metadata, you don’t have any visibility into your ingestion layer. You don’t know exactly where in your system problems are occurring. You don’t know how to trace those issues to a specific pipeline error or data source. And you don’t know where current performance stands against your goals.

Implementing a data observability solution is the road to a healthier data supply chain, a more efficient first mile, and a more reliable data product.