Data uptime and data quality play a huge role in the logistics industry. Calculating costs, delivery schedules, and inventory can make or break many businesses in this highly competitive industry. This challenge is even more pronounced for logistics aggregators that enable a digital supply chain experience, like our client, Shipper.
Shipper is one of the fastest-growing technology companies in Indonesia, providing end-to-end digital supply chain solutions for businesses of all sizes in a market where last-mile fulfillment and delivery is estimated at US$80B. Shipper’s technology platform connects its partners’ resources to deliver better logistics solutions for customers across Indonesia. This approach has enabled Shipper to build a network and scale within a few years, a feat that has taken others multiple decades, while offering cost efficiencies, nationwide scale, and end-to-end visibility.
A platform like this sounds simple on the surface, but in reality, it hides an intricate web of pipelines and hundreds of third-party sources.
Data quality correlates with product ROI
Data organizations like Shipper provide value by presenting all the information you need in one place to make smarter decisions. The more data you provide, the more value your customers get. Unfortunately, that also means more data sources and more complex pipelines.
Due to the nature of this challenge, Shipper’s data team is mature for a business of its size. Like many data organizations, every business unit has its own ingestion processes. But as mature as the team is, it can’t hold everything together without the right tools.
“Good data quality means the data is what you expect. I mean that in terms of the volume of data sent to our data lake, the data’s schema, and the job duration of our pipelines. Once the data is in our lake, it can be accessed by the Business Intelligence team, the data analysts, and automatic processes that feed the product’s dashboard. If the data in the lake is in the wrong format, incomplete, or contains duplicates, it is hard to solve that issue before it is utilized in major areas of the business, including the customer-facing platform.
Customers are using our dashboard to report shipment metrics for their business. If data pipelines fail and we miss our SLA, the dashboard will not be correct. Having a way to know whether the data will be delivered and in the right form is extremely important to our customers.
But keeping track of pipeline errors, schema changes, and other data quality issues [across over 100 DAGs and many more sources] is very difficult.” — Fithrah Fauzan
Shipper built their data platform a few years ago using on-prem Airflow and Spark, loading data into a data lake on S3 in Parquet format. This worked well in the beginning, but eventually the platform couldn’t scale any further, and maintaining Airflow became a problem.
So Shipper rebuilt their pipelines on modern cloud infrastructure, decoupling the DAG scheduling structure from the business logic so they can scale as needed: Amazon MWAA triggers ECS Fargate jobs and Databricks Spark jobs.
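The decoupling described above can be pictured as a config-driven layer: pipelines are declared as plain data, and a small factory maps each declaration onto the runner that executes it, so adding a pipeline means editing configuration rather than DAG code. The sketch below is a hypothetical illustration of that pattern (the job names and fields are invented), not Shipper’s actual implementation.

```python
# Hypothetical config-driven job registry: the scheduling layer iterates over
# declarations and produces runner-specific task settings.
JOBS = [
    {"name": "replicate_orders", "runner": "fargate", "cpu": 512},
    {"name": "transform_shipments", "runner": "databricks", "cluster": "etl"},
]

def build_task_spec(job: dict) -> dict:
    """Map a declarative job entry onto runner-specific task settings."""
    if job["runner"] == "fargate":
        # Would feed an ECS Fargate task launched by the orchestrator.
        return {"operator": "EcsRunTaskOperator", "launch_type": "FARGATE",
                "task_definition": job["name"], "cpu": job["cpu"]}
    if job["runner"] == "databricks":
        # Would feed a Databricks Spark job submission.
        return {"operator": "DatabricksSubmitRunOperator",
                "job_name": job["name"], "existing_cluster": job["cluster"]}
    raise ValueError(f"unknown runner: {job['runner']}")

task_specs = [build_task_spec(j) for j in JOBS]
```

The payoff of this shape is that business logic lives in the containers and Spark jobs, while the orchestrator only needs to know which runner to invoke.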
However, there was a major blind spot they didn’t account for as they built their new data platform: data observability. As the business scaled, the ingestion processes became more and more complex, and catching issues before SLAs were missed became nearly impossible.
At Shipper, you can group data sources into three categories.
The first category comes from Shipper’s backend operational systems. The operational databases are replicated using CDC, copying data from the RDS instances backing internal services. The issues that can occur at this stage are fairly straightforward: during replication, pipelines can deliver data late or in the wrong format.
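Catching these two failure modes (late data and low volume) can come down to comparing a replicated table’s row count and last update time against thresholds. A minimal sketch, with invented threshold values:

```python
from datetime import datetime, timedelta, timezone

def check_replication(row_count: int, expected_min: int,
                      last_update: datetime, max_lag: timedelta) -> list:
    """Return a list of problems with a replicated table (empty = healthy)."""
    problems = []
    # Volume check: did the replication deliver roughly as many rows as usual?
    if row_count < expected_min:
        problems.append(f"volume too low: {row_count} < {expected_min}")
    # Freshness check: is the newest data within the allowed lag?
    lag = datetime.now(timezone.utc) - last_update
    if lag > max_lag:
        problems.append(f"data is stale by {lag}")
    return problems
```

In practice an observability tool learns the `expected_min` and `max_lag` baselines from history rather than hard-coding them, but the checks themselves are this simple.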
The other sources cause more complicated problems: Google Drive files and third-party APIs.
Shipper’s operational teams are masters of their domains. When needed, they add data to the backend operational systems and provide their own input and reports to Shipper’s data lake in the form of .xls or .csv files. But like all human-generated data, these files have issues with schema consistency.
The remaining two categories are third-party sources: ones that provide an API and others that are ingested via scraping. For sources without an API, Shipper builds automated pipelines that scrape the data and store it in the data lake. The problem with third-party APIs is that they can change at any time to better fit the provider’s data models. Sometimes the provider lets you know in advance. Other times they don’t. Either way, Shipper is the one left picking up the pieces.
Google Sheets: the bane of data engineering
Many people still use spreadsheets: they are easy to use and allow you to quickly store and import the data you need. But they can get out of control pretty fast.
You might hit a spreadsheet’s row or column limit, forcing you to split one dataset into multiple files to ingest, or someone might make an accidental schema change. If schemas deviate from what the pipeline is built to handle, the pipelines can stall.
There are so many pitfalls. Everyone is uploading datasets, splitting files, and adding data. This inconsistency in governance spells trouble for data pipelines that need to follow strictly coded logic. Even if the pipeline succeeds, data might not get populated into the dashboard correctly, if at all.
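One common mitigation for spreadsheet ingestion is to validate an uploaded file’s header against the expected schema before the pipeline touches any rows, rejecting the file with a clear message instead of stalling mid-run. A minimal sketch, where the expected column names are invented for illustration:

```python
import csv
import io

# Hypothetical expected schema for an uploaded shipment report.
EXPECTED = ["order_id", "origin", "destination", "weight_kg"]

def validate_header(csv_text: str) -> list:
    """Compare an uploaded sheet's header row against the expected schema.

    Returns a list of human-readable errors; an empty list means the header
    matches and the file is safe to hand to the pipeline.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    missing = [c for c in EXPECTED if c not in header]
    extra = [c for c in header if c not in EXPECTED]
    errors = []
    if missing:
        errors.append(f"missing columns: {missing}")
    if extra:
        errors.append(f"unexpected columns: {extra}")
    return errors
```

A check like this turns a silent downstream failure into an immediate, actionable message for the person who uploaded the file.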
“I think it’s a common problem in our industry because many organizations are still using Excel and Google Sheets to store and import data into their data lake. In these spreadsheets, it’s nearly impossible to track how fresh the data is and whether some kind of typo could cause a breaking schema change.
[Due to the complexity of our ingestion process and the lack of observability in that area,] we’ll only know there is some kind of issue with our pipelines after we’ve missed our SLA. From there, the only thing we can do is ask the operational manager to fix it and backfill the data, which can take two to three days. When this happens on a weekly basis, it becomes extremely costly and difficult to deal with.” — Fithrah Fauzan
Shipper decreased their mean time-to-detection from three days to minutes
Right now, Shipper is using Databand to measure and guarantee their SLAs. Before Databand, Fithrah had to measure his team’s performance by manually tallying how many pipelines succeeded or failed in the last month. Today, he can find that out very quickly using the Databand dashboard, which also tracks how much of an error budget he has left for the rest of the month.
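The arithmetic behind an error budget is simple: an SLO such as “99% of runs succeed” allows a fixed number of failures per period, and each failure spends part of that allowance. A sketch of the calculation, assuming a hypothetical 99% success SLO:

```python
def error_budget_left(total_runs: int, failed_runs: int, slo: float = 0.99) -> float:
    """Fraction of the error budget still unspent for the period.

    1.0 means no failures yet; 0.0 or below means the budget is exhausted
    and the SLO has been (or is about to be) breached.
    """
    allowed_failures = total_runs * (1 - slo)  # failures the SLO tolerates
    if allowed_failures == 0:
        return 1.0 if failed_runs == 0 else 0.0
    return 1 - failed_runs / allowed_failures
```

For example, with 1,000 runs a month and a 99% SLO, 10 failures are tolerated; 5 failures leave half the budget, and 20 failures put the team firmly in the red.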
More importantly, Shipper can now set up pipeline alerts on their ingestion process. In the past, the only way they would know if there was a problem was if they manually QA’d the data delivery, someone from BI told them, or if a customer complained. That’s because pipeline failures weren’t a part of their resolution flow.
Now, they can detect and resolve issues in real time with alerts on pipeline statuses and anomalous run durations. Anytime there is a problem with a pipeline, Databand fires an alert and pushes it to Opsgenie and Jira, so the Shipper team can quickly detect and prioritize pipeline issues.
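Opsgenie ingests alerts as JSON via its Alert API (a POST to `/v2/alerts`). Databand handles this integration for you, but for intuition, the payload a monitor might construct for a failed pipeline looks roughly like the following; the pipeline name and the priority mapping are invented for illustration:

```python
def build_alert(pipeline: str, status: str, run_duration_s: float) -> dict:
    """Construct an Opsgenie-style alert payload for a pipeline event.

    Field names (message, priority, tags, details) follow Opsgenie's Alert
    API schema; the priority mapping is a hypothetical policy.
    """
    return {
        "message": f"Pipeline {pipeline} {status}",
        # Treat outright failures as higher priority than anomalies.
        "priority": "P2" if status == "failed" else "P4",
        "tags": ["data-pipeline", pipeline],
        # Opsgenie "details" values are strings.
        "details": {"status": status, "run_duration_s": str(run_duration_s)},
    }
```

Routing every pipeline event through a structure like this is what lets the on-call workflow (paging, deduplication, Jira tickets) treat data incidents like any other production incident.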
“When a pipeline fails, we know as soon as it happens. An alert gets automatically pushed to Opsgenie, and we can get started resolving the issue before we need to deal with backfilling that data and missing the SLA.” — Fithrah Fauzan
Once they detect the issue, root cause analysis can happen quickly. Using the “Logs” tab, they can diagnose the affected pipeline in minutes, rather than spending hours or days tracking down owners of pipelines, searching through logs, and tracing source lineage.
Databand has completely revolutionized Shipper’s operational flow. Shipper has significantly improved their data deliveries and provides more value through their platform. They have dramatically improved system uptime and reduced their time-to-detection from two to three days to minutes.
With unified observability and alerting, Shipper can deliver a better product and give the engineering team some much-needed peace of mind. Equipped with end-to-end observability, Fithrah looks towards the future:
“Now that we know that our DAGs are running successfully and on time, I think the most important thing for us in the future is implementing the data quality tracking within Databand. Sometimes, you can meet your data delivery SLA, but the data quality is bad. That still hurts consumer trust in our product.” — Fithrah Fauzan
Databand goes deeper and wider than any other platform
Shipper’s situation is not unique. Do you have hundreds of sources and strict data SLAs? You need an observability platform that lets you monitor pipeline health and data quality.
By connecting to pipeline orchestrators like Apache Airflow and centralizing your end-to-end metadata, Databand can help you identify data quality issues and their root causes from a single dashboard.
Want to know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives? Get started today for free!