
Shipper Case Study

Eitan
2022-09-15 17:00:00

Shipper Detects Data Incidents From Days To Minutes With Databand

Key Results

100%

Automated detection of schema changes, missing data sources, and critical failures.

10X

Reduction of mean time to detection (MTTD) from days to minutes.

360

Visibility into data model changes from third-party APIs.

Company Overview

Shipper is one of the fastest-growing tech companies in Indonesia, working to digitize Indonesian logistics and enable cost-efficiencies at scale nationwide. Since its 2017 founding, Shipper has built a vast network of fulfillment centers and partnered with hundreds of local delivery companies across the country in pursuit of this goal.

The Challenges

In the highly competitive logistics industry, data uptime and data quality mean everything. Accurately calculating costs, delivery schedules, and inventory can make or break many businesses, especially those like Shipper that offer digital solutions.

Specifically, Shipper’s platform provides customers with a complete dashboard of metrics on shipment logistics so they can see all the pertinent information in a single place to make smarter decisions. As a result, it’s critical that this data is accessible and reliable.

As Shipper grew, the company not only expanded its customer base, but also started to provide more data to those customers to increase the value they get from the platform. Of course, more data sources (and more data to track for a growing customer base) mean more complex data pipelines.

This increasing complexity forced Shipper to rebuild its pipeline, shifting from on-prem Airflow and Spark to the latest cloud infrastructure with Amazon and Databricks. Unfortunately, the new data platform left the Shipper team with one major blind spot: data observability. As the business continued to scale, Shipper’s ingestion processes became more and more complex, and catching issues before SLAs were missed became nearly impossible.

Fithrah Fauzan, Data Engineering Lead at Shipper, points to three critical challenges the team experienced:

  1. Failed data SLAs due to inaccurate or missing data in customer-facing dashboards
  2. Lack of visibility into data model changes from third-party APIs
  3. Heavy costs to the business due to weekly failed pipelines

“Due to the complexity of our ingestion process and the lack of observability in that area, we’d only know if there was some kind of issue with our pipelines after we’d missed our SLA. From there, the only thing we could do was to ask the operational manager to fix it and backfill the data, which could take two to three days. When this was happening on a weekly basis, it became extremely costly and difficult to deal with,” Fauzan explains.

The Solution

Recognizing these challenges, the Shipper team knew they would need to find a solution sooner rather than later to continue growing the business effectively.

Their search for a solution that could help with end-to-end data observability led Shipper straight to Databand. In particular, they found value in Databand’s ability to support:

  • Root cause analysis with automatic notification management, logging, and lineage
  • Automated detection of schema changes, missing data sources, and critical failures
  • Orchestrated remediation workflows for data issue notifications to their DevOps alerting system
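As a rough illustration of what schema-change detection involves, here is a minimal Python sketch that compares an expected schema against an observed one. It is not Databand's implementation or API; the function name `diff_schema`, the column names, and the type labels are all hypothetical.

```python
def diff_schema(expected: dict[str, str], observed: dict[str, str]) -> dict[str, list[str]]:
    """Compare two column-name -> type mappings and report the differences."""
    added = sorted(set(observed) - set(expected))
    removed = sorted(set(expected) - set(observed))
    retyped = sorted(
        col
        for col in set(expected) & set(observed)
        if expected[col] != observed[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical schemas for a third-party shipment feed (illustrative only).
expected = {"shipment_id": "string", "weight_kg": "float", "eta": "timestamp"}
observed = {"shipment_id": "string", "weight_kg": "string", "carrier": "string"}

changes = diff_schema(expected, observed)
if any(changes.values()):
    print("Schema change detected:", changes)
```

A check like this would typically run at ingestion time, so a changed API response raises an alert before it reaches a customer-facing dashboard.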

According to Fauzan, implementing Databand had an immediate positive impact on the Shipper team’s ability to track pipeline errors, schema changes, and other data quality issues at scale. As a result, the team can identify issues before they miss any SLAs and resolve those issues faster.

Shipper’s customers feel the benefits of this visibility too. Fauzan shares: “Customers are using our dashboard to report shipment metrics for their business. If data pipelines fail and we miss our SLA, the dashboard will not be correct. Having a way to know whether the data will be delivered and in the right form is extremely important to our customers.”

Business Impact and Results

From tracking data more easily to resolving issues faster, the Shipper team reports that implementing Databand has had a significant, positive business impact.

Reduced Mean Time to Detection and Mean Time to Resolution for Greater System Uptime

Previously, the Shipper team could only detect problems in their data pipeline by manually QAing the data delivery or – worse yet – through customer or team complaints. That’s because pipeline failures weren’t a part of their resolution flow.

Now, with Databand, Shipper can set up pipeline alerts on their ingestion process, pipeline statuses, and anomalous run durations, which has reduced the mean time to detection (MTTD) on issues from three days to mere minutes.
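To make the idea of an anomalous run duration concrete, the sketch below flags a run whose duration deviates strongly from recent history using a simple z-score. This is an assumed, generic approach and not a description of how Databand computes its alerts; the function name, the threshold of 3.0, and the sample durations are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous_duration(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Return True if `latest` is more than `threshold` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical run durations in minutes for one ingestion pipeline.
past_runs = [12.0, 11.5, 12.4, 11.9, 12.2]

slow_run_flagged = is_anomalous_duration(past_runs, 45.0)
normal_run_flagged = is_anomalous_duration(past_runs, 12.3)
```

In practice the flagged result would be routed to an alerting tool rather than returned to a caller, and the threshold would be tuned per pipeline.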

This real-time capturing of data quality issues during ingestion has also empowered the team to improve their mean time to resolution (MTTR). Now, they can detect and resolve issues in real-time thanks to the Databand alerts, which connect directly to the team’s existing workflows in Opsgenie and Jira.

Once Databand detects an issue, the Shipper team can quickly conduct a root cause analysis. Specifically, the logs within Databand enable the team to diagnose the affected pipeline in minutes, rather than spending hours tracking down pipeline owners, searching through logs, and tracing source lineage.

Altogether, the visibility provided by Databand has dramatically improved system uptime and given the engineering team much-needed peace of mind.

Improved Data SLAs for Happier Customers

The lack of visibility the Shipper team had before Databand meant they couldn’t track progress toward meeting SLAs until after they had already missed those commitments. This meant Fauzan needed to manually track pipeline successes and failures retroactively to understand performance.

Databand has changed this entirely, making it easy for Shipper to measure and guarantee their SLAs in real-time. Now, Fauzan can use the Databand dashboard to quickly see how the team is tracking toward their SLAs and visualize how much of an error budget they have left for the rest of the month.
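The error budget mentioned above is simple arithmetic: the SLA target determines how many failed runs are tolerable in a period, and the budget is whatever share of that allowance remains. The sketch below assumes a run-count-based SLA; the function name and the figures (a 99% target over 1,000 monthly runs) are hypothetical and are not Shipper's actual numbers.

```python
def error_budget_remaining(slo_target: float, total_runs: int, failed_runs: int) -> float:
    """Return the fraction of the error budget still unspent, clamped to [0, 1]."""
    allowed_failures = (1.0 - slo_target) * total_runs
    if allowed_failures <= 0:
        # A 100% target leaves no budget: any failure exhausts it.
        return 0.0 if failed_runs else 1.0
    return max(0.0, 1.0 - failed_runs / allowed_failures)


# Hypothetical month: 99% of 1,000 pipeline runs must succeed, 4 have failed so far.
remaining = error_budget_remaining(slo_target=0.99, total_runs=1000, failed_runs=4)
print(f"Error budget remaining: {remaining:.0%}")
```

With these assumed numbers, about ten failures are allowed in the month, so four failures leave roughly 60% of the budget.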

This improved ability to meet data SLAs for both external customers and internal data consumers has measurably improved the user experience, leading to happier customers engaging with Shipper dashboards.


Without Databand, we didn’t know we had problems until two or three days later. Databand helps us detect data quality issues faster so we can meet our data SLAs.

Fithrah Fauzan
Data Engineering Lead at Shipper

Read G2 Peer Reviews

Ryan Yackel
2022-09-12 13:45:25

Trax Retail Case Study

Marco Alcaria
2022-09-08 22:52:55

Trax Retail Drastically Reduces Data Incidents, Increases Customers by 3x

Key Results

99%

Reduction of data incidents across all ML pipelines.

3X

Increase in customer base while keeping engineering costs flat.

96%

Model accuracy by improving data pipeline reliability.

Company Overview

Trax Retail offers advanced solutions for dynamic merchandising, in-store execution, shopper engagement, market measurement, analytics, and shelf monitoring to help drive positive shopper experiences and unlock revenue opportunities at all points of sale. As a global pioneer serving customers in more than 90 countries, Trax Retail leads the industry in innovation and excellence through the development of advanced technologies and autonomous data collection methods.

The Challenges

Trax Retail’s solutions are based on AI models trained using neural networks to recognize products on the shelf in supermarkets and grocery stores. The company’s AI-Engineering team supports these models, focusing on researching infrastructure, training the AI, and retraining the models as needed to maintain accuracy.

They also manage the production pipeline to monitor for issues.

According to Tzoof Hemed, AI-Engineering Team Leader at Trax Retail, this entire deep learning process is typically very tedious. It requires collecting and processing data, training the models, monitoring that work, and then comparing it with current solutions. For instance, a single pass through this process could take a data scientist weeks to complete.

This type of process is particularly challenging at scale.

Each Trax Retail customer requires a unique AI model, and that adds up quickly. As the company grew, conducting large scale training across the entire customer base required team members to write down the IP addresses of the servers they used so they could monitor each of their experiments and runs to see the outputs of that training.

And the process doesn’t stop there. The type of deep learning pipelines the Trax Retail team builds are quite long. That means they’re composed of multiple tasks, and a failure in a single task can affect all the others that follow.

In turn, this situation makes debugging complicated without a clear view of the specific scope of each task and its larger impact.
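To show why a single failed task matters in a long pipeline, the sketch below walks a task graph and lists everything downstream of the failure. The task names and the graph are hypothetical and do not describe Trax Retail's actual pipeline.

```python
from collections import deque


def downstream_tasks(dag: dict[str, list[str]], failed: str) -> list[str]:
    """Return every task reachable from `failed`, in breadth-first order."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(dag.get(failed, []))
    while queue:
        task = queue.popleft()
        if task in seen:
            continue
        seen.add(task)
        order.append(task)
        queue.extend(dag.get(task, []))
    return order


# Hypothetical training pipeline: each key lists the tasks that depend on it.
pipeline = {
    "collect_images": ["preprocess"],
    "preprocess": ["train_model"],
    "train_model": ["evaluate", "export_model"],
    "evaluate": [],
    "export_model": [],
}

affected = downstream_tasks(pipeline, "preprocess")
print("Tasks affected by the failure:", affected)
```

Knowing this affected set up front narrows debugging to the failed task and its dependents instead of the whole pipeline.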

However, Hemed shares that not only did his team not have this detailed view, but their network was actually more of a “black box,” which made the debugging process even more difficult.

Recognizing these challenges, Hemed and his team began the search for a solution that would allow them to increase the scalability of their work to support Trax Retail’s growing business.

“We were looking for something that would orchestrate all the different steps that we take, which could be dozens or even hundreds depending on the implementation. We wanted to save our team time so each person didn’t have to monitor their servers or tasks at the individual level. We wanted a single platform where someone could just log in, see all their experiments, confirm all their runs went through, and easily search for details like the outputs of different tasks and the results of training,” Hemed explains.

The Solution

The search for a solution to automate the deep learning process, including collecting and processing data, training data, monitoring AI models, and debugging, led the Trax Retail team to Databand.

Databand’s proactive data observability platform offered exactly what the Trax Retail team needed. Specifically, it helped solve two key challenges Hemed and his team faced:

  1. Automating the data pipeline, including optimizing how data gets collected, processed, and trained.
  2. Monitoring data pipelines to understand accuracy and support debugging needs.

Based on these capabilities, Hemed shares that the decision to implement Databand was an easy one: “Databand gives us a single platform where everyone can talk about the same outputs, and that’s a key point for us. It allows us to see all of our deep learning training pipelines in one view to troubleshoot any issues, plus we can easily compare pipelines to see what’s working and what’s stopped working.”

The Trax Retail team implemented Databand alongside the company’s Kubernetes engine, which allows for true automation. Hemed reports that Databand now implements all requests in Kubernetes and ensures that runs are complete, which means his team doesn’t have to make any deployments in Kubernetes.

“We just run a command line. It’s that easy,” he says.

Business Impact and Results

Trax Retail has seen incredible business impact since implementing Databand, most notably around reduced pipeline incidents and increased scalability of the AI-Engineering team.

Reduced Pipeline Incidents

Prior to using Databand, Hemed reports that about 60% of his team’s pipelines experienced data incidents. Since implementing Databand, that number has dropped down to less than 1%.

He attributes this in large part to Databand’s ability to surface those incidents quickly and allow for immediate action. “Without Databand, finding those incidents would take days of debugging, including looking into all of the outputs and logging them. With Databand, we can find the incidents in minutes. Having everything in the same place just makes it so much easier,” Hemed says.

In general, Hemed shares that his team is impressed with how many tests Databand enables them to run simultaneously and the downstream effect that has had on reducing pipeline incidents.

Increased Scalability of AI-Engineering Team

Databand has enabled the AI-Engineering team to train and manage more deep learning models in a sustainable way. In turn, this means the team can handle more models for more customers without having to add more data scientists to the mix.

“One of the main advantages of introducing Databand was the scale for our team. Databand allowed us to jump from one data scientist handling a few models to being able to manage hundreds of different runs. That was a big leap for us,” Hemed explains.

Notably, this increased scalability has enabled Trax Retail to triple their customer base while keeping engineering costs flat since they’ve been able to do more with the same team in place. Hemed concludes: “In order for our business to grow, we need to be able to provide more models to new customers. If our team can’t train those models, that’s a huge bottleneck. Databand has alleviated those concerns for us. And our customers can notice the impact too: We can get them set up with a new solution faster since it takes less time and fewer people working on it to complete.”


Databand allows us to run multiple tasks and monitor them all in the same platform. Quite simply, Databand has made the process of training and orchestrating a complicated and long pipeline of neural networks easier.

Tzoof Hemed
AI-Engineering Team Leader at Trax Retail