The Impact of Bad Data and Why Observability is Now Imperative
Think the impact of bad data is just a minor inconvenience? Think again.
Bad data cost Unity, a publicly-traded video game software development company, $110 million.
And that’s only the tip of the iceberg.
The Impact of Bad Data: A Case Study on Unity
Unity stock dropped 37% on May 11, 2022, after the company announced its first-quarter earnings, despite strong revenue growth, decent margins, good customer growth, and continued high performance in dollar-based net expansion.
But there was one data point in Unity’s earnings that were not as positive.
The company also shared that its Operate revenue growth was still up but had slowed due to a fault in its platform that reduced the accuracy of its Audience Pinpointer tool.
Unity ingested bad data from a large customer into its machine learning algorithm, which helps place ads and allows users to monetize their games. This not only resulted in decreased growth but also ruined the algorithm, forcing the company to fix it to remedy the problem going forward.
The company’s management estimated the impact on the business at approximately $110 million in 2022.
Unity Isn’t Alone: The Impact of Bad Data is Everywhere
Unity isn’t the only company that has felt the impact of bad data deeply.
On April 25, 2022, Twitter accepted a deal to be purchased by Tesla and SpaceX founder Elon Musk. A mere 18 days later, Musk shared that the deal was “on hold” as he confirmed the number of fake accounts and bots on the platform.
What ensued demonstrates the deep impact of bad data on this extremely high-profile deal for one of the world’s most widely-used speech platforms. Notably, Twitter has battled this data problem for years. In 2017, Twitter admitted to overstating its user base for several years, and in 2016 a troll farm used more than 50,000 bots to try to sway the US presidential election. Twitter first acknowledged fake accounts during its 2013 IPO.
Twitter, like Unity, is another high-profile example of the impact of bad data, but examples like this are everywhere – and it costs companies millions of dollars.
Gartner estimates that bad data costs companies nearly $13 million per year, although many don’t even realize the extent of the impact. Meanwhile, Harvard Business Review finds that knowledge workers spend about half of their time fixing data issues. Just imagine how much effort they could devote elsewhere if issues weren’t so prevalent.
Overall, bad data can lead to missed revenue opportunities, inefficient operations, and poor customer experiences, among other issues that add up to that multi-million dollar price tag.
Why Observability is Now Imperative for the C-Suite
The fact that bad data costs companies millions of dollars each year is bad enough. The fact that many companies don’t even realize this because they don’t measure the impact is potentially even worse. After all, how can you ever fix something of which you’re not fully aware?
Getting ahead of bad data issues requires data observability, which encompasses the ability to understand the health of data in your systems. Data observability is the only way that organizations can truly understand not only the impact of any bad data but also the causes of it – both of which are imperative to fixing the situation and stemming the impact.
It’s also important to embed data observability at every point possible with the goal of finding issues sooner in the pipeline rather than later because the further those issues progress, the more difficult (and more expensive) they become to fix.
Critically, this observability must be an imperative for C-suite leaders, as bad data can have a serious impact on company revenue (just ask Unity and Twitter). Making data observability a priority for the C-suite will help the entire organization – not just data teams – rally around this all-important initiative and make sure it becomes everyone’s responsibility.
Identify data issues earlier on in the data pipeline to stem their impact on other areas of the platform and/or business
Pinpoint data issues more quickly after the pop up to help arrive at solutions faster
Understand the extent of data issues that exist to get a complete picture of the business impact
In turn, this visibility can help companies recover more revenue faster by taking the necessary steps to mitigate bad data. Hopefully, the end result is a fix before the issues end up costing millions of dollars. And the only way to make that happen is if everyone, starting with the C-suite, prioritizes data observability.
Apache Spark use cases for DataOps in 2021
Apache Spark is a powerful data processing solution, and use cases for Apache Spark are near limitless. Over the last decade, it has become core to big data architecture. Expanding your headcount and your team’s knowledge in Spark is a necessity as data organizations adapt to market needs.
Just as the data industry matures, so do its tools. The meteoric rise in popularity of Databricks over its open-source origin clearly shows the overarching trend in the industry: The need for Apache Spark is growing, and the teams that use it are becoming more sophisticated.
The use case for Apache Spark is rooted in Big Data
For organizations that create and sell data products, fast data processing is a necessity. Their bottom line depends on it.
Science Focus estimates Google, Facebook, Microsoft, and Amazon store at least 1,200 petabytes of information. The amount of data they have collected is unthinkable, and for them, it’s mostly inaccessible. Even running on state-of-the-art tools, their data infrastructure cannot process and make meaningful use of all of the stored information.
That’s Big Data. Being able to keep pace with the amount of collected data processing required, and doing so quickly and accurately, means these companies can make their data products (e.i. platforms, algorithms, tools, widgets, etc.) more valuable to their users.
Though, you don’t necessarily need to have millions of users to need Spark. You just need to work with large datasets. Smaller data-driven organizations that have high standards for their data quality SLAs also use Spark to deliver more accurate data, faster. This data then powers their machine learning products, their analytical products, and other data-driven products.
Spark is a powerful solution for some organizations yet overkill for others. For those organizations, Apache Spark use cases are limited because the volume of data they process isn’t large enough, and the timeliness in which data must be delivered isn’t urgent enough to warrant the cost of computation.
When DataOps isn’t handling Big Data, they can’t justify building out a dedicated engineering team, and they can’t justify using specialized tools like Spark. The added complexities and connections to your infrastructure just leave more room for error. In this situation, every time data is extracted, passed to a cluster, computed, combined, and stored, you could open your pipeline to another opportunity for failures and bugs that are hard to catch.
How does Apache Spark work?
Let’s go over the basics of Spark before we start talking about use cases. And to do that, we should start with how Apache Spark came to be.
In the early 2000s, the amount of data being created started outpacing the volume of data that could be processed. To unplug this bottleneck, Hadoop was created based on the MapReduce design pattern. On a very high level, this design pattern sought to divide datasets into small pieces and “map” them to worker nodes (or in this case, separate disk drives called HDFS) to process the data in batches, and then “reduce” them into an “overall outcome,” so to speak.
This worked well for a time, but as the volumes of data and the demand for greater processing speeds grew, the need for a new solution grew. Enter: Apache Spark.
Apache Spark followed the same principle of distributed processing but achieved it in a different way. Spark jobs distribute partitions to in-memory on RAM rather than on HDFS drives. This means that the job doesn’t require reading and writing data partitions to a disk every time. This made Apache Spark 100x faster than Hadoop and brought data teams closer to real-time data processing.
There are more nuances to what makes Spark so useful, but let’s not lose focus. We’re here to learn about Apache Spark use cases for data products.
Apache Spark use cases for DataOps Initiatives
You can do a lot with Spark, but in this article, we’re going to talk about two use cases that are shaping the industry today:
Productizing machine learning models
Decentralized data organizations and data meshing
Let’s talk about each one of these use cases, and why they matter for data products in more detail.
Productizing ML models
Machine learning programs are booming as organizations begin an investment arms race, looking for a way to get an edge in their market.
According to Market Research Future, the global machine learning market is projected to grow from $7.3B in 2020 to $30.6B in 2024. And it’s easy to see why. If implemented correctly, the ROI of a high-performing ML product can range from around 2 to 5 times the cost.
That said, there’s a big gap between successful implementation and wasted investments. 9 out of 10 data science projects fail to make it to production because of the risk of bad performance and lack of accessibility to critical data. Even for companies at the forefront of the field, like Microsoft and Tesla, machine learning projects present a catastrophic risk if mismanaged.
Apache Spark was created to attempt to bridge that gap, and while it hasn’t eliminated all the barriers for entry, it has allowed for the proliferation of ML data products to continue.
MLlib, Apache Spark’s general machine learning library, has algorithms for Supervised and Unsupervised ML which can scale out on a cluster for classification, clustering, and collaborative filtering. Some of these algorithms are also applicable to streaming data and can help provide sentiment analysis, customer segmentation, and predictive intelligence.
One of the main advantages of using Apache Spark for machine learning is its end-to-end capabilities. When building out an ML pipeline, the data engineer needs to cleanse, process, and transform data into the required format for machine learning. Then, data scientists can use MLlib or an external ML library, like TensorFlow or PyTorch, to apply ML algorithms and distribute the workload. Finally, analysts can use Spark for collecting metrics for performance scoring.
This helps data engineers and scientists solve and iterate their machine learning models faster because they can run an almost entirely end-to-end process on just Spark.
Data Meshing to democratize your data organization
Organizations are becoming more data-driven in their philosophy, but their data architecture is lagging behind. Organizations use centralized and highly siloed data architectures with owners who aren’t collaborating or communicating. Hence, why the overwhelming majority of nascent data products never make it to production.
To create scalable and profitable data products, it’s essential that every data practitioner across your organization is able to collaborate with each other and access the raw data they need.
A solution to this, as first defined by Zhamak Dehghani in 2019, is a data mesh. A data mesh is a data platform architecture that uses a domain-oriented, self-service design. In traditional monolithic data infrastructures, one centralized data lake handles the consumption, storage, transformation, and output of data. A data mesh supports distributed, domain-specific data consumers and views “data-as-a-product” with each domain owning their data pipelines. Each domain and its associated data assets are connected by a universal interoperability layer that applies syntax and data standards across the distributed domains.
This shift in architecture philosophy harkens to the transition software engineering went through from centralized applications to microservices. Data meshes enable greater autonomy and flexibility for data owners, greater data experimentation, and faster iterations while lessening the burden on data engineers to field the needs of every data consumer through a single pipeline.
What does this have to do with Apache Spark?
There are two of the main objections to implementing the data mesh model. One is the need for these domains to have the data engineering skills to ingest, clean, and aggregate data on their own. The other main concern of domain-oriented design is the duplication of efforts, redundant or competing infrastructure, and competing standards for data quality.
Apache Spark is great for solving that first problem. Organizations that have already built out a data mesh platform have used Databricks — which runs on Spark — to ingest data from their data-infra-as-a-platform layer to their domain-specific pipelines. Additionally, Spark is great for helping these self-service teams build out their own pipelines so they can test and iterate on their experiments without being blocked by engineering.
For many in the data industry, they find the idea of a data mesh interesting, but they worry that the unforeseen autonomy of a data mesh introduces new risks related to data health. Oftentimes, they decide this model isn’t right for their organization.
Data product quality metrics (collection and sharing)
Our product, Databand, plays very nicely with the idea of a data mesh. It unifies observability, so each domain can use the tools they need, but still be able to answer questions like:
Is my data accurate?
Is my data fresh?
Is my data complete?
What is the downstream impact of changes to pipelines and pipeline performance?
Being able to answer those questions across your entire tech stack — especially a decentralized one, would allow data organizations to really reap the benefits of this new paradigm.
Distribution left unchecked can spell problems for your data health
Apache Spark is all about distribution. Whether you’re distributing ownership of pipelines across your organization or you’re distributing a computational workload across your cluster, Apache Spark can help make that process more efficient.
That said, the need for observability we talked about in the last section, applies just as much to traditional uses for Spark. That’s because distribution adds additional steps to the end-to-end lifecycle.
Spark is dividing up the computation task, sending partitions to clusters, computing each micro-batch in separate clusters, combining those outcomes, and sending it to the next phase of the pipeline lifecycle. That’s a complex process. Each added step to your pipeline adds an opportunity for error and complications to troubleshooting and root cause analysis.
So while the efficiency of Spark is worth it for some organizations, it’s also important to have a system of observability set up to manage data health and governance as data passes through your pipelines.
After all, what’s the point of running a workload faster if the outcome is wrong?