Apache Spark use cases for DataOps in 2021
Apache Spark is a powerful data processing solution, and its use cases are nearly limitless. Over the last decade, it has become core to big data architecture. Expanding your headcount and your team’s knowledge of Spark is a necessity as data organizations adapt to market needs.
As the data industry matures, so do its tools. The meteoric rise in popularity of Databricks over its open-source origins clearly shows the overarching trend in the industry: the need for Apache Spark is growing, and the teams that use it are becoming more sophisticated.
The use case for Apache Spark is rooted in Big Data
For organizations that create and sell data products, fast data processing is a necessity. Their bottom line depends on it.
Science Focus estimates Google, Facebook, Microsoft, and Amazon store at least 1,200 petabytes of information. The amount of data they have collected is unthinkable, and for them, it’s mostly inaccessible. Even running on state-of-the-art tools, their data infrastructure cannot process and make meaningful use of all of the stored information.
That’s Big Data. Being able to keep pace with the processing that all of this collected data requires, and doing so quickly and accurately, means these companies can make their data products (e.g. platforms, algorithms, tools, widgets, etc.) more valuable to their users.
That said, you don’t necessarily need millions of users to need Spark. You just need to work with large datasets. Smaller data-driven organizations with high standards for their data quality SLAs also use Spark to deliver more accurate data, faster. This data then powers their machine learning products, their analytical products, and other data-driven products.
Spark is a powerful solution for some organizations yet overkill for others. For those organizations, Apache Spark use cases are limited: the volume of data they process isn’t large enough, and the deadlines by which data must be delivered aren’t tight enough, to warrant the cost of computation.
When a DataOps team isn’t handling Big Data, it can’t justify building out a dedicated engineering team, and it can’t justify using specialized tools like Spark. The added complexity and the extra connections to your infrastructure just leave more room for error. In this situation, every time data is extracted, passed to a cluster, computed, combined, and stored, you open your pipeline to another opportunity for failures and bugs that are hard to catch.
How does Apache Spark work?
Let’s go over the basics of Spark before we start talking about use cases. And to do that, we should start with how Apache Spark came to be.
In the early 2000s, the amount of data being created started outpacing the volume of data that could be processed. To unplug this bottleneck, Hadoop was created based on the MapReduce design pattern. At a very high level, this design pattern divides a dataset into small pieces and “maps” them to worker nodes (with the pieces stored on disk in the Hadoop Distributed File System, or HDFS) to process the data in batches, and then “reduces” the partial results into an “overall outcome,” so to speak.
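The map-and-reduce flow can be sketched in a few lines of plain Python. This is a simplified, single-machine illustration of the design pattern (a toy word count), not the Hadoop API itself:

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    # "Map": each worker turns its piece of the dataset into partial word counts.
    return Counter(chunk.split())

def reduce_phase(left, right):
    # "Reduce": partial results are merged into an overall outcome.
    return left + right

# The dataset is divided into small pieces, one per worker node.
chunks = ["spark makes data fast", "data makes spark useful", "fast data"]
partials = [map_phase(c) for c in chunks]  # done in parallel on a real cluster
totals = reduce(reduce_phase, partials)
print(totals["data"])  # → 3, one occurrence contributed by each chunk
```

On Hadoop, the map and reduce phases run on separate machines, and the intermediate results are written to disk between phases, which is exactly the bottleneck Spark later removed.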
This worked well for a time, but as the volumes of data and the demand for greater processing speeds grew, the need for a new solution grew. Enter: Apache Spark.
Apache Spark followed the same principle of distributed processing but achieved it in a different way. Spark jobs hold data partitions in memory (RAM) rather than on HDFS disks, which means a job doesn’t have to read and write partitions to disk at every step. This made Apache Spark up to 100x faster than Hadoop and brought data teams closer to real-time data processing.
There are more nuances to what makes Spark so useful, but let’s not lose focus. We’re here to learn about Apache Spark use cases for data products.
Apache Spark use cases for DataOps Initiatives
You can do a lot with Spark, but in this article, we’re going to talk about two use cases that are shaping the industry today:
- Productizing machine learning models
- Decentralized data organizations and data meshing
Let’s talk about each one of these use cases, and why they matter for data products in more detail.
Productizing ML models
Machine learning programs are booming as organizations begin an investment arms race, looking for a way to get an edge in their market.
According to Market Research Future, the global machine learning market is projected to grow from $7.3B in 2020 to $30.6B in 2024. And it’s easy to see why. If implemented correctly, the ROI of a high-performing ML product can range from around 2 to 5 times the cost.
That said, there’s a big gap between successful implementation and wasted investment. 9 out of 10 data science projects fail to make it to production because of the risk of bad performance and lack of access to critical data. Even for companies at the forefront of the field, like Microsoft and Tesla, machine learning projects present a catastrophic risk if mismanaged.
Apache Spark was created to help bridge that gap, and while it hasn’t eliminated every barrier to entry, it has allowed the proliferation of ML data products to continue.
Spark provides a general machine learning library that is designed for simplicity, scalability, and easy integration with other tools.
MLlib, Apache Spark’s general machine learning library, has algorithms for Supervised and Unsupervised ML which can scale out on a cluster for classification, clustering, and collaborative filtering. Some of these algorithms are also applicable to streaming data and can help provide sentiment analysis, customer segmentation, and predictive intelligence.
One of the main advantages of using Apache Spark for machine learning is its end-to-end capabilities. When building out an ML pipeline, the data engineer needs to cleanse, process, and transform data into the required format for machine learning. Then, data scientists can use MLlib or an external ML library, like TensorFlow or PyTorch, to apply ML algorithms and distribute the workload. Finally, analysts can use Spark for collecting metrics for performance scoring.
This helps data engineers and scientists solve and iterate their machine learning models faster because they can run an almost entirely end-to-end process on just Spark.
Data Meshing to democratize your data organization
Organizations are becoming more data-driven in their philosophy, but their data architecture is lagging behind. Many still use centralized and highly siloed data architectures, with owners who aren’t collaborating or communicating. That’s why the overwhelming majority of nascent data products never make it to production.
To create scalable and profitable data products, it’s essential that data practitioners across your organization can collaborate with each other and access the raw data they need.
A solution to this, as first defined by Zhamak Dehghani in 2019, is a data mesh. A data mesh is a data platform architecture that uses a domain-oriented, self-service design. In traditional monolithic data infrastructures, one centralized data lake handles the consumption, storage, transformation, and output of data. A data mesh supports distributed, domain-specific data consumers and views “data-as-a-product” with each domain owning their data pipelines. Each domain and its associated data assets are connected by a universal interoperability layer that applies syntax and data standards across the distributed domains.
This shift in architecture philosophy mirrors the transition software engineering went through from centralized applications to microservices. Data meshes enable greater autonomy and flexibility for data owners, greater data experimentation, and faster iterations, while lessening the burden on data engineers to field the needs of every data consumer through a single pipeline.
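To make the “universal interoperability layer” idea concrete, here is a small, purely illustrative Python sketch. The domain names, record fields, and `validate` helper are all hypothetical; the point is that each domain publishes its data product independently but against one shared standard:

```python
# Each domain owns its own pipeline, but every published record must conform
# to one shared, organization-wide standard -- the interoperability layer.
SHARED_SCHEMA = {"record_id": str, "domain": str, "updated_at": str}

def validate(record: dict) -> bool:
    """Check a domain's record against the shared schema standard."""
    return all(
        field in record and isinstance(record[field], expected)
        for field, expected in SHARED_SCHEMA.items()
    )

# Two independent domains publish data products from their own pipelines...
orders = {"record_id": "o-1", "domain": "orders", "updated_at": "2021-06-01"}
marketing = {"record_id": 42, "domain": "marketing", "updated_at": "2021-06-01"}

print(validate(orders))     # True: conforms to the shared standard
print(validate(marketing))  # False: record_id is not a string
```

In a real data mesh, this layer would enforce far more than field types (identifiers, addressability, access policy), but the shape is the same: distributed ownership, centralized standards.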
What does this have to do with Apache Spark?
There are two main objections to implementing the data mesh model. One is that these domains need the data engineering skills to ingest, clean, and aggregate data on their own. The other is that domain-oriented design risks duplicated effort, redundant or competing infrastructure, and competing standards for data quality.
Apache Spark is great for solving that first problem. Organizations that have already built out a data mesh platform have used Databricks — which runs on Spark — to ingest data from their data-infra-as-a-platform layer to their domain-specific pipelines. Additionally, Spark is great for helping these self-service teams build out their own pipelines so they can test and iterate on their experiments without being blocked by engineering.
Many in the data industry find the idea of a data mesh interesting, but they worry that the added autonomy of a data mesh introduces new risks to data health. Oftentimes, they decide this model isn’t right for their organization.
It’s not an unfounded fear. A data mesh needs a system for conducting scalable, self-serve observability to go along with it. According to Dehghani, some of those capabilities include:
- Data product versioning
- Data product schema
- Unified data logging
- Data product lineage
- Data product monitoring/alerting/logging
- Data product quality metrics (collection and sharing)
Our product, Databand, plays very nicely with the idea of a data mesh. It unifies observability, so each domain can use the tools they need, but still be able to answer questions like:
- Is my data accurate?
- Is my data fresh?
- Is my data complete?
- What is the downstream impact of changes to pipelines and pipeline performance?
Being able to answer those questions across your entire tech stack, especially a decentralized one, would allow data organizations to really reap the benefits of this new paradigm.
Distribution left unchecked can spell problems for your data health
Apache Spark is all about distribution. Whether you’re distributing ownership of pipelines across your organization or you’re distributing a computational workload across your cluster, Apache Spark can help make that process more efficient.
That said, the need for observability we talked about in the last section applies just as much to traditional uses of Spark. That’s because distribution adds extra steps to the end-to-end lifecycle.
Spark divides up the computation task, sends partitions to the cluster, computes each micro-batch on separate executors, combines those outcomes, and sends them to the next phase of the pipeline lifecycle. That’s a complex process. Each added step in your pipeline adds an opportunity for error and complicates troubleshooting and root cause analysis.
So while the efficiency of Spark is worth it for some organizations, it’s also important to have a system of observability set up to manage data health and governance as data passes through your pipelines.
After all, what’s the point of running a workload faster if the outcome is wrong?
Want to reduce the risk of your expensive Spark jobs failing?
Databand.ai centralizes your end-to-end pipeline metadata in one place, so you can find and fix data health issues fast.