Data stacks enable data integration throughout the entire data pipeline for trustworthy consumption. But how can companies ensure their data stack is both modern and reliable? In this blog post, we discuss these issues, as well as how GCP manages its data stack.
This blog post is based on a podcast where we hosted Sudhir Hasbe, senior director of product management for all data analytics services at Google Cloud.
What is a Data Stack?
In today’s world, we have the capacity and ability to track almost any piece of data. But finding relevant information in such huge volumes of data is not always easy. A data stack is a suite of tools for data integration. It transforms or loads data into a data warehouse, enables transformation through an engine running on top, and provides visibility for building applications.
As companies evolve, they often move towards a modern data stack built on a cloud-based warehouse. By supporting real-time events and decision-making, such a stack enables real-time, personalized experiences or predictions in SaaS applications.
How do Modern Data Stacks Support Real-Time Events?
To support real-time events, modern data stacks include the following components:
- A system for collecting events, like Kafka
- A processing system, like a streaming analytics system or a Spark streaming solution
- A serving layer
- A data lake or staging environment where raw data can be pushed and transformed before it gets loaded into a data warehouse
- A data warehouse for structured data that enables creating machine learning models
Such an environment is very complex and requires controls to ensure high data quality. Otherwise, bad data will be pulled into different systems and create a poor customer experience.
Therefore, it is important to ensure data quality is taken into consideration as early as the design phase of the data stack.
How Can You Ensure Data Quality in the Data Stack?
As companies rely on more and more data sources, managing them becomes more complex. Therefore, to ensure data quality throughout the entire pipeline, it becomes important to understand the source of issues, i.e., where they originally come from.
Data engineers who only look at the tables or dashboards downstream will waste a lot of time trying to find where issues are coming from. They might be able to catch a problem there, but by tracing it back to the source they will be able to debug and solve it much more quickly.
By shifting observability requirements left, data engineers can ensure data quality as early as ingestion and enable much higher delivery velocity.
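As a minimal sketch of what checking quality at ingestion could look like, the Python below validates each incoming event before it moves further down the pipeline. The field names (`event_id`, `user_id`, `timestamp`) and the `validate_event` and `ingest` helpers are hypothetical and chosen only for illustration; they are not part of any GCP or Databand API.

```python
from dataclasses import dataclass, field

# Hypothetical schema: required field name -> expected Python type.
REQUIRED_FIELDS = {"event_id": str, "user_id": str, "timestamp": float}


def validate_event(event: dict) -> list[str]:
    """Return the problems found in one event (an empty list means valid)."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in event or event[name] is None:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            problems.append(f"wrong type for {name}: {type(event[name]).__name__}")
    return problems


@dataclass
class IngestResult:
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)  # (event, problems) pairs


def ingest(events) -> IngestResult:
    """Split events into accepted and rejected before anything is loaded."""
    result = IngestResult()
    for event in events:
        problems = validate_event(event)
        if problems:
            # Keep the reasons next to the event so the issue is visible at the source.
            result.rejected.append((event, problems))
        else:
            result.accepted.append(event)
    return result
```

Rejecting a bad event here, together with the reason it failed, means the problem surfaces next to the source that produced it rather than in a dashboard several steps downstream.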
How Google (GCP) Ensures Delivery Velocity in the Data Stack
One of the main pain points data teams have when building and operationalizing a data stack is how to ensure delivery velocity.
This is true for both on-prem and cloud-native stacks, but becomes more pressing when companies are required to support real-time events both quickly and with high data quality.
To ensure delivery velocity at GCP, the team implements the following solutions.
At GCP, one of the most critical components in the data stack is end-to-end pipelines. Thanks to these pipelines, the team can ensure that real-time events from various sources are available in their data warehouse, BigQuery. To cover streaming analytics use cases, data is seamlessly connected to Bigtable and Dataflow.
At GCP, all storage capabilities are unified and consistent in a single data lake, enabling different types of processing on top of it. Thus, each persona can use their own skills and tools to consume it.
For example, a data engineer could use Java or Python, a data scientist could use notebooks or TensorFlow and an analyst could use other tools to analyze the data.
The Future of Data Management in the Data Stack
Here are three interesting predictions regarding the future of data stacks.
Leverage AI and ML to Improve Data Quality
One of the most interesting ideas when discussing the future of data management is about implementing AI and ML to improve data quality. Often, machine learning is used to improve business metrics.
At GCP, the team is implementing ML on BigQuery to identify failures. They find it is the only way to detect issues at scale. While this practice has not yet been widely adopted, it is expected to be in the future.
Less Manual, More Automation
Automation is predicted to be widely adopted as a means of managing the huge volumes of data of the future.
Today, manual management of legacy data platforms is complex, since it is based on components like Hadoop and Spark running on a data warehouse, with manually defined rules and integrations with Jira to enable more personas to run queries.
The result is often hundreds of tickets with false alarms. In the future, automation will cover:
- Metric collection
- Alerts (without having to manually define rules)
- New data sources and products
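One simple way to alert without hand-written rules is to derive the threshold from a metric's own history. The sketch below is a basic statistical check (mean plus or minus `k` standard deviations), not the ML approach the GCP team describes; the function name `is_anomalous`, the default `k = 3.0`, and the sample row counts are assumptions made for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, k: float = 3.0) -> bool:
    """Flag `latest` if it is more than k standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to learn a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is perfectly flat, so any change at all is unusual.
        return latest != mu
    return abs(latest - mu) > k * sigma


# Hypothetical daily row counts for one table.
row_counts = [1000, 1020, 980, 1010, 990]
print(is_anomalous(row_counts, 1005))  # a typical day
print(is_anomalous(row_counts, 120))   # a sudden drop in volume
```

Because the threshold comes from the data itself, the same check can be applied to a newly added source without anyone defining a rule for it, although a fixed `k` can still produce false alarms on seasonal or bursty metrics.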
Automation will also include automatic data discovery through automated cataloging of all information, including centralized metadata, metadata lineage, and automated lineage tracking across all systems.
This will reduce the number of errors and streamline the process.
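To illustrate why lineage helps, the sketch below models lineage as a simple mapping from each dataset to the datasets it is built from, then walks upstream to find the original sources behind a downstream asset. The dataset names and the `upstream_sources` helper are hypothetical; a real catalog would populate this graph automatically.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets it is built from.
LINEAGE = {
    "revenue_dashboard": ["orders_clean"],
    "orders_clean": ["orders_raw", "customers_raw"],
    "orders_raw": [],
    "customers_raw": [],
}


def upstream_sources(dataset: str, lineage: dict[str, list[str]]) -> set[str]:
    """Return the root sources (datasets with no parents) that feed `dataset`."""
    roots = set()
    seen = {dataset}
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        parents = lineage.get(current, [])
        if not parents:
            roots.add(current)  # nothing upstream, so this is an original source
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return roots


print(upstream_sources("revenue_dashboard", LINEAGE))
```

With a graph like this available, an engineer looking at a broken dashboard can jump straight to the candidate sources instead of inspecting every intermediate table by hand.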
Persona Changes are Coming
Finally, we predict that the personas who process and consume the data will change as well. Data consumption will not be limited to data engineers, data scientists and analysts, but will be open to all business employees.
As a result, in the future, storage will just be a price-performance discussion for customers rather than capability differentiation.
The future sounds bright! Databand provides data engineers with observability into data sources straight from the source.