Data Pipeline Performance Monitoring — What it is and Why it’s Here
We are now in the “slope of enlightenment” for data products. The market's maturity shows in the way organizations are shifting their hiring from large research teams toward teams that manage production activities.
After a wave of building out data science teams to validate new machine learning, predictive, and analytical products, companies are rushing to hire data engineers who can build and scale the production systems that actually bring those products to market.
Data engineers manage both ends of the workflow around data scientists: (1) the systems that make sure data science teams have consistent, reliable data so that they can scale up their ML development activities, and (2) the systems that take data science models and run them in production.
You can see this demand shift across job listing platforms. According to research done by Datanami on the topic:
A quick search for data engineering job listings on Glassdoor found 107,730 data engineering jobs around the country. Most of those job listings had a starting pay in excess of $100,000 — and a few were over $150,000. By contrast, a search for data scientist yielded 21,760 jobs — many of which were also high paying. On the Indeed job board, there were 98,218 data engineer jobs posted, compared to 24,695 for data scientists — nearly a four-to-one ratio of data engineering jobs to data scientist jobs
Managing and Scaling Data Pipelines
There’s a growing category of tools that help data engineers build production data pipelines and orchestrated workflows. But most data engineering organizations still struggle with DataOps and MLOps (check out the countless blog posts on ML struggles in production).
From our vantage point, we see the root of the problem as poor visibility into data pipelines. Pipelines are the backbone behind data science activities. They generate the data sets that data scientists use to produce and test models. And pipelines are also the backbone of the production systems that run models — responsible for delivering clean data into models, as well as running scheduled processes for retraining and batch scoring.
So understanding the pipeline is a prerequisite for guaranteeing data quality and setting up standards for testing and reliable production operations.
But when pipeline systems start to scale, visibility into what’s happening becomes a big problem. Data pipelines run with a combination of powerful but complex tools like Airflow, Spark, Kubernetes, and various databases. The diversity of tools is necessary for many teams because it provides choice, extensibility, and best-of-breed platforms at each layer of the stack. But combining all these engines, on top of different cloud services, languages, and libraries, makes gaining full visibility impossible, or at the very least a huge undertaking.
What’s Available Today?
Most data engineering teams today use standard application performance monitoring (APM) tools to monitor their data stack. The APM market has solutions that are extensible enough that teams can define the logic they need to make them usable. And at a certain scale this can work very well.
The Wikipedia definition of application performance management:
“In the fields of information technology and systems management, application performance management is the monitoring and management of performance and availability of software applications. APM strives to detect and diagnose complex application performance problems to maintain an expected level of service.”
However, APM tools were built for software engineering and DevOps, not data engineering. And at a certain point the functionality gaps become apparent. These solutions do not contextualize metrics or events in ways that data engineers can easily discover, and they offer no easy way to write logic that “understands” the nuances of how data pipelines operate (like the totally normal behavior of having many successive failures before a batch process runs successfully). As a result, many data engineering teams end up flooded with junk alerts and unreadable charts.
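To make the retry point concrete, here is a minimal sketch of alert logic that tolerates failed attempts and only flags a batch run once its retries are used up without a success. The `Attempt` record, the `runs_to_alert` function, and the `max_tries` parameter are hypothetical names chosen for illustration; they are not part of any APM product or of Databand.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Attempt:
    """One execution attempt of a batch run (hypothetical record shape)."""
    run_id: str
    try_number: int
    succeeded: bool


def runs_to_alert(attempts: List[Attempt], max_tries: int) -> List[str]:
    """Return the run_ids that exhausted their retries without succeeding.

    A run that failed a few times and then succeeded is treated as normal,
    and a run that still has retries left is not flagged yet.
    """
    by_run: Dict[str, List[Attempt]] = {}
    for attempt in attempts:
        by_run.setdefault(attempt.run_id, []).append(attempt)

    alerts = []
    for run_id, tries in by_run.items():
        if any(t.succeeded for t in tries):
            continue  # eventual success: the earlier failures are noise
        if max(t.try_number for t in tries) >= max_tries:
            alerts.append(run_id)  # out of retries and never succeeded
    return alerts


history = [
    Attempt("load_orders", 1, False),
    Attempt("load_orders", 2, False),
    Attempt("load_orders", 3, True),    # recovered on the last try
    Attempt("score_batch", 1, False),
    Attempt("score_batch", 2, False),
    Attempt("score_batch", 3, False),   # retries exhausted
    Attempt("retrain_model", 1, False),  # still has retries left
]
flagged = runs_to_alert(history, max_tries=3)
```

A rule that fires on every failed attempt would raise six alerts for this history; grouping attempts by run raises one.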
The root of the issue is that APM tools were not built to collect and handle the right data in the right way for data engineers. The best APM tools in use today follow the Observability model for providing deep understanding of a technology system. This involves bringing together three monitoring components to spot and fix failures: metrics, logs, and traces.
For data engineering, observability is different. There are additional dimensions you need to monitor beyond the standard set, especially those related to your underlying data flows (are there data quality issues?), the schedules on which batch processes execute (are pipelines running at the right time?), and the internal and external dependencies that link all your pipelines together (where are issues coming from upstream, and how will they propagate downstream?).
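As a rough illustration of those three extra dimensions, the sketch below shows one simple check for each: a null-rate check for data quality, a lateness check for schedules, and a graph walk for dependency impact. All function names, the row format, and the example pipeline names are assumptions made for this example; a real stack would pull these signals from its scheduler and data stores.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Set


def null_fraction(rows: List[dict], column: str) -> float:
    """Data quality: share of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def is_late(expected_by: datetime, finished_at: Optional[datetime],
            now: datetime, grace: timedelta) -> bool:
    """Schedule: did the run miss its deadline (plus a grace period)?"""
    deadline = expected_by + grace
    if finished_at is None:
        return now > deadline  # still not finished and already overdue
    return finished_at > deadline


def downstream_of(failed: str, upstreams: Dict[str, List[str]]) -> Set[str]:
    """Dependencies: every pipeline that directly or transitively reads
    from `failed`. `upstreams` maps each pipeline to its direct inputs."""
    affected: Set[str] = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for pipeline, inputs in upstreams.items():
            if current in inputs and pipeline not in affected:
                affected.add(pipeline)
                frontier.append(pipeline)
    return affected


rows = [{"id": 1}, {"id": None}, {"id": 3}, {}]
quality = null_fraction(rows, "id")

late = is_late(
    expected_by=datetime(2020, 1, 1, 6, 0),
    finished_at=None,
    now=datetime(2020, 1, 1, 7, 0),
    grace=timedelta(minutes=30),
)

graph = {
    "clean": ["ingest"],
    "features": ["clean"],
    "report": ["features", "billing"],
    "billing": [],
}
impacted = downstream_of("ingest", graph)
```

Each check is deliberately small. The point is that none of them maps onto a generic latency or error-rate metric, which is why they tend to be bolted onto general-purpose monitoring by hand.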
The Good News
The fact that data engineering is taking off, and that teams are struggling with monitoring their data stacks, is a signal that the market is maturing. That’s exciting because it foreshadows a new wave of data product innovation that will continue to delight consumers in realms like healthcare, financial services, and retail technology.
Data engineering teams are at the forefront of driving this market innovation and maturity. While there are currently huge pain points around guaranteeing seamless operations, many of the leading teams are investing in extending their monitoring tools to handle these new types of challenges.
Databand is a solution for data engineering observability. Databand fits natively in the data engineering workflow and tech stack, providing you deep visibility into your pipelines so that you can quickly detect, diagnose, and fix problems.