In the past decade, DevOps engineers started noticing that they’d get frantic calls from their CEO if the application ever went down. This is how they knew they had become vital. Nowadays, many DataOps engineers are receiving that same honor—and the PagerDuty alerts to match.
This reliance on DataOps and data monitoring is going to increase. With the rise of analytics (now a $105 billion market), machine learning ($8 billion market), and the importance of data to the functioning of all software as a service ($157 billion market), data powers the internet. Pipelines power that data. Yet too few engineers are building those pipelines with data monitoring in mind. When things go down, many DataOps teams are left fumbling in the dark. Even when things aren’t down, they live in perpetual job-fail anxiety.
In this guide, we explore why data monitoring is vital to DataOps, explain why it becomes such an issue with large-scale or complex pipelines, and share a handful of best practices.
What is data quality monitoring?
Data quality monitoring is the ongoing process of measuring your data’s fitness for use. It isn’t taking action to address the issues it finds; that’s beyond the scope of monitoring. Monitoring is simply knowing, in great detail, what’s happening within your data pipelines.
Monitoring for data quality is important because issues with data will propagate through the pipeline and the negative effects can cascade. If the source data is tainted, everything that follows will be too. Without the right tools, it’s very difficult to identify the source of the corruption and trace any upstream or downstream processes that have been affected.
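To make that cascade concrete, here is a minimal sketch of tracing which downstream tasks are affected once a source is known to be tainted. The task names and the `DOWNSTREAM` mapping are hypothetical, invented for illustration; a real pipeline would pull this graph from its orchestrator or metadata store.

```python
from collections import deque

# Hypothetical pipeline graph: each task maps to the tasks that read its output.
DOWNSTREAM = {
    "raw_orders": ["clean_orders"],
    "clean_orders": ["daily_revenue", "customer_features"],
    "daily_revenue": ["exec_dashboard"],
    "customer_features": ["churn_model"],
    "exec_dashboard": [],
    "churn_model": [],
}


def affected_by(task, graph):
    """Return every task downstream of `task`, in breadth-first order."""
    seen = []
    queue = deque(graph.get(task, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(graph.get(current, []))
    return seen


print(affected_by("clean_orders", DOWNSTREAM))
# ['daily_revenue', 'customer_features', 'exec_dashboard', 'churn_model']
```

Without a recorded graph like this, the same question has to be answered by reading code and asking around, which is exactly the difficulty described above.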
Data monitoring is one part of the equation
The terms “monitoring” and “observability” are often used interchangeably, but there’s a distinction: Monitoring is just one piece of observability. Observability is the umbrella term for all the actions around understanding and improving the health of your pipeline, such as tracking, alerting, and recommendations. Yet the monitoring part, and its accuracy, is crucial.
Without the awareness that monitoring provides, you can’t take action to influence data quality. Not in any scientific way, at least. Troubleshooting is tough, because a pipeline without an integrated monitoring tool is a black box: you know what goes in and what comes out, but that’s it. Data monitoring software is what detects the errors or strange transformations and tells you where they’re occurring.
For a data monitoring system to be useful, it must be:
- Granular—it must indicate specifically where an issue is occurring, and with what code.
- Persistent—you must monitor things in a time-series, otherwise you can’t understand where data sets or errors began (lineage).
- Automatic—the more freedom you have to set thresholds and use machine learning and anomaly detection, the less active attention it requires.
- Ubiquitous—you can’t measure just one part of the pipeline.
- Timely—because, what good are late alerts?
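As a rough illustration of the “persistent” and “automatic” properties, the sketch below keeps a history of a metric and flags a new value that strays too far from it. The metric (daily row counts), the sample numbers, and the z-score threshold of 3 are all assumptions made for the example, and a simple z-score stands in for whatever anomaly detection a real monitoring tool would apply.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it sits more than `z_threshold` standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # history is constant, so any change is notable
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical daily row counts for one table, oldest first.
row_counts = [10_120, 9_980, 10_050, 10_210, 9_940, 10_075, 10_130]

print(is_anomalous(row_counts, 10_090))  # False: in line with recent days
print(is_anomalous(row_counts, 4_300))   # True: less than half the usual volume
```

The point of the time series is that the threshold adapts to the data: nobody has to hand-pick “10,000 rows” as the expected value, and the same function can watch any numeric metric.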
If you’re planning to start monitoring pipelines and are considering using your existing application performance management (APM) tool, think again. Pipelines are a very different beast, and an APM tool won’t give you the granularity of data or the metrics you need to understand all four factors of data health. You will be able to extract duration, uptime, and some logging information, but you’ll be missing the actionable information you need, such as data schema changes, granular task information, query costs, and other pipeline-specific metrics.
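Schema changes are a good example of a signal that duration and uptime will never surface. A minimal sketch, assuming a schema can be represented as a plain mapping of column name to type name (the column names here are hypothetical):

```python
def diff_schema(previous, current):
    """Compare two {column: type} mappings and report what changed."""
    added = {col: typ for col, typ in current.items() if col not in previous}
    removed = {col: typ for col, typ in previous.items() if col not in current}
    retyped = {
        col: (previous[col], current[col])
        for col in previous
        if col in current and previous[col] != current[col]
    }
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical schemas captured on two consecutive runs.
yesterday = {"order_id": "int", "amount": "float", "coupon": "string"}
today = {"order_id": "int", "amount": "string", "currency": "string"}

print(diff_schema(yesterday, today))
# {'added': {'currency': 'string'}, 'removed': {'coupon': 'string'},
#  'retyped': {'amount': ('float', 'string')}}
```

A job that reads this table might still finish on time and report success, which is why an APM view of the same run would look perfectly healthy.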
The challenge with large-scale data pipelines
More complicated transformations, more operators touching the pipelines, and little coordination between those operators beget vastly more complex DataOps systems. That’s where we are today: too many cooks and no prix fixe menu for what’s allowed and what isn’t.
Among the greatest challenges is how many non-technical participants now rely on data pipelines to do their jobs. Demands come raining in from people on the business side (executives, analysts, and data scientists) who, through no fault of their own, don’t understand the data pipeline architecture. They don’t know the quirks of how the data is moved and stored. Yet they’re the ones deciding what must ultimately be delivered.
This is a big reason 9 out of every 10 data science projects fail to make it into production. The teams involved lack a common language and fail to involve the data engineer early on, in the requirements phase, when fixes are still cheap.
It’s a similar story for machine learning pipelines: Running and maintaining the model both become more difficult when more people are involved without a common language or enough inter-group processes.
All this makes a case for data pipelines that are modular, more easily debugged, and well-monitored. Hence, data monitoring software.
Data monitoring best practices
To explain the order of operations you should go through to monitor your data pipeline, we’ve created what we call the data observability pyramid of needs, pictured. It’s your first data monitoring best practice.
The pyramid begins at the bottom with the physical layer (are the pipelines executing? did the Spark job run?) and proceeds up into the increasingly theoretical realm. More advanced teams tend to be dealing with the higher-order issues at the top.
To put this pyramid into practice, your data observability system should be checking for these issues in this order:
1. Is data flowing?
2. Is the data arriving in a useful window of time?
3. Is the data complete? Accurate? Fit?
4. How has it been changing over time? (Also called data lineage)
5. Are the people who need the data actually getting it?
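The ordering matters because a failure at a lower level makes the higher-level questions meaningless. Here is a minimal sketch of the first three checks run bottom-up, stopping at the first failure. The function name, the one-hour delay budget, and the sample rows are assumptions for illustration; checks 4 and 5 are left out because they need run history and consumer information that a single batch doesn’t carry.

```python
from datetime import datetime, timedelta, timezone


def first_failing_check(rows, arrived_at, now, max_delay, required_fields):
    """Walk the pyramid bottom-up; return the first failed level, or None."""
    # 1. Is data flowing?
    if not rows:
        return "flow"
    # 2. Did it arrive in a useful window of time?
    if now - arrived_at > max_delay:
        return "freshness"
    # 3. Is it complete? (here: no required field is missing or null)
    for row in rows:
        if any(row.get(field) is None for field in required_fields):
            return "completeness"
    return None


# Hypothetical batch that arrived 30 minutes ago with one null amount.
now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
rows = [{"order_id": 1, "amount": 19.99}, {"order_id": 2, "amount": None}]

print(first_failing_check(
    rows,
    arrived_at=now - timedelta(minutes=30),
    now=now,
    max_delay=timedelta(hours=1),
    required_fields=["order_id", "amount"],
))
# completeness
```

Returning only the lowest failing level keeps alerts focused: if nothing arrived at all, there is no value in also reporting that the data was late and incomplete.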
To manage all of this automatically, there are of course data monitoring tools.
Data monitoring tools
Like infrastructure as a service in DevOps, monitoring tools are better bought than built. A lot goes into data monitoring, and having a data monitoring system that someone else maintains and improves can save a great deal of time and free you to actually manage the pipeline.
Monitoring is most often one feature of a data monitoring service or platform. These data monitoring apps tend to also provide tools for awareness and remediation, such as tracking, alerts, and machine learning for anomaly detection.
Which is the best data monitoring app? We’re biased, but for data engineers, Databand.ai is certainly on the list. We built it to provide full observability for data and machine learning pipelines for all the reasons covered in this article. When your CEO suddenly cares to know whether the pipeline is up, it pays to monitor it.
Databand.ai is a unified data observability platform built for data engineers. Databand.ai centralizes your pipeline metadata so you can get end-to-end observability into your data pipelines, identify the root cause of health issues quickly, and fix the problem fast. To learn more about Databand and how our platform helps data engineers with their data pipelines, request a demo or sign up for a free trial!