With more and more companies depending on data for critical operations and decision-making, it’s more important than ever to build reliable data pipelines that deliver quality data in a predictable way. If pipelines constantly fail, finish late, or produce low-quality data, stakeholders lose trust. Given the lack of purpose-built solutions in the market, the data engineers who manage these processes have their work cut out for them.
A lot of teams don’t yet have established best practices to help them break down the big challenge of DataOps monitoring and tackle things incrementally. That’s why we want to share our perspective on how to build data trust brick by brick, tackling DataOps observability one tier at a time.
DataOps Observability Pyramid of Needs
Is any data flowing?
DataOps teams often take it on faith that errors will surface as alerts from their workflow management (orchestration) tools. When an error inevitably occurs, you want to know as quickly as possible. Most orchestrators, including Apache Airflow, have basic error alerting available.
When I was starting out with Airflow, I was integrating new data sources and improving existing data pipelines at the same time. One particularly memorable failure raised no error at all. I was running the Sequential Executor, so my Airflow scheduler got stuck on a task for a single DAG. It didn’t throw any error; it just sat in the running state, and I hadn’t built time-out alerting yet. Thankfully, Airflow provides task time-outs, both through Service Level Agreements (SLAs) and within task operators themselves. Relying solely on pass/fail metrics for your data pipelines, however, sets you up for errors that go unnoticed.
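Airflow exposes this through the `execution_timeout` and `sla` parameters on its operators, but the underlying idea is simple enough to sketch in plain Python. The command and timeout values below are purely illustrative:

```python
import subprocess

def run_task_with_timeout(cmd, timeout_seconds):
    """Run a shell command, failing loudly if it hangs past the timeout."""
    try:
        subprocess.run(cmd, timeout=timeout_seconds, check=True)
        return "success"
    except subprocess.TimeoutExpired:
        # In Airflow, execution_timeout has the same effect: the stuck
        # task is killed and marked failed, so alerting fires instead
        # of the task sitting in "running" forever.
        return "timed_out"
    except subprocess.CalledProcessError:
        return "failed"

# A command that sleeps longer than its timeout simulates a hung task.
print(run_task_with_timeout(["sleep", "5"], timeout_seconds=1))  # timed_out
```

The point is that a hung task becomes an explicit failure you can alert on, rather than a silent stall.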
Basic alerting on task states, sent through email or Slack, might be sufficient for teams that are just starting out. Straightforward pipelines have a smaller range of potential issues. However, as infrastructure becomes more complex, this kind of alerting alone does not deliver the trust needed for data engineering at scale.
Is data arriving in a usable window of time? (Are you meeting your SLAs?)
Once basic execution is covered, there’s the question of latency: late data is unusable for daily decision-making.
Taking the example of a product team analyzing product usage information…
Product teams today live in a world of continual release with changes often rolled out multiple times daily. These teams need to track information from those releases such as how many users are clicking into new features to help gauge success. Product usage data has to be delivered in a predictable, consistent window after release. If release decisions are made on a daily basis, the data that supports those decisions needs to be consistently delivered at the right time daily.
And when that information becomes reliable, more teams will use it:
- Product owners who make decisions based on data that reflects current reality.
- Data analysts who report on trends, patterns, and unexplained variances, and therefore need up-to-date data.
- Customer support representatives who can be warned about data integrity issues before those issues raise unnecessary tickets and increase their workload.
At this level of DataOps observability, we can rely largely on tools like Airflow and Prometheus to tell us that (1) our pipelines have succeeded and (2) they’ve succeeded in the proper window of time.
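That second check boils down to comparing when a run finished against when the data was due. Here is a minimal delivery-window (SLA) check in plain Python; the schedule and SLA values are hypothetical:

```python
from datetime import datetime, timedelta

def check_sla(scheduled: datetime, finished: datetime, sla: timedelta) -> bool:
    """Return True if the run finished within its SLA window."""
    return finished - scheduled <= sla

# A daily product-usage load scheduled at 02:00, due by 03:00 (1-hour SLA).
scheduled = datetime(2021, 6, 1, 2, 0)
on_time = check_sla(scheduled, datetime(2021, 6, 1, 2, 45), timedelta(hours=1))
late = check_sla(scheduled, datetime(2021, 6, 1, 3, 30), timedelta(hours=1))
print(on_time, late)  # True False
```

In Airflow, the `sla` parameter plus an `sla_miss_callback` on the DAG gives you this check without writing it yourself.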
Is the data coming through valid and complete? Are there errors in the data itself?
On top of knowing that our data has migrated successfully and on time, we also need to know what the data looks like. It is not enough to point to superficial success states and efficient runtimes without addressing the underlying business needs. Examining the contents of your data, and ensuring quality where possible, is how DataOps teams can more effectively address those needs.
Currently, there is growing interest in data quality among data engineers. However, with competing business demands, data quality can often take a backseat to adding a new data source or cutting compute costs. When a data engineer does find time to write data sanity checks, they’re ad hoc and often incomplete. A successful pipeline execution can still be a false positive, hiding incomplete or inconsistent data that causes issues down the road. These hidden errors diminish the potential of data-driven organizations.
This is where Databand really comes in. Databand removes the engineering expense of ad-hoc data sanity checks by telling you whether the data is complete, schemas are intact, and high-level metadata such as counts is reasonable. All of this is done through Databand’s API, which collects pipeline metadata (data about your data). Databand’s value proposition shines at this step of the pyramid because it can quickly collect and report metadata metrics that are normally hard to extract and organize.
Are there important changes in the data that I should know about?
Databand not only provides metrics on your last pipeline runs, but also the trends of those metrics. While production data often keeps the same general structure, the values and content of that data can change dramatically, and those changes can be very hard to detect. Back to our product usage example: it is obviously important to know as soon as possible when feature usage has spiked or precipitously dropped. That kind of insight is discoverable in the data, but hard to extract without easy-to-use, scalable tooling.
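To make the idea concrete, here is a rough sketch of a trend check using a simple z-score over recent history. The click counts and threshold are illustrative, not Databand’s actual method:

```python
from statistics import mean, stdev

def is_anomalous(history, latest, threshold=3.0):
    """Flag today's metric if it deviates more than `threshold`
    standard deviations from its recent history."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

# Daily clicks on a feature: stable for a while, then a sudden drop.
clicks = [1040, 990, 1010, 1005, 995, 1020, 980, 1000, 1015, 990]
print(is_anomalous(clicks, 1012))  # False: within normal variation
print(is_anomalous(clicks, 120))   # True: precipitous drop worth alerting
```

The same pattern applies to row counts, null rates, or any other pipeline metadata metric tracked over time.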
It used to be that only the biggest enterprises had access to enough compute resources to complete their data pipelines within set SLAs. With easy-to-onboard cloud services, this is now a reality for most data teams. However, those bills can add up and surprise you. Keeping tabs on spend and minimizing computational costs is critical for preserving budgets. Pipeline metadata trends, in this case, include where the computational costs are coming from and whether they are proportional to the task at hand. DataOps observability doesn’t just mean spitting out a bar chart of CPU usage, but rather digging into why computational costs have unexpected variances. Databand allows companies to see cost attribution across multiple attributes: functional code, CPU, time to run, I/O, and data read and write. Without DataOps observability, cost reduction is often a reactive effort.
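As a toy illustration of per-task cost attribution, assuming a flat hourly instance rate (the task names and rate below are hypothetical):

```python
def attribute_costs(task_runs, hourly_rate):
    """Roll up compute cost per pipeline task from run durations.
    hourly_rate is an assumed flat per-hour instance price."""
    costs = {}
    for run in task_runs:
        cost = run["duration_hours"] * hourly_rate
        costs[run["task"]] = costs.get(run["task"], 0.0) + cost
    return costs

runs = [
    {"task": "extract", "duration_hours": 0.5},
    {"task": "transform", "duration_hours": 2.0},
    {"task": "transform", "duration_hours": 2.5},
    {"task": "load", "duration_hours": 0.25},
]
print(attribute_costs(runs, hourly_rate=4.0))
# {'extract': 2.0, 'transform': 18.0, 'load': 1.0}
```

Even a rollup this simple shows where the money goes, which is the starting point for asking why.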
These changes in your data over time can often be invisible to a data engineer focused on yes-or-no pipeline executions and on integrating new sources into existing infrastructure. Databand provides a leading indicator for data that changes over time, so that the business can be proactive.
How do the issues below, combined with organizational policies, map to problems in how people actually work with the data?
Businesses that monitor data execution, latency, sanity, and trends can serve timely, quality data to their users. At this pyramid tier, both data users and providers are aware of any pipeline issues. Transparent data workflows allow data producers and consumers to speak the same language.
Another benefit of this transparency is that an organization can align engineering efforts with actual data usage. I worked at a mid-sized company that had 1300 analysts, most of whom were focused on revenue optimization. These analysts were a critical part of the business, and the business was justifiably thinking about how to optimize their analysis. The effort was twofold. First, train existing Python-savvy data analyst talent on Airflow so they could build their own pipelines. Second, build visibility into which database objects the analysts used most often. By building out this highly ad-hoc DataOps observability platform, they were attempting to capture the observability that Databand delivers out of the box.
Tracking which databases, tables, and columns are used most often allows a business’s engineers to optimize computation and storage costs with more precise data definitions. This is a real need for large organizations with a large number of data consumers such as data analysts, product managers, or customer technical support. By looking at actual usage of the data, the backend data infrastructure can be optimized for reality, rather than a diagram invented in a boardroom.
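One naive way to get that usage signal is to count table references in a log of SQL queries. The regex below is a sketch, not production-grade SQL parsing, and the queries are made up:

```python
import re
from collections import Counter

def table_usage(query_log):
    """Count how often each table is referenced in a log of SQL queries.
    A naive regex over FROM/JOIN clauses -- enough for a sketch."""
    counts = Counter()
    pattern = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
    for query in query_log:
        counts.update(t.lower() for t in pattern.findall(query))
    return counts

queries = [
    "SELECT user_id FROM events JOIN users ON events.user_id = users.id",
    "SELECT * FROM events",
    "SELECT revenue FROM finance.orders",
]
print(table_usage(queries).most_common(2))
# [('events', 2), ('users', 1)]
```

Hot tables surface immediately, and cold ones become candidates for cheaper storage or deprecation.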
In conclusion, the DataOps Observability Pyramid of Needs is about generating trust in your company’s data. It’s not enough to move data from one place to another successfully. DataOps teams must also ask deeper questions about their data: is it high quality, timely, and consistent with what we have seen in the past?
Although most data workflow tools will get a business as far as pipeline latency monitoring, Databand excels at full DataOps observability. Observability is deep monitoring of your data systems. It allows your DataOps team to build trust in their insights, because you can ensure that your data is consistent with itself and with reality.