
Monitoring for data quality issues as early as ingestion: here’s why

Databand
2022-01-24 15:17:13

Maintaining data quality is challenging. Data is often unreliable and inconsistent, especially when it flows from multiple data sources. To deal with quality issues and prevent them from impacting your product and decision-making, you need to monitor your data flows. Monitoring helps identify schema changes, discover missing data, catch unusual levels of null records, fix failed pipelines, and more. In this blog post, we will explain why we recommend monitoring data starting at the source (ingestion), list three potential use cases for ingestion monitoring, and finish off with three best practices for data engineers to get started.

This blog post is based on the first episode of our podcast, “Why Data Quality Begins at the Source”, which you can listen to below or here.

Where Should You Monitor Data Quality?

Data quality monitoring is a fairly new practice, and different tools offer different monitoring capabilities across the data pipeline. While some tools monitor data quality where the data rests, i.e., at the data warehouse, at Databand we think it's critical to monitor data quality as early as ingestion, not just at the warehouse level. Let's look at a few reasons why.

4 Reasons to Monitor Data Quality Early in the Data Pipeline

1. Higher Probability of Identifying Issues

Erroneous or abnormal data affects all the other data and analytics downstream. Once corrupt data has been ingested and flows to the data lake/warehouse, it might already be mixed up with healthy data and used in analyses. This makes it much more difficult to identify the errors and their source, because the dirty data can be “washed out” in the higher volumes of data sitting at rest.

In fact, the ability to identify issues depends on an engineer or analyst knowing what the expected data results should be, recognizing a problematic anomaly, and diagnosing that the anomaly is the result of a data issue rather than a business change. When the corrupt data is such a small percentage of the entire data lake, this becomes even harder.

Why take the chance of overlooking errors and problems that could impact the product, your users, and decision-making? By monitoring early in the pipeline, many issues can be avoided because you are looking at a more targeted sample of data and can therefore build a more precise baseline for when data looks unusual.
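As a rough illustration (not Databand's implementation), a per-source check at ingestion might compare each new batch against a baseline built from recent batches. The metrics, thresholds, and history table below are assumptions for the example:

import pandas as pd

def check_batch_against_baseline(batch: pd.DataFrame, history: pd.DataFrame, sigma: float = 3.0) -> list:
    """Flag a freshly ingested batch whose volume or null rate deviates from recent history.

    `history` is assumed to hold one row per previous batch with columns
    `row_count` and `null_rate`; the 3-sigma threshold is illustrative.
    """
    issues = []

    # Volume check: is this batch unusually small or large for the source?
    mean, std = history["row_count"].mean(), history["row_count"].std()
    if std > 0 and abs(len(batch) - mean) > sigma * std:
        issues.append(f"row count {len(batch)} deviates from baseline of ~{mean:.0f}")

    # Completeness check: unusual levels of null records.
    null_rate = batch.isna().mean().mean()
    null_baseline = history["null_rate"].mean() + sigma * history["null_rate"].std()
    if null_rate > null_baseline:
        issues.append(f"null rate {null_rate:.2%} exceeds baseline threshold {null_baseline:.2%}")

    return issues

Because the check runs per source and per batch, the baseline stays narrow and anomalies stand out rather than being washed out in the larger volumes at rest.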

2. Creating Confidence in the Warehouse

Analysts and additional stakeholders rely on the warehouse to make a wide variety of business and product decisions. Trusting warehouse data is essential for business agility and for making the right decisions. If data in the warehouse is “known” to have issues, stakeholders will not use it or trust it. This means the organization is not leveraging data to the full extent.

If the data warehouse is the heart of the customer-facing product, i.e., the product relies almost entirely on data, then corrupt data could jeopardize the entire product's adoption in the market.

By quality-assuring the data before it arrives at the warehouse and the main analytical system, teams can improve confidence in that “trusted layer.”

3. Ability to Fix Issues Faster

By identifying data issues faster, data engineers have more time to react. They can identify causality and lineage, and fix the data or source to prevent any harmful impact that corrupt data could have. Trying to identify and fix full-blown issues in the product or after decision-making is much harder to do.

4. Enabling Data Source Governance

By analyzing and monitoring data at its point of inception, data engineers can identify a malfunctioning source and act to fix it. This provides better governance over sources, both in real time and in the long run.

When Should You Monitor Data Quality from Ingestion?

We recommend monitoring data quality across the pipeline, from ingestion through to data at rest. However, you need to start somewhere… Here are the top three use cases for prioritizing monitoring at ingestion:

  • Frequent Source Changes – When your business relies on data sources or APIs where data structure frequently changes, it is recommended to continuously monitor them. For example, in the case of a transportation application that pulls data from the constantly changing APIs of location data, user tracking information, etc.
  • Multiple External Data Sources – When your business’s output depends on analyzing data from dozens or hundreds of sources. For example, a real-estate app that provides listings based on data from offices, municipalities, schools, etc.  
  • Data-Driven Products – When your product is based on data and each data source has a direct impact on the product. For example, navigation applications that pull data about roads, weather, transportation, etc.

Getting Started with Data Quality Monitoring

As mentioned before, data quality monitoring is a relatively new practice. Therefore, it makes sense to implement it gradually. Here are three recommended best practices:

1. Determine Quality Layers

Data changes across the pipeline, and so does its quality. Divide your data pipeline into various steps, e.g., the warehouse layer, the transformation layer, and the ingestion layer. Understand that data quality means different things at each of these stages and prioritize the layers that have the most impact on your business.

2. Monitor Different Quality Depths

When monitoring data, there are different quality aspects to review. Start by reviewing metadata: ensure the data structure is correct and that all the data arrived. Once metadata has been verified, move on to address explicit business-related aspects of the data, which relate to domain knowledge.
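As a sketch of what the first, metadata-level depth could look like in practice, the checks below verify structure and completeness before any business rules are applied. The expected schema, thresholds, and business rule are placeholders, not prescriptions:

import pandas as pd

# Illustrative expectations for one ingested table.
EXPECTED_SCHEMA = {"order_id": "int64", "amount": "float64", "created_at": "object"}

def metadata_checks(df: pd.DataFrame, min_rows: int = 1000, max_null_rate: float = 0.05) -> list:
    """Depth 1: structural checks that require no domain knowledge."""
    errors = []
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
    for col, dtype in EXPECTED_SCHEMA.items():
        if col in df.columns and str(df[col].dtype) != dtype:
            errors.append(f"column {col} is {df[col].dtype}, expected {dtype}")
    if len(df) < min_rows:
        errors.append(f"only {len(df)} rows arrived, expected at least {min_rows}")
    worst_null_rate = df.isna().mean().max()
    if worst_null_rate > max_null_rate:
        errors.append(f"null rate {worst_null_rate:.1%} exceeds {max_null_rate:.0%}")
    return errors

def business_checks(df: pd.DataFrame) -> list:
    """Depth 2: domain rules, applied once the structure is verified."""
    return [] if (df["amount"] >= 0).all() else ["negative order amounts found"]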

3. See Demos of Different Data Monitoring Tools 

Once you’ve mapped out your priorities and pain points, it’s time to find a tool that can automate this process for you. Don’t hesitate to see demos of different tools and ask the hard questions about data quality assurance. For example, to see a demo of Databand, click here. To learn more about data quality and hear the entire episode this blog post was based on, visit our podcast, here.

How to evade tech debt jail while building data pipelines

Databand
2021-08-02 09:53:55

Data tech debt is something you can feel without ever having to see it. 

You can get a gut feel that the data pipelines are wonky—that transformations are occasionally inexplicable, or that the resultant values “seem off.” You know there’s a ghost in the machine. But it’s only with observability tools that you can find the root causes. And once you identify them, you’ll realize that “debt” is a nearly perfect analogy. 

Like monetary debt, tech debt always sounds like a good idea at the time. You think, “I’ll get the thing now, pay for it later, and we’ll be fine.” But you can never understand the true cost upfront. Delayed decisions saddle your future self with compounding obligations. After not too long, you can find yourself in data debt jail: forever paying the interest in the form of trouble tickets, unable to address the principal. 

So how do you avoid tech debt while building data pipelines? You start by understanding how it gets in there in the first place.

What is the purpose of a data pipeline? 

To understand how tech debt works its way into data pipelines, it's helpful to return to the basics. A data pipeline is a set of programmed actions that extract data from a source, transform it, and load it into another system. It's data in motion. And things in motion have a habit of shifting.

Anywhere there are variable elements, and you’re time-constrained and forced to make tradeoffs, debt can slip in. 

The three big areas of data pipeline variability are:

  1. The data pipeline changes the data: Transformations are inherent—whether before or after you load—and those changes introduce the potential for error.
  2. The data itself changes: If a partner changes their API or schema, that data may not be delivered, or delivered wrong. 
  3. The data pipeline itself changes: You’ll likely develop and improve your pipeline. Each new stage, transformation, or source introduces the potential for error.

And, all these changes collide and interact, like molecules in a storm cell. They create a system that’s not merely complicated (many moving parts) but complex (many interrelated parts). That makes pipelines a recursive problem, compounded by people.

If an error introduced at extraction leads to null values that lead to wrong (but not incomplete) data on an end dashboard, someone may make a decision based on it. Let’s say someone on the product team pulls a proverbial “break glass in case of emergency” lever and calls everyone on duty to react to a steep usage dropoff that didn’t really happen. Now, you cannot simply fix the extraction error. You have to fix your system to guard against such errors, but also accept that your product team now has data trust issues. Those issues may make future alerts ineffective, and if people aren’t using the data, the data and pipeline system can decay. The errors compound and cascade. 

For that reason, knowing exactly what goes wrong and where, and catching it early (and preempting it before anyone else knows) is core to building data pipelines.


How do you create reliable data pipelines? 

To create reliable data pipelines, we recommend following five steps in the planning stages. Debt is easiest to eradicate before it exists. One of the most helpful places to begin? By drawing your data pipeline.

1. First, diagram your strategy

Draw a diagram of your data pipeline architecture, whether in PowerPoint, Miro, or on actual, physical paper. The value of this exercise is you may find that some areas are difficult to draw. Perhaps you leave a big question mark. Those are areas to investigate. What are the hidden dependencies? What’s missing from your understanding? 

Specifically, use that diagram to define:

  • The questions users can answer with this data
  • What exists upstream
  • Dependencies at each stage
  • Systems and tools at each stage (current or desired)
  • Functional changes
  • Non-functional changes
  • Data owners at each stage (and who needs to be notified)
  • Other considerations when building data pipelines

Don’t get too caught up comparing your data pipeline architecture diagram to someone else’s at a different company. Each is as unique as each business. As outdoor adventurers say, there is no bad weather—only bad equipment. In data engineering, there are no bad data pipeline tools—only wrong applications. Don’t hate the tool, hate the use case.

Pictured, an example of different tools you can use at different stages of your data stack:


2. Build for data quality

Building data pipelines for data quality means starting with the assumption that your pipeline needs to guarantee data fitness, lineage, governance, and stability. This is not a common approach. Without the understanding that quality matters most, teams tend to build data pipelines for throughput. It’s, “Can we get the data there?” not “Can we get high-quality data there?”

Thinking about quality can encourage you to think differently about the importance of storing events and states as compared to latency. Building for quality starts in your data architecture diagram.

3. Build for continuous integration and deployment (CI/CD)

Testing is cheap, and collaboration and versioning tools like Git and GitLab mean you really have no excuse not to practice CI/CD when building data pipelines. It's a best practice, and given the temporal and chaotic nature of data quality issues, debt will accrue while you wait for the next release window.
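For instance, even a small pytest suite that runs on every push can stop a broken transformation from reaching the release window. The dedup_records function below is a hypothetical transformation used only to show the shape of such a test:

# test_transformations.py: run automatically by your CI pipeline on every push.
import pandas as pd

def dedup_records(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation under test: keep the latest record per id."""
    return df.sort_values("updated_at").drop_duplicates("id", keep="last")

def test_dedup_keeps_latest_record():
    df = pd.DataFrame({
        "id": [1, 1, 2],
        "updated_at": ["2021-01-01", "2021-06-01", "2021-03-01"],
        "value": [10, 20, 30],
    })
    result = dedup_records(df)
    assert len(result) == 2
    assert result.loc[result["id"] == 1, "value"].item() == 20

def test_dedup_preserves_schema():
    df = pd.DataFrame({"id": [], "updated_at": [], "value": []})
    assert list(dedup_records(df).columns) == ["id", "updated_at", "value"]

Wiring a suite like this into GitLab CI or a similar runner is a small, one-time setup, and it means every change to the pipeline code is validated before it is deployed.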

4. Build to debug the process, not just the code

Build full pipeline observability into the architecture from day one. As we’ve discussed before, “Building an airplane while in flight” is not the right analogy. “Building an architecture” is. Your pipeline needs to be built to track and monitor every component so you can isolate incidents. You need context for system metrics and a deeper view of operations. You need alerts for when things go wrong, in Slack or via PagerDuty, so you can address and correct them before the debt accrues. 

Specifically, an observability tool (like Databand) can provide: 

  • Alerts on performance and efficiency bottlenecks before they affect data delivery
  • A unified view of your pipeline health, including logs, errors, and data quality metrics
  • Seamless connection to your data stack
  • Customizable metrics and dashboards
  • Fast root cause analysis to resolve issues when they are found
  • Insights into building data pipelines

When you can see everything in your data pipeline, you're more likely to identify data tech debt early. That is initially more work, but it's a big time-saver: the upfront price of addressing errant transformations or code snippets pays dividends, because you won't discover the foundation is cracked only after the entire company relies on it.

Krishna Puttaswamy and Suresh Srinivas on Uber’s data engineering team explain it this way:

“While services and service quality tend to get more focus due to immediate visibility in failures/breakage, data and related tools often tend to take a backseat. But fixing them and bringing them on par with the level of rigor in service tooling/management becomes extremely important at scale, especially if data plays a critical role in product functionality and innovation.”

5. Front-load the difficult decisions

Take a tip from couples therapy: Address things in the moment, as they arise. Don’t let issues fester. As part of your data operations manifesto, announce that you’ll never put off a difficult decision because you understand the compounding cost. Make that public, make it part of your culture, and make it a reality. 

This is not to say you can't run tests. If two technologies seem like equivalents and the decision is reversible, just try it. But where the decision isn't reversible, and building everything on top of the component you're selecting would restrict your future choices, take the time.

Ask leadership for the latitude to take time to make difficult decisions, so you make them early, and don’t put things off. Delaying decisions saddles you with future obligations and that is the source of nearly all data tech debt.

Building data pipeline architectures to be tech debt free

The best data pipelines are built by the experienced. It helps to have placed pipelines into production and felt the fear of failure to know what it takes to build good ones. Mistakes are the best teacher. But, you can avoid many of them all the same by knowing your sources of variability, documenting carefully, and following the five steps outlined above. 

If you diagram, build for quality, integrate continuously, implement an observability tool, debug the process itself, and front-load difficult decisions, you’re far better off than most.

And remember. There’s no cognitive error more common in engineering than shackling your future self with all manner of obligations because you took shortcuts. We always imagine our future selves to have a lot more free time than our present selves. But it ends up, they’re a lot like us. You’ll be just as busy then, if not more so. Do yourself a favor and stay out of data tech debt jail. Go slow to go fast when building data pipelines.

Advanced alerting on Airflow with Databand.ai in 3 steps

Databand
2020-05-05 08:13:00

Comprehensive monitoring in Airflow is hard to achieve — let alone advanced alerting in Airflow. While Airflow has a great UI for monitoring high-level schedules and task statuses, there's a lot of room for improvement with regard to pipeline visibility. In fact, nearly half of respondents reported that logging, monitoring, and alerting on Airflow are top challenges or areas of improvement in the 2020 Airflow user survey. As power users and contributors ourselves, we're intimately familiar with these challenges.

This is a big problem because Airflow is a standard in the data engineering toolkit. It's become so important that huge organizations like Adobe, Walmart, and Twitter have actively contributed to Airflow's development. That's because these organizations have invested record amounts in their data infrastructure and the creation of data products.

That means the stakes are higher than ever for data engineers. They have the difficult task of discovering and fixing data quality problems before they reach their consumers.

A data engineer's primary function is to create trust in their data. As a data engineer, your goal should be creating a data architecture that makes data production so accurate and efficient that your data consumers forget your Slack handle and email address. The only way you can do that is if you can see issues in your Airflow pipeline before they affect data delivery. You need to proactively solve problems so that your data SLAs aren't at risk. In this article, we'll show you how to implement advanced alerting in Airflow using Databand.

Monitoring for Airflow and 20+ other tools on Databand.ai

Databand.ai is a unified data observability platform that helps you identify and prioritize the pipeline issues that impact your data product the most.

The Databand Platform tracks custom pipeline metadata and provides alerting, trend analysis, and debugging tools so you can compare pipeline performance and quickly discover bottlenecks.

Databand.ai can tap into metadata like runtime statuses, logging info, and data metrics, and uses that info to build alerts on leading indicators of data health issues. You can proactively maintain the integrity of your data infrastructure and the quality of your data product with Databand. Databand.ai sends alerts to your email, Slack, PagerDuty, and other channels, so your DataOps teams can get ahead of any pipeline issues.

Early Warning Signals on Long-Running Processes

Airflow is often used for orchestrating long-running processes that take hours, days, or longer to complete.

Let's take this example pipeline (also called a “DAG”) of a three-step Python process:
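Sketched in code (Airflow 2.x imports assumed, task names borrowed from the metrics discussed later in this article), such a DAG might look like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the real task logic.
def unit_imputation():
    ...  # replace NaNs in the raw data set

def dedup_records():
    ...  # drop duplicate records

def create_report():
    ...  # aggregate results and compute KPIs

with DAG(
    dag_id="market_data_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="0 12 * * *",  # illustrative midday run
    catchup=False,
) as dag:
    impute = PythonOperator(task_id="unit_imputation", python_callable=unit_imputation)
    dedup = PythonOperator(task_id="dedup_records", python_callable=dedup_records)
    report = PythonOperator(task_id="create_report", python_callable=create_report)

    impute >> dedup >> report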

What makes a simple pipeline like this a long running process? A lot of factors could be the culprit, including abnormally large volumes of “big” data, complex task logic, or database pooling causing waits in a queue.

Poor data quality and late delivery in critical pipelines spell trouble for your data product.  End-users like data scientists, analysts, and business users depend on pipelines finishing within a certain window of time.

Being able to receive an early warning signal of pipeline failures or delays is a huge advantage—especially when organizations lose money in these situations.

As a real-world example, let’s say our simple pipeline above was built by a stock trading company for analyzing market data. If the job starts at midday using prior day trading data and takes 4 hours to complete, it would need to finish around 4 pm to leave time for analysis and next-day strategy planning. If the job is running late, you want that notification as early in the process as possible.

Whereas an alert later in the process could mean losing an entire day of work, getting an alert at 1pm that tasks are delayed lets you get started on a resolution now. That early notice gives your team the ability to identify the cause of delay, fix the problem, and rerun the job before delivery is due.

Databand.ai can give you an early heads-up so you can intervene and fix the problem fast without losing time and money on wasted processing. With zero changes to your project’s existing code, you can use Databand to create robust alerting that gives you alerts on leading indicators of failures and delays.

Step 1 — Creating Leading Indicators of Problems

While you can use Databand’s production alerting system to notify on the duration of the overall pipeline, this alert is a lagging indicator of a problem. We want leading indicators — early warning signs.

A better approach is to use Databand.ai to alert on runtime properties of individual tasks in the pipeline. Databand.ai will monitor tasks as they run and send alerts if there are failures (or some other status change), if a task has an unusual number of retries, or if task duration exceeds some threshold. You can also receive an alert if a task does not start at the expected time. Setting these alerts on task runtime properties will give you insight into when data is at risk of late delivery much earlier than traditional monitoring methods would.

Step 2 — Diving into the Data

So far, you’ve learned how to set run and task level alerts. With Databand helping you track those metrics, you’ll have a much stronger detection system for pipeline issues.

What are you missing? 

To properly support your data consumers, it's not enough to know that pipelines will complete on time. You also need to know that the data you're delivering is up to quality standards. Can the end-user actually work with it?

Luckily, you can use the Databand logging API to report on metrics about your data sets.

Databand can automate this process, but for this example, you’ll be defining which custom metrics are going to be reported whenever your pipeline runs.

When Airflow runs the pipeline code, Databand's logging API will report metrics to Databand's tracking system, where you can alert on things like schema changes, data completeness rules, or other integrity checks.

In this example, the logging API reports the following metrics from the pipeline (a sketch of the reporting code follows the list):

  • avg score: a custom KPI reported from the create_report task
  • number of columns: schema info about the data set being processed
  • removed duplicates: reported from the dedup_records task
  • replaced NaNs: this metric is reported from the first task in the pipeline, unit_imputation
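As a rough sketch of how metrics like these can be emitted, DBND's log_metric call is dropped into each task's Python callable; the data manipulation around it is illustrative:

import pandas as pd
from dbnd import log_metric

def unit_imputation(df: pd.DataFrame) -> pd.DataFrame:
    nan_count = int(df.isna().sum().sum())
    log_metric("replaced NaNs", nan_count)               # reported from the first task
    return df.fillna(0)

def dedup_records(df: pd.DataFrame) -> pd.DataFrame:
    deduped = df.drop_duplicates()
    log_metric("removed duplicates", len(df) - len(deduped))
    return deduped

def create_report(df: pd.DataFrame) -> pd.DataFrame:
    log_metric("number of columns", df.shape[1])         # schema info about the data set
    log_metric("avg score", float(df["score"].mean()))   # custom KPI
    return df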

Databand.ai can alert on any metrics reported to the system, and you are free to create as many as needed.

Step 3 — Using Metrics as Progress Checks

Now, it's time to add data visibility into your alerting process.

Using Databand.ai's advanced alerting and anomaly detection, you can use the same metrics tracking function to gain insight into internal task progress. This is particularly useful when you have complex pipelines with lots of tasks and subprocesses, or a few tasks with complex logic inside them.

With conditional alerts, you can create intra-task status checks by tracking whether a metric fails to report within a certain timeframe while a task is running.

In our example, the first task, unit_imputation, reports the number of replaced NaNs in the data set. Looking at historical run trends, you can expect the overall task to complete within 1 hour. Based on where you place the metrics log in your code, the metric is usually reported about 30 minutes after the task starts. You can use this expected behavior to create a conditional alert that gives you great insight into what's happening inside your process.

Alert Logic:

IF the duration of the unit_imputation task is greater than 45 minutes AND the replaced NaNs metric is missing, THEN trigger an alert

First, let's describe the logic behind this alert.

The NaNs metric should be reported about halfway into the task's runtime, and the task typically takes 1 hour to fully complete. Your alert adds some buffer to that, saying you want a notification if the task runs for 45 minutes without sending the metric. Missing the metric over this duration serves as an early warning that the pipeline is hung up on something that could lead to further delays downstream. Since the alert is set on the first task of the pipeline, you have enough time to investigate the issue and restart the process after a fix.

And what was required to set this up? Only a single use of the logging API. 

Databand’s alerting framework configures the rest in minutes.

Airflow alerting for staying proactive, not reactive

After following the provided logic, your alerting coverage includes:

  • Overall run durations
  • Task durations
  • Data quality metrics
  • Internal task progress

You're now in a much better position to anticipate failures, delays, and problems in data quality before they affect your data product.

Databand.ai makes the process easy. Get your first alerts up and running with out-of-the-box metrics tracking, and get deeper performance insights with Databand.ai.

Start your 30-day free trial and get deeper visibility into your critical data pipelines.

Best Practices in Building an Ops-Ready ML Pipeline

Databand
2020-04-02 12:54:52


Our company Databand focuses on data engineering observability. We help teams monitor large scale production data pipelines, for ETL and ML use cases. We’re excited to lead a workshop on Machine Learning Observability and MLOps at the upcoming ODSC conference.

By the end of the workshop, you’ll learn how to structure data science workflows for production automation and introduce standard logging for performance measurement. In other words, make the process observable!

This blog describes what we’re covering in the workshop and the earlier webinar. You can find more info on the ODSC sessions and resources at the end of the page. Hope to see you there!

 

The Challenge

9 out of every 10 data science projects will fail to make it into production. It's such a widely discussed problem that it's become a cliché statistic. But for the lucky 10%, finally making it into production is not the end of the story. After you're in production, you're now in the business of maintaining your system.

Why is that hard to do? The “Ops” practices for data science and engineering are not yet defined. On top of that, the Ops professionals who focus on managing production applications are accustomed to a certain way of working and have expectations about the systems they operate. They have tools and best practices for monitoring applications, with measurable performance indicators and testing procedures that give them confidence in deploying services to production. There's nothing of this sort that helps Ops teams manage machine learning production today.

Our goal in the workshop will be to make a data science workflow “Ops-ready” for production.

 

What is Production Anyway?

Before we go any further, let’s define what we mean by “production” in the ML context because it’s not always straightforward.

There are usually two related activities in ML production:

  • Running the model — the process of using your model(s) to generate predictions on a live data set. Done as either an online (real-time) or offline (batch) process.
  • Maintaining the model — the process of running your model training workflow on new production data to retrain your model.

Retraining tells you if you need to update your model and is usually a scheduled process that runs on a weekly basis (give or take). Without retraining, models will degrade in performance as your data naturally changes.

For the workshop, we are focusing on maintenance/retraining. Why? First, if your model is not maintainable, all the value of “getting into production” will be pretty short-lived. Second, when you have the right fundamentals and tools in place, maintenance is not so difficult to do. So the net gain is high.

 

The Workshop

During the workshop, we’ll start from a Python model training script in a Jupyter Notebook and transform it into a production retraining pipeline that’s observable, measurable, and manageable from the Ops perspective.

We are going to follow three steps to transform our training code into an observable pipeline.

  1. Functionalize our workflow
  2. Introduce logging
  3. Convert our code to a production pipeline (DAG)

To introduce logging and measurement into our script we’ll use Apache DBND — Databand’s open source library for tracking metadata and operationalizing workflows.

For running the production pipeline, we’ll use Apache Airflow, our preferred workflow scheduler. Airflow is our team’s go-to system for managing production workflows. Airflow’s great at two things: orchestration and scheduling. Orchestration is the ability to run workflows as atomic, interconnected tasks that run in a particular order. A workflow in Airflow is called a DAG. Scheduling is executing DAGs at a particular time. NOTE FOR THE WORKSHOP: We don’t expect attendees to be running Airflow, but we’ll use it from the presenter side to demonstrate our process in production.

 

Functionalizing

The first thing we’ll do to operationalize our workflow is functionalize it, splitting up our steps into discrete functions. The reason we do this is to make the script more modular and debuggable. When running in production, that will make it easier to isolate problems, especially as the workflow grows in complexity in future iterations.
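For example, a notebook that loads data, trains, and evaluates in a single cell might be split into functions like these (the scikit-learn model, file path, and column names are placeholders):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def prepare_data(path: str) -> pd.DataFrame:
    """Load and clean the raw training data."""
    return pd.read_csv(path).dropna()

def split_data(df: pd.DataFrame):
    """Separate features from the target and hold out a test set."""
    features, target = df.drop(columns=["target"]), df["target"]
    return train_test_split(features, target, test_size=0.2, random_state=42)

def train_model(x_train, y_train) -> RandomForestRegressor:
    """Fit the model on the training split."""
    return RandomForestRegressor(n_estimators=100).fit(x_train, y_train)

def evaluate(model, x_test, y_test) -> float:
    """Score the model so retraining runs are comparable over time."""
    return r2_score(y_test, model.predict(x_test))

if __name__ == "__main__":
    x_train, x_test, y_train, y_test = split_data(prepare_data("training_data.csv"))
    print("r2:", evaluate(train_model(x_train, y_train), x_test, y_test))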

 


 

Logging

Adding logging to our script will enable us to persist metrics and artifacts to an external system, collecting the metrics every time the Python code runs. This is where DBND comes in. DBND will collect and store our metrics in our file system so that we can measure performance in a standardized way.

DBND will track our workflow on three levels:

  • Function input and output (in our example, DataFrames and the Python model)
  • Data structure and schema
  • User defined performance metrics

Using these artifacts and metrics will make the workflow Ops-ready by enabling us to reproduce any run, maintain a historical record of performance, and make sure results are consistent at different stages of the development lifecycle (research, testing, production). We'll be able to introduce standards that Ops can use going forward to monitor the process for issues.

To introduce tracking & logging, all we need to do is annotate our functions with DBND decorators and define our metrics with DBND’s logging API.
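Building on the functionalized script above, that might look roughly like this; the @task decorator and log_metric come from the open-source DBND library, while the model and the metrics chosen are illustrative:

import pandas as pd
from dbnd import log_metric, task
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

@task
def train_model(x_train: pd.DataFrame, y_train: pd.Series) -> RandomForestRegressor:
    model = RandomForestRegressor(n_estimators=100).fit(x_train, y_train)
    log_metric("n_features", x_train.shape[1])    # data structure info
    return model                                  # inputs/outputs tracked by DBND

@task
def evaluate(model: RandomForestRegressor, x_test: pd.DataFrame, y_test: pd.Series) -> float:
    score = r2_score(y_test, model.predict(x_test))
    log_metric("r2_score", score)                 # user-defined performance metric
    return float(score)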

In research, we can visualize the metrics as a report directly in our Jupyter Notebook. In the workshop, we'll show more operations-oriented tools for observing metrics and performance in production.

 

Operationalizing

Our last step is transforming the workflow into a pipeline that we can run on Airflow as a scheduled DAG. After using the DBND library in our workflow, all we need to do to run the workflow as an Airflow DAG is to add a DAG definition that defines our functions as tasks and set the CRON schedule for the pipeline. When we add the DAG definition, each of our decorated functions will run as an Airflow task. As our pipeline runs on its schedule, DBND will continue to track inputs and outputs, data set information, and logged metrics, and store them in our file system.
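A rough shape of that DAG definition is sketched below. For simplicity, the sketch runs the decorated functions from a single PythonOperator on a weekly cron schedule; DBND's own Airflow integration can map each decorated function to its own Airflow task, so the wiring shown in the workshop may differ. The retraining module and data path are hypothetical.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from retraining import evaluate, prepare_data, split_data, train_model  # hypothetical module

def retrain():
    """Run the DBND-decorated functions; metrics and artifacts are tracked on every run."""
    x_train, x_test, y_train, y_test = split_data(prepare_data("data/latest_production.csv"))
    model = train_model(x_train, y_train)
    evaluate(model, x_test, y_test)

with DAG(
    dag_id="model_retraining",
    start_date=datetime(2020, 4, 1),
    schedule_interval="0 6 * * 1",  # retrain weekly, Mondays at 06:00
    catchup=False,
) as dag:
    PythonOperator(task_id="retrain_model", python_callable=retrain)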

 

Wrapping Up

At the end of the workshop, we'll have transformed a model training workflow built in a notebook into a retraining DAG that runs on a regular schedule. We'll have introduced standard logging and tracking that make sure the process is reproducible, testable, and measurable. With this infrastructure, data scientists will be more productive in pushing research to market, and Ops will feel confident that the production ML system is maintainable.

What’s next? Rinse and repeat!

 

Resources

Here is a link to our repo for the workshop.

The ODSC webinar is scheduled for 1:00pm EST on April 7th. Here is the link for signup: https://register.gotowebinar.com/register/5659030622578494477

The interactive workshop itself will be held on April 15th at 9:30am EST. You can visit the ODSC website for more info on how to register.

 
