What is Data Lineage?

Databand
2022-07-28 10:20:00

The term “data lineage” has been thrown around a lot over the last few years.

What started as an idea for connecting datasets quickly became a confusing term that is now often misused.

It’s time to bring order to the chaos and dig into what data lineage really is. Because the answer matters quite a lot. And getting it right matters even more to data organizations.

This article will unpack everything you need to know about data lineage, including:

  • What is it?
  • What’s the difference between data lineage and data provenance?
  • Why is it important?
  • What are common data lineage techniques?
  • What are data lineage best practices?
  • What is end-to-end lineage vs. data at rest lineage?
  • What are the benefits of end-to-end data lineage?
  • What should you look for in a data lineage tool?

What is data lineage?

Data lineage tracks data throughout its complete lifecycle, from its source to its end location, and notes any changes along the way: what changed, why it changed, and how it changed. And it does all of this visually.

Usually, it provides value in two key areas:

  1. Development process: Knowing what affects what and what could be the impact of making changes. 
  2. Debugging process: Understanding the severity, impact, and root cause of issues.

In general, it makes it possible to identify errors in data, reduce the risk associated with system and process changes, and increase trust in data. All of these are essential at a time when data plays such an integral role in business outcomes and decision-making.

Data lineage in action: A simplified example

When data engineers talk about data lineage, they often imagine a data observability platform that lets them understand the logical relationships between the datasets affecting each other in a specific business flow.

[Image: data lineage graph of a simplified ELT flow]

In this very simplified example, we can see an ELT flow:

  • Some pipeline tasks, probably run by Airflow, are scraping external data sources and collecting data from them.
  • Those tasks save the extracted data in the data lake (or warehouse or lakehouse).
  • Other tasks, probably SQL jobs orchestrated with DBT, run transformations on the loaded data. They query raw data tables, enrich them, join tables, and create business data – all ready to be used.
  • Dashboarding tools such as Tableau, Looker, or Power BI sit on top of the business data and provide visibility to multiple stakeholders.
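To make that flow concrete, here is a minimal sketch of the extract-and-load step as an Airflow DAG. It is an illustration only: the DAG id, the URL, and the helper functions are assumptions, not a real Databand or customer pipeline.

```python
# A toy extract-and-load task, assuming Airflow 2.x. The scrape_source and
# load_to_lake helpers are illustrative stand-ins for real extraction logic.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def scrape_source(url: str) -> list[dict]:
    # Stand-in for scraping an external data source.
    return [{"sku": "A1", "amount": 42.0}]

def load_to_lake(records: list[dict], table: str) -> None:
    # Stand-in for writing raw records to the lake's raw zone.
    print(f"loaded {len(records)} records into {table}")

def extract_and_load():
    records = scrape_source("https://example.com/api/sales")
    load_to_lake(records, table="raw.sales")

with DAG(
    dag_id="elt_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
):
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
```

Downstream of a task like this, SQL transformations (e.g. a DBT job) would turn raw.sales into business-ready tables that the dashboards read.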

What’s the difference between data lineage and data provenance?

Data lineage and data provenance are often viewed as one and the same. While the two are closely related, there is a difference.

Whereas data lineage tracks data throughout the complete lifecycle, data provenance zooms in on the data origin. It provides insight into where data comes from and how it gets created by looking at important details like inputs, entities, systems, and processes for the data.

Data provenance can support error tracking as part of understanding data lineage, and it can also help validate data quality.

Why is it important?

As businesses use more big data in more ways, having confidence in that data becomes increasingly important – just look at Elon Musk’s deal to buy Twitter for an example of trust in data gone wrong. Consumers of that data need to be able to trust in its completeness and accuracy and receive insights in a timely manner. This is where data lineage comes into play.

Data lineage instills this confidence by providing clear information about data origin and how data has moved and changed since then. In particular, it is important to key activities like:

  • Data governance: Understanding the details of who has viewed or touched data and how and when it was changed throughout its lifecycle is essential to good data governance. Data lineage provides that understanding to support everything from regulatory compliance to risk management around data breaches. This visibility also helps ensure data is handled in accordance with company policies.
  • Data science and data analytics: Data science and data analytics are critical functions for organizations that are using data within their business models, and powering strong data science and analytics programs requires a deep understanding of data. Once again, data lineage offers the necessary transparency into the data lifecycle to allow data scientists and analysts to work with the data and identify its evolutions over time. For instance, data lineage can help train (or re-train) data science models based on new data patterns.
  • IT operations: If teams need to introduce new software development processes, update business processes, or adjust data integrations, understanding any impact to data along the way – as well as where data might need to come from to support those processes – is essential. Data lineage not only delivers this visibility, but it can also reduce manual processes associated with teams tracking down this information or working through data silos.
  • Strategic decision making: Any organization that relies on data to power strategic business decisions must have complete trust that the data they’re using is accurate, complete, and timely. Data lineage can help instill that confidence by showing a clear picture of where data has come from and what happened to it as it moved from one point to another.
  • Diagnosing issues: Should issues arise with data in any way, teams need to be able to identify the cause of the problem quickly so that they can fix it. The visibility provided by data lineage can help make this possible by allowing teams to visualize the path data has taken, including who has touched it and how and when it changed.

What are common data lineage techniques?

There are several commonly used techniques for data lineage that collect and store information about data throughout its lifecycle to allow for a visual representation. These techniques include:

  • Pattern-based lineage: Evaluates metadata for patterns in tables, columns, and reports rather than relying on any code to perform data lineage. This technique focuses directly on the data (vs. algorithms), making it technology-agnostic; however, it is not always the most accurate technique.
  • Self-contained lineage: Tracks data movement and changes in a centralized system, like a data lake that contains data throughout its entire lifecycle. While this technique eliminates the need for any additional tools, it does have a major blind spot to anything that occurs outside of the environment at hand.
  • Lineage by data tagging: A transformation engine that tags every movement or change in data allows for lineage by data tagging. The system can then read those tags to visualize the data lineage. Similar to self-contained lineage, this technique only works for contained systems, as the tool used to create the tags will only be able to look within a single environment.
  • Lineage by parsing: An advanced form of data lineage that reads the logic used to process data. Specifically, it provides end-to-end tracing by reverse engineering data transformation logic. This technique can get complicated quickly, as it requires an understanding of all the programming logic used throughout the data lifecycle (e.g. SQL, ETL, Java, XML).
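As a taste of lineage by parsing, here is a minimal sketch that recovers table-level dependencies from a SQL transformation. It assumes the sqlglot library (one of several SQL parsers that could fill this role); the query and table names are illustrative.

```python
# Extract the upstream tables a SQL transformation reads, by walking the
# parsed syntax tree. Assumes: pip install sqlglot
from sqlglot import exp, parse_one

sql = """
SELECT o.order_date, SUM(o.amount) AS revenue
FROM raw.orders AS o
JOIN raw.customers AS c ON o.customer_id = c.id
GROUP BY o.order_date
"""

tree = parse_one(sql)
sources = sorted(
    f"{t.db}.{t.name}" if t.db else t.name
    for t in tree.find_all(exp.Table)
)
print(sources)  # ['raw.customers', 'raw.orders']
```

A real parsing-based lineage tool repeats this across every query, ETL script, and job definition in the stack, then stitches the per-statement dependencies into one graph.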

What are data lineage best practices?

When it comes to introducing and managing data lineage, there are several best practices to keep in mind:

  • Automate data lineage extraction: Manual data lineage centered around spreadsheets is no longer an option. Capturing the dynamic nature of data in today’s business environments requires an automated solution that can keep up with the pace of data and reduce the errors associated with manual processes.
  • Bring metadata source into data lineage: Systems that handle data, like ETL software and database management tools, all create metadata – or data about the data they handle (meta, right?). Bringing this metadata source into data lineage is critical to gaining visibility into how data was used or changed and where it’s been throughout its lifecycle.
  • Communicate with metadata source owners: Staying in close communication with the teams that own metadata management tools is critical. This communication allows for verification of metadata (including its timeliness and accuracy) with the teams that know it best.
  • Progressively extract metadata and lineage: Progressive extraction – or extracting metadata and lineage in the same order as it moves through systems – makes it easier to do activities like mapping relationships, connections, and dependencies across the data and systems involved.
  • Progressively validate end-to-end data lineage: Validating data lineage is important to make sure everything is running as it should. Doing this validation progressively by starting with high-level system connections, moving to connected datasets, then elements, and finishing off with transformation documentation simplifies the process and allows it to flow more logically.
  • Introduce a data catalog: Data catalog software makes it possible to collect data lineage across sources and extract metadata, allowing for end-to-end data lineage.

What is end-to-end lineage vs. data at rest lineage?

When talking about lineage, most conversations tackle the scenario of data “in the warehouse,” which presumes everything is occurring in a contained data warehouse or data lake. In these cases, lineage monitors the executions performed on specific tables (or sets of tables) to extract the relationships within or among them.

At Databand, we refer to this as “data at rest lineage,” since it observes the data after it has already been loaded into the warehouse.

This data at rest lineage can be troublesome for modern data organizations, which typically have a variety of stakeholders (think: data scientist, analyst, end customer), each of which has very specific outcomes they’re optimizing toward. As a result, they each have different technologies, processes, and priorities and are usually siloed from one another. Data at rest lineage that looks at data within a specific data warehouse or data lake typically doesn’t work across these silos or data integrations.

Instead, what organizations need is end-to-end data lineage, which looks at how data moves across data warehouses and data lakes to show the true, complete picture.

Consider the case of a data engineer who owns end-to-end processes spanning dozens of DAGs in different technologies. If that engineer encounters corrupted data, they want to know the root cause. They want to be able to proactively catch issues before they land on business dashboards and to track the health of the different sources on which they rely. Essentially, they want to be able to monitor the real flow of the data.

With this type of end-to-end lineage, they could see that a SQL query has introduced corrupted data to a column in a different table or that a DBT test failure has affected other analysts’ dashboards. In doing so, end-to-end lineage captures data in motion, resulting in a visual similar to the following:

[Image: end-to-end data lineage graph capturing data in motion]

What are the benefits of end-to-end data lineage?

Modern organizations need true end-to-end lineage because it’s no longer enough just to monitor a small part of the pipeline. While data at rest lineage is easy to integrate, it provides very low observability across the entire system.

Additionally, data at rest lineage is limited across development languages and technologies. If everything is SQL-based, that’s one thing. But the reality is, modern data teams will use a variety of languages and technologies for different needs that don’t get covered with the more siloed approach.

As if that wasn’t enough, most of the issues with data happen before it ever reaches the data warehouse, but data at rest lineage won’t capture those issues. If teams did have that visibility though, they could catch issues sooner and proactively protect business data from corruption.

End-to-end data lineage solves these challenges and delivers several notable benefits, including:

  • Clear visibility on impact: If there’s a schema change in the external API from which Python fetches data, teams need true end-to-end visibility to know which business dashboard will be affected. Gaining that visibility requires understanding the path of data in motion across environments and systems – something only end-to-end data lineage that tracks data in motion can provide.
  • Understanding of root cause: By the time an issue hits a table used by analysts, the problem is already well underway, stemming from further back in the data lifecycle. With data at rest lineage, it’s only possible to see what’s happening in that particular table, though – which isn’t helpful for identifying the cause of the issue. End-to-end lineage, on the other hand, can look across the complete lifecycle to provide clarity into the root cause of issues, wherever they turn up.
  • Ability to connect between pipelines and datasets: In a very complex environment where thousands of pipelines (or more!) are writing and reading data from thousands of datasets, the ability to identify which pipeline is working on a weekly, daily, or hourly basis and with which tables (or even specific columns within tables) is a true game-changer.
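To illustrate the impact-analysis idea behind these benefits, here is a minimal sketch that models pipelines and datasets as a directed graph and computes everything downstream of a failed task. It assumes the networkx library and invented node names; it is an illustration of the concept, not Databand’s implementation.

```python
# Impact analysis over an end-to-end lineage graph.
# Assumes: pip install networkx
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("task:extract_sales",     "dataset:raw.sales"),
    ("dataset:raw.sales",      "task:dbt_enrich_sales"),
    ("task:dbt_enrich_sales",  "dataset:business.sales"),
    ("dataset:business.sales", "task:refresh_bi_dashboard"),
])

failed = "task:extract_sales"
impacted = nx.descendants(lineage, failed)  # everything reachable downstream
print(f"{failed} failing impacts {len(impacted)} nodes: {sorted(impacted)}")
```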

What should you look for in a data lineage tool?

As data lineage becomes increasingly important, what should you look for in a data lineage tool?

Above all else, you need a tool that can power end-to-end data lineage (vs. data at rest lineage). You also need a solution that can automate the process, as manual data lineage simply won’t cut it anymore.

With those prerequisites in mind, other capabilities to consider when evaluating a data lineage tool include:

  • Alerts: Automated alerts should allow you to not just identify that an incident has occurred, but gain context on that incident before jumping into the details. This context might include high-level details like the data pipeline experiencing an issue and the severity of the issue.
  • View of affected datasets: The ability to see all of the datasets impacted by a particular issue in a single, bird's-eye view is helpful for understanding the effect on operations and the severity of the issue.
  • Visual of data lineage: Visualizing data lineage by seeing a graph of relationships between the data pipeline experiencing the issue and its dependencies allows you to gain a deeper understanding of what’s happening and what’s affected as a result. The ability to click into tasks and see the dependencies and impact to each one for a given task provides even more clarity when it comes to issue resolution.
  • Debugging within tasks: Finally, the ability to see specific errors within specific tasks allows for quick debugging of issues for faster resolution.

Getting it right

Data lineage isn’t a new concept, but it is one that’s often misunderstood. However, as data becomes more critical to more areas of business, getting it right is increasingly important.

Getting it right requires an understanding of exactly what data lineage is and why it’s so important. Additionally, it requires a thoughtful approach to addressing data lineage that matches the needs of a modern data organization – which means true end-to-end data lineage. And finally, it requires the right tool to support this end-to-end lineage in an automated way.


IBM Acquires Databand to Extend Leadership in Observability

Databand
2022-07-06 08:14:51

Today is a big day for the Databand community! 

We’re excited to announce that Databand has been acquired by IBM to extend its leadership in observability to the full stack of capabilities for IT — across infrastructure, applications, data, and machine learning. 

This is beyond exciting news for our team, our customers, and the broader data observability market. 

Follow the link to the official press release, read the transcript below, or request a demo to see Databand in action.

IBM Aims to Capture Growing Market Opportunity for Data Observability with Databand.ai Acquisition

Acquisition helps enterprises catch "bad data" at the source
Extends IBM's leadership in observability to the full stack of capabilities for IT -- across infrastructure, applications, data and machine learning

ARMONK, N.Y., July 6, 2022 /PRNewswire/ — IBM (NYSE: IBM) today announced it has acquired Databand.ai, a leading provider of data observability software that helps organizations fix issues with their data, including errors, pipeline failures and poor quality — before they impact their bottom line. Today’s news further strengthens IBM’s software portfolio across data, AI and automation to address the full spectrum of observability and helps businesses ensure that trustworthy data is being put into the right hands of the right users at the right time.

Databand.ai is IBM’s fifth acquisition in 2022 as the company continues to bolster its hybrid cloud and AI skills and capabilities. IBM has acquired more than 25 companies since Arvind Krishna became CEO in April 2020.

As the volume of data continues to grow at an unprecedented pace, organizations are struggling to manage the health and quality of their data sets, which is necessary to make better business decisions and gain a competitive advantage. A rapidly growing market opportunity, data observability is quickly emerging as a key solution for helping data teams and engineers better understand the health of data in their system and automatically identify, troubleshoot and resolve issues, like anomalies, breaking data changes or pipeline failures, in near real-time. According to Gartner, poor data quality costs organizations an average of $12.9 million every year. To help mitigate this challenge, the data observability market is poised for strong growth.[1]

Data observability takes traditional data operations to the next level by using historical trends to compute statistics about data workloads and data pipelines directly at the source, determining if they are working, and pinpointing where any problems may exist. When combined with a full stack observability strategy, it can help IT teams quickly surface and resolve issues from infrastructure and applications to data and machine learning systems.

Databand.ai’s open and extendable approach allows data engineering teams to easily integrate and gain observability into their data infrastructure. This acquisition will unlock more resources for Databand.ai to expand its observability capabilities for broader integrations across more of the open source and commercial solutions that power the modern data stack. Enterprises will also have full flexibility in how to run Databand.ai, whether as-a-Service (SaaS) or a self-hosted software subscription.

The acquisition of Databand.ai builds on IBM’s research and development investments as well as strategic acquisitions in AI and automation. By using Databand.ai with IBM Observability by Instana APM and IBM Watson Studio, IBM is well-positioned to address the full spectrum of observability across IT operations.

For example, Databand.ai capabilities can alert data teams and engineers when the data they are using to fuel an analytics system is incomplete or missing. In common cases where data originates from an enterprise application, Instana can then help users quickly explain exactly where the missing data originated from and why an application service is failing. Together, Databand.ai and IBM Instana provide a more complete and explainable view of the entire application infrastructure and data platform system, which can help organizations prevent lost revenue and reputation.

“Our clients are data-driven enterprises who rely on high-quality, trustworthy data to power their mission-critical processes. When they don’t have access to the data they need in any given moment, their business can grind to a halt,” said Daniel Hernandez, General Manager for Data and AI, IBM. “With the addition of Databand.ai, IBM offers the most comprehensive set of observability capabilities for IT across applications, data and machine learning, and is continuing to provide our clients and partners with the technology they need to deliver trustworthy data and AI at scale.”

Data observability solutions are also a key part of an organization’s broader data strategy and architecture. The acquisition of Databand.ai further extends IBM’s existing data fabric solution by helping ensure that the most accurate and trustworthy data is being put into the right hands at the right time – no matter where it resides.

“You can’t protect what you can’t see, and when the data platform is ineffective, everyone is impacted –including customers,” said Josh Benamram, Co-Founder and CEO, Databand.ai. “That’s why global brands such as FanDuel, Agoda and Trax Retail already rely on Databand.ai to remove bad data surprises by detecting and resolving them before they create costly business impacts. Joining IBM will help us scale our software and significantly accelerate our ability to meet the evolving needs of enterprise clients.”

Headquartered in Tel Aviv, Israel, Databand.ai employees will join IBM Data and AI, further building on IBM’s growing portfolio of Data and AI products, including its IBM Watson capabilities and IBM Cloud Pak for Data. Financial details of the deal were not disclosed. The acquisition closed on June 27, 2022.

To learn more about Databand.ai and how this acquisition enhances IBM’s data fabric solution and builds on its full stack of observability software, you can read our blog about the news or visit here: https://www.ibm.com/analytics/data-fabric.

About Databand.ai

Databand.ai is a product-driven technology company that provides a proactive data observability platform, which empowers data engineering teams to deliver reliable and trustworthy data. Databand.ai removes bad data surprises such as data incompleteness, anomalies, and breaking data changes by detecting and resolving issues before they create costly business impacts. Databand.ai’s proactive approach ties into all stages of your data pipelines, beginning with your source data, through ingestion, transformation, and data access. Databand.ai serves organizations throughout the globe, including some of the world’s largest companies in entertainment, technology, and communications. Our focus is on enabling customers to extract the maximum value from their strategic data investments. Databand.ai is backed by leading VCs Accel, Blumberg Capital, Lerer Hippeau, Differential Ventures, Ubiquity Ventures, Bessemer Venture Partners, Hyperwise, and F2. To learn more, visit www.databand.ai.

About IBM

IBM is a leading global hybrid cloud and AI, and business services provider, helping clients in more than 175 countries capitalize on insights from their data, streamline business processes, reduce costs and gain the competitive edge in their industries. Nearly 3,800 government and corporate entities in critical infrastructure areas such as financial services, telecommunications and healthcare rely on IBM’s hybrid cloud platform and Red Hat OpenShift to effect their digital transformations quickly, efficiently, and securely. IBM’s breakthrough innovations in AI, quantum computing, industry-specific cloud solutions and business services deliver open and flexible options to our clients. All of this is backed by IBM’s legendary commitment to trust, transparency, responsibility, inclusivity, and service. For more information, visit www.ibm.com.

Media Contact:
Sarah Murphy
IBM Communications
[email protected]

[1] Source: Smarter with Gartner, “How to Improve Your Data Quality,” Manasi Sakpal, July 14, 2021.

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

How Databand Achieves Automated Data Lineage

Databand
2022-06-16 15:51:43

Data lineage seems to be the hot topic for data platform teams. In fact, we’re doing an upcoming webinar on how data lineage is viewed in the industry, and how a more end-to-end approach solves a lot of issues with lineage.

In this blog, we’re going to walk through how Databand provides automated data lineage so you can easily diagnose pipeline failures and analyze downstream impacts.

Watch the video to see it in action or continue reading below.

Analyze alerts

Using automated data lineage typically starts with an alert. You can jump right into a lineage graph, but it’s important to first know why the graph is relevant. For example, on the Databand alert screen, you can see all the data incidents and their alerts in one view.

In this example, a critical alert fired around our “daily_sales_ingestion” pipeline – a business pipeline that processes our daily sales from SAP, runs transformations for different regions, and then sends the results to a BI layer.

Needless to say, this pipeline is critical for our business, since it processes sales from around the world and eventually shows the results to the business.

To diagnose the alert, select View Details to open the alert overview screen.

Understand impacted datasets

Before seeing the lineage graph, you can see the impact analysis across your affected datasets, pipelines, and operations. 

[Image: impact analysis across affected datasets, pipelines, and operations]

View data lineage

Once you’ve seen what has been impacted, you can visualize those impacts by selecting the data lineage tab. This graph shows all the dependency relationships between the pipeline that initially failed and any other dependencies that are impacted.

For example, we’re looking at tasks that write to a particular dataset, with that same dataset then being read by a subsequent task. The red text in each pipeline marks everything that was impacted by the initial failed task.

[Image: data lineage graph of dependent pipelines and datasets]

Let’s zoom in on the specific pipeline that failed. Here you can see that the task named “extract_regional_sales_to_S3” is what failed the pipeline.

By selecting the failed task, you can see which specific downstream datasets or tasks are impacted with a highlighted red box.

[Image: zoomed-in data lineage graph with the failed task selected]

Each time you select a different task, the graph changes which boxes are highlighted.

For example, if you select the dataset named “S3 – North America Daily SAP Sales Extract,” much of the red text remains, but the red boxes change.

This indicates that the “S3 – North America Daily SAP Sales Extract” dataset only impacts the highlighted red boxes downstream.

You’ll notice that this dataset has no dependencies on the downstream pipelines in the EU or Asia, but it does have dependencies in the North America pipeline labeled “na_sentiment_impact_analysis” and in the “serve_sales_results_to_bi” pipeline that serves our BI layer.

[Image: downstream datasets and tasks impacted by the selected dataset]

Quickly debug data incidents

And to make debugging easier, you can jump directly to a task from the data lineage graph. Now you can see the error that caused the pipeline to fail. 

This allows you to quickly debug errors and resolve them before any downstream impacts occur.

Wrapping it up

For more information on how Databand can help you achieve automated data lineage, check out our demo center or book a demo.

What is Good Data Quality for Data Engineers?

Databand
2021-03-02 19:47:21

In theory, data quality is everyone’s problem. When it’s poor, it degrades marketing, product, customer success, brand perception—everything. In theory, everyone should work together to fix it. But that’s in theory.

In reality, you need someone to take ownership of the problem, investigate it, and tell others what to do. That’s where data engineers come in.

In this guide, the Databand team has compiled a resource for grappling with data quality issues within and around your pipeline – not in theory, but in practice. And that starts with a discussion of what exactly constitutes data quality for data engineers. 

Data quality challenges for data engineers

Their perennial challenge? That everyone involved in using the data has a different understanding of what “data” means. And it’s not really their fault.

The further someone is from the source of that data and the data pipelines that carry it, the more they tend to engage in magical thinking about how it can be used, often simply for lack of awareness. According to one data engineer we talked to when researching this guide, “Business leaders are always asking, ‘Hey, can we look at sales across this product category?’ when on the backend, it’s virtually impossible with the current architecture.”

The importance of observability

Similarly, businesses rely on the data from pipelines they can’t fully observe. Without accurate benchmarks or a seasoned professional who can sense that output values are off, you can be data-driven right off a cliff.

What are the four dimensions of data quality?

While academic conceptions of data quality provide an interesting foundation, we’ve found that for data engineers, it’s different. Having diagnosed pipeline data quality issues for dozens of high-volume organizations over the last few years, we’ve seen that engineers need a simpler and more credible map. Only with that map can you begin to conceptualize systems that will keep data in proper order.

We’ve condensed the typical 6-7 data quality dimensions (you will find hundreds of variants online) into just four:

  • Fitness
  • Lineage
  • Governance
  • Stability

We also prefer the term “data health” to “data quality,” because it suggests it’s an ongoing system that must be managed. Without checkups, pipelines can grow sick and stop working.

Dimension 1: Fitness

Is this data fit for its intended use?

The operative word here is “intended.” No two companies’ uses are identical, so fitness is always in the eye of the beholder. To test fitness, take a random sample of records and test how they perform for your intended use.

Within fitness, look at:

  • Accuracy—does the data reflect reality? (Within reason. As they say, all models are wrong. Some are useful.)
  • Integrity—does the fitness remain high through the data’s lifecycle? (It’s a simple equation: Integrity = quality / time)
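As a minimal sketch of a fitness test along the lines described above, the snippet below samples records at random and checks the rules an intended use depends on. It assumes pandas, and the table, columns, and rules are illustrative, not a prescribed standard.

```python
# Sample records and test them against the rules your intended use relies on.
# Assumes: pip install pandas. The orders table here is invented.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [42.0, 13.5, -1.0, 99.9],       # a negative amount is suspect
    "currency": ["USD", "USD", "USD", None],  # a missing currency is suspect
})

sample = orders.sample(n=3, random_state=7)

checks = {
    "amounts are non-negative": bool((sample["amount"] >= 0).all()),
    "currency is always present": bool(sample["currency"].notna().all()),
}
for rule, passed in checks.items():
    print(f"{'PASS' if passed else 'FAIL'}: {rule}")
```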

Dimension 2: Lineage

Where did this data come from? When? Where did it change? Is it where it needs to be?

Lineage is your timeline. It helps you understand whether your data health problem starts with your provider. If it’s fit when it enters your pipeline and unfit when it exits, that’s useful information. 

Within lineage, look at:

  • Source—is my data source provider behaving well? E.g. Did Facebook change an API?
  • Origin—where did the data already in my database come from? E.g. Perhaps you’re not sure who put it there.

Dimension 3: Governance

Can you control it? 

These are the levers you can pull to move, restrict, or otherwise control what happens to your data. It’s the procedural stuff, like loads and transformations, but also security and access. 

Within governance, look at:

  • Data controls—how do we identify which data should be governed and which should be open? What should be available to data scientists and users? What shouldn’t?
  • Data privacy—where is there currently personally identifiable info (PII)? Can we automatically redact PII like phone numbers? Can we ensure that a pipeline that accidentally contains PII fails or is killed? (See the sketch after this list.)
  • Regulation—can we track regulatory requirements, ensure we’re compliant, and prove we’re compliant if a regulator wants to know? (Under GDPR, CCPA, NY SHIELD, etc.)
  • Security—who has access to the data? Can I control it? With enough granularity?
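Here is a minimal sketch of the two privacy controls mentioned in the list above: redacting phone-number-like PII, and killing a pipeline step when PII slips through. The regex and record shapes are illustrative assumptions, not a production-grade PII detector.

```python
# Redact phone-number-like strings, and fail fast when PII is detected.
import re

PHONE = re.compile(r"\+?\d[\d\-\s]{7,}\d")  # crude phone-number pattern

def redact(text: str) -> str:
    return PHONE.sub("[REDACTED]", text)

def assert_no_pii(records: list[str]) -> None:
    # Fail the pipeline step rather than propagate PII downstream.
    for record in records:
        if PHONE.search(record):
            raise ValueError(f"PII detected, failing task: {redact(record)}")

print(redact("Call me at 555-123-4567 tomorrow"))  # Call me at [REDACTED] tomorrow
assert_no_pii(["no pii here"])                     # passes silently
```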

Dimension 4: Stability

Is the data complete and available in the right frequency?

Your data may be fit, meaning your downstream systems function, but is it as accurate as it could be, and is that consistently the case? If your data is fit, but the accuracy varies widely, or it’s only available in monthly batch updates and you need it hourly, it’s not stable. 

Stability is one of the biggest areas where data observability tools can help. Pipelines are often a black box unless you can monitor what happens inside and get alerts.

To check stability, check against a benchmark dataset. 

Within stability, look at:

  • Consistency—does the data going in match the data going out? If it appears in multiple places, does it mean the same thing? Are weird transformations happening at predictable points in the pipeline?
  • Dependability—the data is present when needed. E.g. If I build a dashboard, it behaves properly and I don’t get calls from leadership.
  • Timeliness—is it on time? E.g. If you pay NASDAQ for daily data, are they providing fresh data on a daily basis? Or is it an internal issue?
  • Bias—is there bias in the data? Is it representative of reality? Take, for example, seasonality in the data. If you train a model for predicting consumer buying behavior and you use a dataset from November to December, you’re going to have unrealistically high sales predictions.

Now, bias of this sort isn’t completely imperceptible—some observability platforms (Databand being one of them) have anomaly detection for this reason. When you have seasonality in your data, you have seasonality in your data requirements, and thus seasonality in your data pipeline behavior. You should be able to automatically account for that.
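As a minimal sketch of seasonality-aware anomaly detection (an illustration of the idea, not Databand’s algorithm), the snippet below compares today’s row count only against past runs from the same weekday, so regular weekend dips don’t fire false alerts. It assumes pandas, and the counts are invented.

```python
# Flag an anomalous pipeline run by z-scoring today's row count against
# history from the same weekday. Assumes: pip install pandas
import pandas as pd

counts = pd.Series(
    [1000, 1020, 980, 995, 1010, 400, 390,   # Mon..Sun; weekends are low
     1005, 1015, 985, 990, 1000, 410, 395,
      995, 1025, 975, 1000, 1015, 405, 385,
     1010, 1018, 982, 992, 1008, 398, 392,
      120],                                   # today (a Monday): suspiciously low
    index=pd.date_range("2021-01-04", periods=29, freq="D"),
)

history, latest = counts[:-1], counts.iloc[-1]
peers = history[history.index.dayofweek == counts.index[-1].dayofweek]
z = (latest - peers.mean()) / peers.std()

if abs(z) > 3:
    print(f"ALERT: today's volume ({latest}) is {z:.1f} sigma from past Mondays")
```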

Quality data is balanced data

Good data quality for data engineers is when you have a data pipeline set up to ensure all four data quality dimensions: fitness, lineage, governance, and stability. But you must address all four.

As a data engineer, you cannot tackle one dimension of data quality without tackling all. That may seem rather inconvenient given that most engineers are inheriting data pipelines rather than building them from scratch. But such is the reality. 

If you optimize for one dimension—say, stability—you may be loading data that hasn’t yet been properly transformed, and fitness can suffer. The data quality dimensions exist in equilibrium.

How to balance all four dimensions of data quality

[Image: the four dimensions of data quality – fitness, lineage, governance, and stability]

To achieve a proper balance for data health, you need:

Data quality controls

What systems do you have for manipulating, protecting, and governing your data? With high-volume pipelines, it is not enough to trust and verify.

Data quality testing

What systems do you have for measuring fitness, lineage, governance, and stability? Things will break. You must know where, and why. 

Systems to identify data quality issues

If issues do occur—if a pipeline fails to run, or the result is aberrant—do you have anomaly detection to alert you? Or if PII makes it into a pipeline, does the pipeline auto-fail to protect you from violating regulation?

In short, you need a high level of data observability, paired with the ability to act continuously.

Common data pipeline data quality issues

As a final thought, when you’re diagnosing your data pipeline issues, it’s important to draw a distinction between a problem and its root cause. Your pipeline may have failed to complete. The proximal cause could have been an error in a Spark job. But the root cause? A corruption in the dataset. If you aren’t addressing issues in the dataset, you’ll be forever addressing issues.

Examples of common data pipeline quality issues: 

  • Non-unicode characters
  • Unexpected transforms
  • Mismatched data in a migration or replication process
  • Pipelines missing their SLA, or running late
  • Pipelines that are too resource-intensive or costly
  • Difficulty finding the root cause of issues
  • An error in a Spark job stemming from corruption in a dataset
  • A big change in your data volume or sizes

The more detail you get from your monitoring tool, the better. It’s common to discover proximal causes quickly, but then take days to discover the root cause through a taxing, manual investigation. Sometimes, your pipeline workflow management tool tells you everything is okay, but a quick glance at the output shows you nothing is okay, because the values are all blank. For instance, Airflow may tell you the pipeline succeeded, but no data actually passed through. Your code ran fine—Airflow gives you a green light, you’re good—but on the data level, it’s entirely unfit.
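One common remedy, sketched below, is an explicit data-level check after the load, so an empty output fails the run instead of sailing through green. The helper function and table name are illustrative assumptions.

```python
# Turn "green pipeline, blank data" into a hard failure with a post-load check.
def read_output(table: str) -> list:
    # Stand-in for reading the freshly written partition from the warehouse.
    return []  # simulate a run where nothing actually landed

def check_output_not_empty(table: str) -> None:
    rows = read_output(table)
    if not rows:
        # Raising here turns the orchestrator's green light red.
        raise ValueError(f"{table} received 0 rows; failing the run")

check_output_not_empty("business.daily_sales")  # raises ValueError
```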

Constant checkups, and the ability to peer deeply into your pipeline, are what let you maintain the right balance of fitness, lineage, governance, and stability to produce high-quality data. And high-quality data is how you support an organization in practice, not just in theory.

Databand.ai is a unified data observability platform built for data engineers. Databand.ai centralizes your pipeline metadata so you can get end-to-end observability into your data pipelines, identify the root cause of health issues quickly, and fix the problem fast. Sign up for a free trial or request a demo to learn more.