
What is Data Lineage?

Databand
2022-07-28 10:20:00

The term “data lineage” has been thrown around a lot over the last few years.

What started as an idea of connecting datasets quickly became a confusing term that now gets misused often.

It’s time to bring order to the chaos and dig into what data lineage really is. Because the answer matters quite a lot. And getting it right matters even more to data organizations.

This article will unpack everything you need to know about data lineage, including:

  • What is it?
  • What’s the difference between data lineage and data provenance?
  • Why is it important?
  • What are common data lineage techniques?
  • What are data lineage best practices?
  • What is end-to-end lineage vs. data at rest lineage?
  • What are the benefits of end-to-end data lineage?
  • What should you look for in a data lineage tool?

What is data lineage?

Data lineage tracks data throughout its complete lifecycle. It follows data from its source to its end location and notes any changes (including what changed, why it changed, and how it changed) along the way. And it does all of this visually.

Usually, it provides value in two key areas:

  1. Development process: Knowing what affects what and what could be the impact of making changes. 
  2. Debugging process: Understanding the severity, impact, and root cause of issues.

In general, it makes it possible to identify errors in data, reduce the risk associated with system and process changes, and increase trust in data. All of these are essential at a time when data plays such an integral role in business outcomes and decision-making.

Data lineage in action: A simplified example

When data engineers talk about it, they often imagine a data observability platform that allows them to understand the logical relationship between datasets that are affecting each other in a specific business flow.

[Image: data lineage graph of a simplified ELT flow]

In this very simplified example, we can see an ELT flow (a minimal pipeline sketch, in code, follows the list):

  • Some pipeline tasks, probably run by Airflow, are scraping external data sources and collecting data from them.
  • Those tasks are saving the extracted data in the data lake (or warehouse or lakehouse).
  • Other tasks, probably SQL jobs orchestrated with DBT, are running transformations on the loaded data. They are querying raw data tables, enriching them, joining tables, and creating business data – all ready to be used.
  • Dashboarding tools such as Tableau, Looker, or Power BI are being used on top of the business data and providing visibility to multiple stakeholders.
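To make the flow above concrete, here is a minimal, hypothetical Airflow-style sketch. The task names, the S3 path, and the dbt selector are illustrative assumptions, not part of any real project.

```python
# A minimal sketch of the ELT flow described above, assuming Airflow 2.x.
# Task names, the S3 path, and the dbt selector are illustrative only.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator


@dag(schedule_interval="@daily", start_date=datetime(2022, 1, 1), catchup=False)
def daily_sales_elt():
    @task
    def extract_external_source() -> str:
        # Scrape/collect data from an external source system or API.
        return "s3://example-lake/raw/sales/latest.parquet"  # illustrative path

    @task
    def load_to_lake(raw_path: str) -> str:
        # Persist the extracted data in the lake (or warehouse / lakehouse).
        return "raw.sales"  # illustrative raw table name

    # Transformations are typically SQL jobs orchestrated with dbt.
    transform = BashOperator(
        task_id="dbt_transform",
        bash_command="dbt run --select business_sales",  # illustrative selector
    )

    load_to_lake(extract_external_source()) >> transform


daily_sales_elt()
```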

What’s the difference between data lineage and data provenance?

Data lineage and data provenance are often viewed as one and the same. While the two are closely related, there is a difference.

Whereas data lineage tracks data throughout the complete lifecycle, data provenance zooms in on the data origin. It provides insight into where data comes from and how it gets created by looking at important details like inputs, entities, systems, and processes for the data.

Data provenance can help with error tracking as part of understanding data lineage, and it can also help validate data quality.

Why is it important?

As businesses use more big data in more ways, having confidence in that data becomes increasingly important – just look at Elon Musk’s deal to buy Twitter for an example of trust in data gone wrong. Consumers of that data need to be able to trust in its completeness and accuracy and receive insights in a timely manner. This is where data lineage comes into play.

Data lineage instills this confidence by providing clear information about data origin and how data has moved and changed since then. In particular, it is important to key activities like:

  • Data governance: Understanding the details of who has viewed or touched data and how and when it was changed throughout its lifecycle is essential to good data governance. Data lineage provides that understanding to support everything from regulatory compliance to risk management around data breaches. This visibility also helps ensure data is handled in accordance with company policies.
  • Data science and data analytics: Data science and data analytics are critical functions for organizations that are using data within their business models, and powering strong data science and analytics programs requires a deep understanding of data. Once again, data lineage offers the necessary transparency into the data lifecycle to allow data scientists and analysts to work with the data and identify its evolutions over time. For instance, data lineage can help train (or re-train) data science models based on new data patterns.
  • IT operations: If teams need to introduce new software development processes, update business processes, or adjust data integrations, understanding any impact to data along the way –  as well as where data might need to come from to support those processes – is essential. Data lineage not only delivers this visibility, but it can also reduce manual processes associated with teams tracking down this information or working through data silos.
  • Strategic decision making: Any organization that relies on data to power strategic business decisions must have complete trust that the data they’re using is accurate, complete, and timely. Data lineage can help instill that confidence by showing a clear picture of where data has come from and what happened to it as it moved from one point to another.
  • Diagnosing issues: Should issues arise with data in any way, teams need to be able to identify the cause of the problem quickly so that they can fix it. The visibility provided by data lineage can help make this possible by allowing teams to visualize the path data has taken, including who has touched it and how and when it changed.

What are common techniques?

There are several commonly used techniques for data lineage that collect and store information about data throughout its lifecycle to allow for a visual representation. These techniques include:

  • Pattern-based lineage: Evaluates metadata for patterns in tables, columns, and reports rather than relying on any code to perform data lineage. This technique focuses directly on the data (vs. algorithms), making it technology-agnostic; however, it is not always the most accurate technique.
  • Self-contained lineage: Tracks data movement and changes in a centralized system, like a data lake that contains data throughout its entire lifecycle. While this technique eliminates the need for any additional tools, it does have a major blind spot to anything that occurs outside of the environment at hand.
  • Lineage by data tagging: A transformation engine that tags every movement or change in data allows for lineage by data tagging. The system can then read those tags to visualize the data lineage. Similar to self-contained lineage, this technique only works for contained systems, as the tool used to create the tags will only be able to look within a single environment.
  • Lineage by parsing: An advanced form of data lineage that reads the logic used to process data. Specifically, it provides end-to-end tracing by reverse engineering data transformation logic. This technique can get complicated quickly, as it requires an understanding of all the programming logic used throughout the data lifecycle (e.g., SQL, ETL tools, Java, XML), as in the toy sketch after this list.
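As a rough illustration of lineage by parsing, the sketch below pulls source and target tables out of a single SQL statement with regular expressions. A real parser would handle full SQL grammars and multiple languages; the query and table names here are invented.

```python
# A toy sketch of "lineage by parsing": extract source and target tables from SQL
# with regular expressions. The query and table names are made up for illustration.
import re

SQL = """
INSERT INTO analytics.daily_revenue
SELECT o.order_date, SUM(o.amount)
FROM raw.orders AS o
JOIN raw.currencies AS c ON o.currency_id = c.id
GROUP BY o.order_date
"""


def parse_lineage(sql: str) -> dict:
    targets = re.findall(r"(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I)
    return {"sources": sorted(set(sources)), "targets": sorted(set(targets))}


print(parse_lineage(SQL))
# {'sources': ['raw.currencies', 'raw.orders'], 'targets': ['analytics.daily_revenue']}
```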

What are data lineage best practices?

When it comes to introducing and managing data lineage, there are several best practices to keep in mind:

  • Automate data lineage extraction: Manual data lineage centered around spreadsheets is no longer an option. Capturing the dynamic nature of data in today’s business environments requires an automated solution that can keep up with the pace of data and reduce the errors associated with manual processes.
  • Bring metadata sources into data lineage: Systems that handle data, like ETL software and database management tools, all create metadata – or data about the data they handle (meta, right?). Bringing these metadata sources into data lineage is critical to gaining visibility into how data was used or changed and where it’s been throughout its lifecycle (a small extraction sketch follows this list).
  • Communicate with metadata source owners: Staying in close communication with the teams that own metadata management tools is critical. This communication allows for verification of metadata (including its timeliness and accuracy) with the teams that know it best.
  • Progressively extract metadata and lineage: Progressive extraction – or extracting metadata and lineage in the same order as it moves through systems – makes it easier to do activities like mapping relationships, connections, and dependencies across the data and systems involved.
  • Progressively validate end-to-end data lineage: Validating data lineage is important to make sure everything is running as it should. Doing this validation progressively by starting with high-level system connections, moving to connected datasets, then elements, and finishing off with transformation documentation simplifies the process and allows it to flow more logically.
  • Introduce a data catalog: Data catalog software makes it possible to collect data lineage across sources and extract metadata, allowing for end-to-end data lineage.
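As a small illustration of automated metadata extraction, the sketch below reads table and column metadata straight from a database catalog. SQLite and the example table are stand-ins; a real warehouse would expose the same information through information_schema or its own catalog views.

```python
# A minimal sketch of automated metadata extraction from a database catalog.
# SQLite and the example table are stand-ins for a real warehouse.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount REAL, order_date TEXT)")


def extract_metadata(conn: sqlite3.Connection) -> list[dict]:
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    metadata = []
    for table in tables:
        # PRAGMA table_info returns (cid, name, type, notnull, default, pk) per column.
        for _, name, col_type, *_ in conn.execute(f"PRAGMA table_info({table})"):
            metadata.append({"table": table, "column": name, "type": col_type})
    return metadata


print(extract_metadata(conn))
# [{'table': 'raw_orders', 'column': 'order_id', 'type': 'INTEGER'}, ...]
```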

What is end-to-end lineage vs. data at rest lineage?

When talking about lineage, most conversations tackle the scenario of data “in-the-warehouse,” which presumes everything is occurring in a contained data warehouse or data lake. In these cases, lineage monitors executions performed on one or more tables to extract the relationships within or among them.

At Databand, we refer to this as “data at rest lineage,” since it observes the data after it was already loaded into the warehouse.

This data at rest lineage can be troublesome for modern data organizations, which typically have a variety of stakeholders (think: data scientist, analyst, end customer), each of which has very specific outcomes they’re optimizing toward. As a result, they each have different technologies, processes, and priorities and are usually siloed from one another. Data at rest lineage that looks at data within a specific data warehouse or data lake typically doesn’t work across these silos or data integrations.

Instead, what organizations need is end-to-end data lineage, which looks at how data moves across data warehouses and data lakes to show the true, complete picture.

Consider the case of a data engineer who owns end-to-end processes within dozens of DAGs in different technologies. If that engineer encounters corrupted data, they want to know the root cause. They want to be able to proactively catch issues before they land on business dashboards and to track the health of the different sources on which they rely. Essentially, they want to be able to monitor the real flow of the data.

With this type of end-to-end lineage, they could see that a SQL query has introduced corrupted data to a column in a different table or that a DBT test failure has affected other analysts’ dashboards. In doing so, end-to-end lineage captures data in motion, resulting in a visual similar to the following:

[Image: end-to-end data lineage graph capturing data in motion]

What are the benefits of end-to-end data lineage?

Modern organizations need true end-to-end lineage because it’s no longer enough just to monitor a small part of the pipeline. While data at rest lineage is easy to integrate, it provides very low observability across the entire system.

Additionally, data at rest lineage is limited across development languages and technologies. If everything is SQL-based, that’s one thing. But the reality is, modern data teams will use a variety of languages and technologies for different needs that don’t get covered with the more siloed approach.

As if that wasn’t enough, most of the issues with data happen before it ever reaches the data warehouse, but data at rest lineage won’t capture those issues. If teams did have that visibility though, they could catch issues sooner and proactively protect business data from corruption.

End-to-end data lineage solves these challenges and delivers several notable benefits, including:

  • Clear visibility on impact: If there’s a schema change in the external API from which a Python job fetches data, teams need true end-to-end visibility to know which business dashboard will be affected. Gaining that visibility requires understanding the path of data in motion across environments and systems – something only end-to-end data lineage that tracks data in motion can provide.
  • Understanding of root cause: By the time an issue hits a table used by analysts, the problem is already well underway, stemming from further back in the data lifecycle. With data at rest lineage, it’s only possible to see what’s happening in that particular table, though – which isn’t helpful for identifying the cause of the issue. End-to-end lineage, on the other hand, can look across the complete lifecycle to provide clarity into the root cause of issues, wherever they turn up.
  • Ability to connect between pipelines and datasets: In a very complex environment where thousands of pipelines (or more!) are writing and reading data from thousands of datasets, the ability to identify which pipeline is working on a weekly, daily, or hourly basis and with which tables (or even specific columns within tables) is a true game-changer. A small graph sketch of this idea follows the list.
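A toy sketch of connecting pipelines and datasets as a graph, so “what is downstream of this table?” becomes a simple query. The pipeline and dataset names are invented, and networkx is used here purely for illustration.

```python
# A small sketch of a pipeline/dataset lineage graph and a downstream-impact query.
# Pipeline and dataset names are invented for illustration.
import networkx as nx

lineage = nx.DiGraph()
# pipeline -> dataset edges mean "writes"; dataset -> pipeline edges mean "is read by"
lineage.add_edge("ingest_sales", "lake.raw_sales")
lineage.add_edge("lake.raw_sales", "dbt_daily_revenue")
lineage.add_edge("dbt_daily_revenue", "warehouse.daily_revenue")
lineage.add_edge("warehouse.daily_revenue", "bi_dashboard_refresh")

# Everything affected if the raw sales dataset is corrupted:
print(nx.descendants(lineage, "lake.raw_sales"))
# -> the dbt job, the warehouse table, and the BI refresh that depend on it
```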

What should you look for in a data lineage tool?

As data lineage becomes increasingly important, what should you look for in a tool?

Above all else, you need a tool that can power end-to-end data lineage (vs. data at rest lineage). You also need a solution that can automate the process, as manual data lineage simply won’t cut it anymore.

With those prerequisites in mind, other capabilities to consider when evaluating a data lineage tool include:

  • Alerts: Automated alerts should allow you to not just identify that an incident has occurred but also gain context on that incident before jumping into the details. This context might include high-level details like the data pipeline experiencing an issue and the severity of the issue (a sketch of such an alert’s context follows this list).
  • View of affected datasets: The ability to see all of the datasets impacted by a particular issue in a single, bird’s-eye view is helpful for understanding the effect on operations and the severity of the issue.
  • Visual of data lineage: Visualizing data lineage by seeing a graph of relationships between the data pipeline experiencing the issue and its dependencies allows you to gain a deeper understanding of what’s happening and what’s affected as a result. The ability to click into tasks and see the dependencies and impact to each one for a given task provides even more clarity when it comes to issue resolution.
  • Debugging within tasks: Finally, the ability to see specific errors within specific tasks allows for quick debugging of issues for faster resolution.
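As a rough sketch of the kind of context an automated alert might carry, the snippet below models an incident alert as a simple data structure. The field names and example values are illustrative, not any particular tool’s schema.

```python
# A sketch of the context an automated data incident alert might carry so an
# on-call engineer can triage before opening the tool. Field names are illustrative.
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class DataIncidentAlert:
    pipeline: str                      # pipeline experiencing the issue
    severity: str                      # e.g. "critical", "warning"
    error: str                         # specific error seen in the failing task
    affected_datasets: list[str] = field(default_factory=list)
    fired_at: datetime = field(default_factory=datetime.utcnow)


alert = DataIncidentAlert(
    pipeline="daily_sales_ingestion",
    severity="critical",
    error="extract_regional_sales_to_S3 failed: schema mismatch",
    affected_datasets=["warehouse.daily_revenue", "bi.sales_dashboard"],
)
print(f"[{alert.severity}] {alert.pipeline}: {len(alert.affected_datasets)} datasets affected")
```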

Getting it right

Data lineage isn’t a new concept, but it is one that’s often misunderstood. However, as data becomes more critical to more areas of business, getting it right is increasingly important.

It requires an understanding of exactly what data lineage is and why it’s so important. Additionally, it requires a thoughtful approach to addressing data lineage that matches the needs of a modern data organization – which means true end-to-end data lineage. And finally, it requires the right tool to support this end-to-end lineage in an automated way.


What is Data Reliability and How Observability Can Help

Databand
2022-07-26 10:30:00

Data matters more than ever – we all know that. But at a time when being a data-driven business is so critical, how much can we trust data and what it tells us? That’s the question behind data reliability, which focuses on having complete and accurate data that people can trust. This article will explore everything you need to know about data reliability and the important role of data observability along the way, including:

  • What is data reliability?
  • Why is it important?
  • How do you measure it?
  • Data quality vs. data reliability: What’s the difference?
  • What is a data quality framework?
  • How can observability help improve data reliability?
  • Top data reliability testing tools

What is data reliability?

Data reliability looks at the completeness and accuracy of data, as well as its consistency across time and sources. The consistency piece is particularly important: data needs to be consistent across time and sources to be truly reliable and trustworthy.

Data reliability is one element of data quality. 

Specifically, it helps build trust in data. It’s what allows us to make data-driven decisions and take action confidently based on data. The value of that trust is why more and more companies are introducing Chief Data Officers – with the number doubling among the top publicly traded companies between 2019 and 2021, according to PwC.

Why is data reliability important?

Data reliability is so important because, without it, businesses wouldn’t be able to trust their data to power any kind of decision or action. Even worse, businesses could trust inaccurate data and make ill-informed decisions that have a negative impact.

Depending on the business and the decision, this negative impact could mean millions of wasted dollars or even risk to customers. In fact, Gartner estimates bad data costs businesses nearly $13 million annually.

As HBR puts it: “An incorrect laboratory measurement in a hospital can kill a patient. An unclear product spec can add millions of dollars in manufacturing costs. An inaccurate financial report can turn even the best investment sour. The reputational consequences of such errors can be severe.”

Additionally, once unreliable data does enter the system, the longer it goes unnoticed, the more expensive and time-consuming it can be to solve the issue. Issue resolution also diverts valuable resources away from other projects.

On the flip side, prioritizing data reliability can result in a competitive advantage by allowing teams to make more insightful decisions. In fact, research reveals that data-driven organizations are 23 times more likely to acquire customers, 6 times as likely to retain customers, and 19 times as likely to be profitable.

How do you measure data reliability?

Measuring data reliability requires looking at three core factors (a minimal sketch of such checks, in code, follows the list):

  1. Is it valid? Validity of data looks at whether or not it’s stored and formatted in the right way. This is largely a data quality check.
  2. Is it complete? Completeness of data identifies if anything is missing from the information. While data can be valid, it might still be incomplete if critical fields are not present that could change someone’s understanding of the information.
  3. Is it unique? The uniqueness of data checks for any duplicates in the data set. This uniqueness is important to avoid over-representation, which would be inaccurate.
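As a minimal illustration of these three checks, the sketch below runs them against a small pandas DataFrame. The columns, the currency format rule, and the key are invented for the example.

```python
# A minimal sketch of validity, completeness, and uniqueness checks with pandas.
# The columns, format rule, and key are invented for illustration.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [101, 102, 102, 104],
    "amount":   [25.0, 40.0, 40.0, None],
    "currency": ["USD", "USD", "USD", "usd"],
})

# 1. Validity: values stored/formatted the right way (here, upper-case ISO codes).
valid = orders["currency"].str.fullmatch(r"[A-Z]{3}").all()

# 2. Completeness: no critical fields missing.
complete = orders[["order_id", "amount"]].notna().all().all()

# 3. Uniqueness: no duplicate records over-representing the data.
unique = not orders["order_id"].duplicated().any()

print({"valid": bool(valid), "complete": bool(complete), "unique": bool(unique)})
# {'valid': False, 'complete': False, 'unique': False}
```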

To take it one step further, some teams also consider factors like:

  • If and when the data source was modified
  • What changes were made to data
  • How often the data has been updated
  • Where the data originally came from
  • How many times the data has been used

Overall, measuring data reliability is essential to not just help teams trust their data, but also to identify potential issues early on. Regular and effective data reliability assessments based on these measures can help teams quickly pinpoint issues to determine the source of the problem and take action to fix it. Doing so makes it easier to resolve issues before they become too big and ensures organizations don’t use unreliable data for an extended period of time.

Data quality vs. data reliability: What’s the difference?

All of this information raises the question: What’s the difference between data quality and data reliability?

Quite simply, data reliability is part of the bigger data quality picture. Data quality takes on a much bigger focus than reliability, looking at elements like completeness, consistency, conformity, accuracy, integrity, timeliness, continuity, availability, reproducibility, searchability, comparability, and – you guessed it – reliability.

For data engineers, there are typically four data quality dimensions that matter most:

  • Fitness: Is the data fit for its intended use? This considers accuracy and integrity throughout the data’s lifecycle.
  • Lineage: Where and when did the data come from, and where did it change? This looks at source and origin.
  • Governance: Can you control the data? This takes into account what should and shouldn’t be controllable and by whom, as well as privacy, regulations, and security.
  • Stability: Is the data complete and available at the right frequency? This includes consistency, dependability, timeliness, and bias.

Fitness, lineage, and stability all have elements of data reliability running through them. Taken as a whole, though, data quality clearly encompasses a much larger picture than data reliability.

What is a data quality framework?

A data quality framework allows organizations to define relevant data quality attributes and provide guidance for processes to continuously ensure data quality meets expectations. For example, using a data quality framework can build trust in data by ensuring what team members view is always accurate, up to date, ready on time, and consistent.

A good data quality framework is actually a cycle, which typically involves six steps largely led by data engineers:

  1. Qualify: Understand a list of requirements based on what the end consumers of the data need.
  2. Quantify: Establish quantifiable measures of data quality based on the list of requirements.
  3. Plan: Build checks on those data quality measures that can run through a data observability platform.
  4. Implement: Put the checks into practice and test that they work as expected.
  5. Manage: Confirm the checks also work against historical pipeline data and, if so, put them into production.
  6. Verify: Check with data engineers and data scientists that the work has improved performance and delivers the desired results, and check that the end consumers of the data are getting what they need.

How can observability help improve data reliability?

Data observability is about understanding the health and state of data in your system. It includes a variety of activities that go beyond just describing a problem. Data observability can help identify, troubleshoot, and resolve data issues in near real-time.

Importantly, data observability is essential to getting ahead of bad data issues, which sit at the heart of data reliability. Looking deeper, data observability encompasses activities like monitoring, alerting, tracking, comparisons, analyses, logging, and SLA tracking, all of which work together to understand end-to-end data quality – including data reliability.

When done well, data observability can help improve data reliability by making it possible to identify issues early on to respond faster, understand the extent of the impact, and restore reliability faster as a result of this insight.

Top data reliability testing tools

Understanding the importance of data reliability, how it sits within a broader data quality framework, and the importance of data observability is a critical first step. Next, taking action to invest in it requires the right technology.

With that in mind, here’s a look at the top data reliability testing tools available to data engineers. It’s also important to note that some of these solutions are often referred to as data observability tools since better observability leads to better reliability.

1) Databand

Databand is a data observability platform that helps teams monitor and control data quality by isolating and triaging issues at their source. With Databand, you can know what to expect from data by identifying trends, detecting anomalies, and visualizing data reads. This allows a team to easily alert the right people in real time about issues like missing data deliveries, unexpected data schemas, and irregular data volumes and sizes.

2) Datadog

Datadog’s observability platform provides visibility into the health and performance of each layer of your environment at a glance. It allows you to see across systems, apps, and services with customizable dashboards that support alerts, threat detection rules, and AI-powered anomaly detection.

3) Great Expectations

Great Expectations offers a shared, open standard for data quality. It makes data documentation clean and human-readable, all with the goal of helping data teams eliminate pipeline debt through data testing, documentation, and profiling.

4) New Relic

New Relic’s data observability platform offers full-stack monitoring of network infrastructure, applications, machine learning models, end-user experiences, and more, with AI assistance throughout. They also have solutions specifically geared towards AIOps observability. 

5) Bigeye

Bigeye offers a data observability platform that focuses on monitoring data, rather than data pipelines. Specifically, it monitors data freshness, volume, formats, categories, outliers, and distributions in a single dashboard. It also uses machine learning to set forecasting for alert thresholds.

6) Datafold

Datafold offers data reliability with features like regression testing, anomaly detection, and column-level lineage. They also have an open-source command-line tool and Python library to efficiently diff rows across two different databases.

In addition to these six tools, others available include PagerDuty, Monte Carlo, Cribl, Soda, and Unravel.

Make Data Reliability a Priority

The risks of bad data combined with the competitive advantages of quality data mean that data reliability must be a priority for every single business. To do so, it’s important to understand what’s involved in assessing and improving reliability (hint: it comes down in large part to data observability) and then to set clear responsibilities and goals for improvement.


IBM Acquires Databand to Extend Leadership in Observability

Databand
2022-07-06 08:14:51

Today is a big day for the Databand community! 

We’re excited to announce that Databand has been acquired by IBM to extend its leadership in observability to the full stack of capabilities for IT — across infrastructure, applications, data, and machine learning. 

This is beyond exciting news for our team, our customers, and the broader data observability market. 

Click the link to the official press release, read the transcript below, or request a demo to see Databand in action.

IBM Aims to Capture Growing Market Opportunity for Data Observability with Databand.ai Acquisition

Acquisition helps enterprises catch "bad data" at the source
Extends IBM's leadership in observability to the full stack of capabilities for IT -- across infrastructure, applications, data and machine learning

ARMONK, N.Y., July 6, 2022 /PRNewswire/ — IBM (NYSE: IBM) today announced it has acquired Databand.ai, a leading provider of data observability software that helps organizations fix issues with their data, including errors, pipeline failures and poor quality — before it impacts their bottom line. Today’s news further strengthens IBM’s software portfolio across data, AI and automation to address the full spectrum of observability and helps businesses ensure that trustworthy data is being put into the right hands of the right users at the right time.

Databand.ai is IBM’s fifth acquisition in 2022 as the company continues to bolster its hybrid cloud and AI skills and capabilities. IBM has acquired more than 25 companies since Arvind Krishna became CEO in April 2020.

As the volume of data continues to grow at an unprecedented pace, organizations are struggling to manage the health and quality of their data sets, which is necessary to make better business decisions and gain a competitive advantage. A rapidly growing market opportunity, data observability is quickly emerging as a key solution for helping data teams and engineers better understand the health of data in their system and automatically identify, troubleshoot and resolve issues, like anomalies, breaking data changes or pipeline failures, in near real-time. According to Gartner, poor data quality costs organizations an average of $12.9 million every year. To help mitigate this challenge, the data observability market is poised for strong growth.[1]

Data observability takes traditional data operations to the next level by using historical trends to compute statistics about data workloads and data pipelines directly at the source, determining if they are working, and pinpointing where any problems may exist. When combined with a full stack observability strategy, it can help IT teams quickly surface and resolve issues from infrastructure and applications to data and machine learning systems.

Databand.ai’s open and extendable approach allows data engineering teams to easily integrate and gain observability into their data infrastructure. This acquisition will unlock more resources for Databand.ai to expand its observability capabilities for broader integrations across more of the open source and commercial solutions that power the modern data stack. Enterprises will also have full flexibility in how to run Databand.ai, whether as-a-Service (SaaS) or a self-hosted software subscription.

The acquisition of Databand.ai builds on IBM’s research and development investments as well as strategic acquisitions in AI and automation. By using Databand.ai with IBM Observability by Instana APM and IBM Watson Studio, IBM is well-positioned to address the full spectrum of observability across IT operations.

For example, Databand.ai capabilities can alert data teams and engineers when the data they are using to fuel an analytics system is incomplete or missing. In common cases where data originates from an enterprise application, Instana can then help users quickly explain exactly where the missing data originated from and why an application service is failing. Together, Databand.ai and IBM Instana provide a more complete and explainable view of the entire application infrastructure and data platform system, which can help organizations prevent lost revenue and reputation.

“Our clients are data-driven enterprises who rely on high-quality, trustworthy data to power their mission-critical processes. When they don’t have access to the data they need in any given moment, their business can grind to a halt,” said Daniel Hernandez, General Manager for Data and AI, IBM. “With the addition of Databand.ai, IBM offers the most comprehensive set of observability capabilities for IT across applications, data and machine learning, and is continuing to provide our clients and partners with the technology they need to deliver trustworthy data and AI at scale.”

Data observability solutions are also a key part of an organization’s broader data strategy and architecture. The acquisition of Databand.ai further extends IBM’s existing data fabric solution  by helping ensure that the most accurate and trustworthy data is being put into the right hands at the right time – no matter where it resides.

“You can’t protect what you can’t see, and when the data platform is ineffective, everyone is impacted –including customers,” said Josh Benamram, Co-Founder and CEO, Databand.ai. “That’s why global brands such as FanDuel, Agoda and Trax Retail already rely on Databand.ai to remove bad data surprises by detecting and resolving them before they create costly business impacts. Joining IBM will help us scale our software and significantly accelerate our ability to meet the evolving needs of enterprise clients.”

Headquartered in Tel Aviv, Israel, Databand.ai employees will join IBM Data and AI, further building on IBM’s growing portfolio of Data and AI products, including its IBM Watson capabilities and IBM Cloud Pak for Data. Financial details of the deal were not disclosed. The acquisition closed on June 27, 2022.

To learn more about Databand.ai and how this acquisition enhances IBM’s data fabric solution and builds on its full stack of observability software, you can read our blog about the news or visit here: https://www.ibm.com/analytics/data-fabric.

About Databand.ai

Databand.ai is a product-driven technology company that provides a proactive data observability platform, which empowers data engineering teams to deliver reliable and trustworthy data. Databand.ai removes bad data surprises such as data incompleteness, anomalies, and breaking data changes by detecting and resolving issues before they create costly business impacts. Databand.ai’s proactive approach ties into all stages of your data pipelines, beginning with your source data, through ingestion, transformation, and data access. Databand.ai serves organizations throughout the globe, including some of the world’s largest companies in entertainment, technology, and communications. Our focus is on enabling customers to extract the maximum value from their strategic data investments. Databand.ai is backed by leading VCs Accel, Blumberg Capital, Lerer Hippeau, Differential Ventures, Ubiquity Ventures, Bessemer Venture Partners, Hyperwise, and F2. To learn more, visit www.databand.ai.

About IBM

IBM is a leading global hybrid cloud and AI, and business services provider, helping clients in more than 175 countries capitalize on insights from their data, streamline business processes, reduce costs and gain the competitive edge in their industries. Nearly 3,800 government and corporate entities in critical infrastructure areas such as financial services, telecommunications and healthcare rely on IBM’s hybrid cloud platform and Red Hat OpenShift to affect their digital transformations quickly, efficiently, and securely. IBM’s breakthrough innovations in AI, quantum computing, industry-specific cloud solutions and business services deliver open and flexible options to our clients. All of this is backed by IBM’s legendary commitment to trust, transparency, responsibility, inclusivity, and service. For more information, visit www.ibm.com.

Media Contact:
Sarah Murphy
IBM Communications
[email protected]

[1] Source: Smarter with Gartner, “How to Improve Your Data Quality,” Manasi Sakpal, July 14, 2021

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

How Databand Achieves Automated Data Lineage

Databand
2022-06-16 15:51:43

Data lineage seems to be the hot topic for data platform teams. In fact, we’re doing an upcoming webinar on how data lineage is viewed in the industry, and how a more end-to-end approach solves a lot of issues with lineage.

In this blog, we’re going to walk through how Databand provides automated data lineage so you can easily diagnose pipeline failures and analyze downstream impacts.

Watch the video to see it in action or continue reading below.

Analyze alerts

Using automated data lineage typically starts with an alert. You can jump right into a lineage graph, but it’s important to first know why the graph is relevant.  For example, on the Databand alert screen, you can see all the data incidents and their alerts in one view. 

This particular alert shows that a critical alert fired on our “daily_sales_ingestion” pipeline, a business pipeline that processes our daily sales from SAP, runs some transformations for different regions, and then sends the results over to a BI layer.

Needless to say, this pipeline is critical for our business since it processes sales from multiple regions and eventually shows the results to the business.

To diagnose the alert, select “view details,” and you are taken to an alert overview screen.

Understand impacted datasets

Before seeing the lineage graph, you can see the impact analysis across your affected datasets, pipelines, and operations. 

[Screenshot: impact analysis view]

View data lineage

Once you’ve seen what has been impacted, you can visualize these impacts by selecting the data lineage tab. This graph shows all the dependent relationships between the initial pipeline that failed and any other dependencies that are impacted.

For example, we’re looking at tasks that are writing to a particular dataset and that same dataset being read by a subsequent task. All the red text in each pipeline represents anything that was impacted by the initial failed task. 

[Screenshot: data lineage graph]

Let’s zoom in on the specific pipeline that failed. Here you can see that the task named “extract_regional_sales_to_S3” caused the pipeline to fail.

By selecting the failed task, you can see which specific downstream datasets or tasks are impacted, highlighted with a red box.

[Screenshot: zoomed-in view of the failed pipeline]

Each time you select a different task or dataset, the graph changes which boxes are highlighted.

For example, if you select the dataset named “S3 – North America Daily SAP Sales Extract,” a lot of red text still remains, but the red boxes have changed.

This indicates that the “S3 – North America Daily SAP Sales Extract” dataset only impacts the highlighted red boxes downstream.

You’ll notice that this dataset has no dependencies on a downstream pipeline in the EU or Asia, but it does have dependencies in the North America pipeline labeled “na_sentiment_impact_analysis” and the “serve_sales_results_to_bi” pipeline that serves our BI layer.

[Screenshot: impacted downstream dependencies]

Quickly debug data incidents

And to make debugging easier, you can jump directly to a task from the data lineage graph. Now you can see the error that caused the pipeline to fail. 

This allows you to quickly debug errors and resolve them before any downstream impacts occur.

Wrapping it up

For more information on how Databand can help you achieve automated data lineage, check out our demo center or book a demo.

The Impact of Bad Data and Why Observability is Now Imperative

Databand
2022-06-02 14:09:07

Think the impact of bad data is just a minor inconvenience? Think again. 

Bad data cost Unity, a publicly-traded video game software development company, $110 million.

And that’s only the tip of the iceberg.

The Impact of Bad Data: A Case Study on Unity

Unity stock dropped 37% on May 11, 2022, after the company announced its first-quarter earnings, despite strong revenue growth, decent margins, good customer growth, and continued high performance in dollar-based net expansion. 

But there was one data point in Unity’s earnings that was not as positive.

The company also shared that its Operate revenue growth was still up but had slowed due to a fault in its platform that reduced the accuracy of its Audience Pinpointer tool.

The fault in Unity’s platform? Bad data.

Unity ingested bad data from a large customer into its machine learning algorithm, which helps place ads and allows users to monetize their games. This not only resulted in decreased growth but also ruined the algorithm, forcing the company to fix it to remedy the problem going forward.

The company’s management estimated the impact on the business at approximately $110 million in 2022.

Unity Isn’t Alone: The Impact of Bad Data is Everywhere

Unity isn’t the only company that has felt the impact of bad data deeply.

Take Twitter.

On April 25, 2022, Twitter accepted a deal to be purchased by Tesla and SpaceX founder Elon Musk. A mere 18 days later, Musk shared that the deal was “on hold” while he sought to confirm the number of fake accounts and bots on the platform.

What ensued demonstrates the deep impact of bad data on this extremely high-profile deal for one of the world’s most widely-used speech platforms. Notably, Twitter has battled this data problem for years. In 2017, Twitter admitted to overstating its user base for several years, and in 2016 a troll farm used more than 50,000 bots to try to sway the US presidential election. Twitter first acknowledged fake accounts during its 2013 IPO.

Now, this data issue is coming to a head, with Musk investigating Twitter’s claim that fake accounts represent less than 5% of the company’s user base and angling to reduce the previously agreed upon purchase price as a result.

Twitter, like Unity, is another high-profile example of the impact of bad data, but examples like this are everywhere – and they cost companies millions of dollars.

Gartner estimates that bad data costs companies nearly $13 million per year, although many don’t even realize the extent of the impact. Meanwhile, Harvard Business Review finds that knowledge workers spend about half of their time fixing data issues. Just imagine how much effort they could devote elsewhere if issues weren’t so prevalent.

Overall, bad data can lead to missed revenue opportunities, inefficient operations, and poor customer experiences, among other issues that add up to that multi-million dollar price tag.

Why Observability is Now Imperative for the C-Suite

The fact that bad data costs companies millions of dollars each year is bad enough. The fact that many companies don’t even realize this because they don’t measure the impact is potentially even worse. After all, how can you ever fix something of which you’re not fully aware?

Getting ahead of bad data issues requires data observability, which encompasses the ability to understand the health of data in your systems. Data observability is the only way that organizations can truly understand not only the impact of any bad data but also the causes of it – both of which are imperative to fixing the situation and stemming the impact.

It’s also important to embed data observability at every point possible with the goal of finding issues sooner in the pipeline rather than later because the further those issues progress, the more difficult (and more expensive) they become to fix.

Critically, this observability must be an imperative for C-suite leaders, as bad data can have a serious impact on company revenue (just ask Unity and Twitter). Making data observability a priority for the C-suite will help the entire organization – not just data teams – rally around this all-important initiative and make sure it becomes everyone’s responsibility.

This focus on end-to-end data observability can ultimately help:

  • Identify data issues earlier on in the data pipeline to stem their impact on other areas of the platform and/or business
  • Pinpoint data issues more quickly after they pop up to help arrive at solutions faster
  • Understand the extent of data issues that exist to get a complete picture of the business impact

In turn, this visibility can help companies recover more revenue faster by taking the necessary steps to mitigate bad data. Hopefully, the end result is a fix before the issues end up costing millions of dollars. And the only way to make that happen is if everyone, starting with the C-suite, prioritizes data observability.


What is Dark Data and How it Causes Data Quality Issues

Databand
2022-05-31 17:11:25

We’re all guilty of holding onto something that we’ll never use. Whether it’s old pictures on our phones, items around the house, or documents at work, there’s always that glimmer of thought that we just might need it one day.

It turns out businesses are no different. But in the business setting, it’s not called hoarding, it’s called dark data.

Simply put, dark data is any data that an organization acquires and stores during regular business activities that doesn’t actually get used in any way. No one analyzes it to gain insights, drive decisions, or make money – it just sits there.

Unfortunately, dark data can prove quite troublesome, causing a host of data quality issues. But it doesn’t have to be all bad. This article will explore what you need to know about dark data, including:

  • What is dark data
  • Why dark data is troublesome
  • How dark data causes data quality issues
  • The upside of dark data
  • Top tips to shine the light on dark data

What is dark data?

According to Gartner, dark data is “the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships, and direct monetizing). Storing and securing data typically incurs more expense (and sometimes greater risk) than value.”

And most companies have a lot of dark data. Carnegie Mellon University finds that about 90% of most organizations’ data is dark data.

How did this happen? A lot of organizations operate in silos, which can easily lead to situations in which one department could make use of data that another department captures, but the first department doesn’t even know that data is getting captured (and therefore isn’t using it).

We also got here because not too long ago we had the idea that it’s valuable to store all the information we could possibly capture in a big data lake. As data became more and more valuable, we thought maybe one day that data would be important – so we should hold onto it. Plus, data storage is cheap, so it was okay if it sat there totally unused. 

But maybe it’s not as good an idea as we once thought.

Why is dark data troublesome?

If the data could be valuable one day and data storage is cheap, what’s the big issue with it? There are three problems, for starters:

1) Liability

Often with dark data, companies don’t even know exactly what type of data they’re storing. And they could very well (and often do) have personally identifiable information sitting there without even realizing it. This could come from any number of places, such as transcripts from audio conversations with customers or data shared online. But regardless of the source, storing this data is a liability. 

A host of global privacy laws have been introduced over the past several years, and they apply to all data – even data that’s sitting unused in analytics repositories. As a result, it’s risky for companies to store this data (even if they’re not using it) because there’s a big liability if anyone accesses that information.

2) Accumulated costs

Data storage at the individual level might be cheap, but as companies continue to collect and store more and more data over time, those costs add up. Some studies show companies spend anywhere from $10,000 to $50,000 in storage just for dark data alone.

Getting rid of that data that’s not used for any purpose could then lead to significant cost savings. Savings that can be re-allocated to any number of more constructive (and less troublesome) purposes.

3) Opportunity costs

Finally, many companies are losing out on opportunities by not using this data. So while it’s good to get rid of data that’s actually not usable – due to risks and costs – it pays to first analyze what data is available.

In taking a closer look at their dark data, many companies may very well find that they can better manage and use that data to drive some interesting (and valuable!) insights about their customers or their own internal metrics. Hey, it’s worth a look.

How dark data causes data quality issues

Interestingly enough, sometimes dark data gets created because of data quality issues. Maybe it’s because incomplete or inaccurate data comes in, and therefore teams know they won’t use it for anything.

For example, perhaps it’s a transcript from an audio recording, but the AI that creates the transcript isn’t quite there yet and the result is rife with errors. Someone keeps the transcript though, thinking that they’ll resolve it at some point. This is an example of how data quality issues can create dark data.

In this way, dark data can often be used to understand the sources of bad data quality and their effects. Far too often, organizations aim to clean poor-quality data but miss what’s causing the issue. And without that understanding, it’s impossible to stop the data quality issue from recurring.

When this happens, the situation becomes very cyclical, because rather than simply purging dark data that sits around without ever getting used, organizations let it continue to sit – and that contributes to growing data quality issues.

Fortunately, there are three steps for data quality management that organizations can take to help alleviate this issue:

  1. Analyze and identify the “as is” situation, including the current issues, existing data standards, and the business impact in order to prioritize the issue.
  2. Prevent bad data from recurring by evaluating the root cause of the issues and applying resources to tackle that problem in a sustainable way.
  3. Communicate often along the way, sharing what’s happening, what the team is doing, the impact of that work, and how those efforts connect to business goals.

The upside of dark data

But for all the data quality issues that dark data can (and, let’s be honest, does) cause, it’s not all bad. As Splunk puts it, “dark data may be one of an organization’s biggest untapped resources.”

Specifically, as data remains an extremely valuable asset, organizations must learn how to use everything they have to their advantage. In other words, that nagging thought that the data just might be useful one day could actually be true. Of course, that’s only the case if organizations actually know what to do with that data… otherwise it will continue to sit around and cause data quality issues.

The key to getting value out of dark data? Shining the light on it by breaking down silos, introducing tighter data management, and, in some cases, not being afraid to let data go.

Top tips to shine the light on dark data

When it comes to handling dark data and potentially using it to your organization’s advantage, there are several best practices to follow:

  1. Break down silos: Remember earlier when we said that dark data often comes about because of silos across teams? One team creates data that could be useful to another, but that other team doesn’t know about it. Breaking down those silos instantly makes that data available to the team that needs it, and suddenly it goes from sitting around to providing immense value.
  2. Improve data management: Next, it’s important to really get a handle on what data exists. This starts by classifying all data within the organization to get a complete and accurate view. From there, teams can begin to organize data better with the goal of making it easier for individuals across teams to find and use what they need.
  3. Introduce a data governance policy: Finally, introducing a data governance policy can help address the challenge long term. This policy should cover how all incoming data gets reviewed and offer clear guidelines for what should be retained (and, if so, how it should be organized to maintain clear data management), archived, or destroyed. An important part of this policy is being strict about what data should be destroyed. Enforcing that policy and regularly reviewing practices can help eliminate dark data that will never really be used.

It’s time to solve the dark data challenge and restore data quality

Dark data is a very real problem. Far too many organizations hold onto data that never gets used, and while it might not seem like a big deal, it is. It can create liabilities, significant storage costs, and data quality issues. It can also lead to missed opportunities due to teams not realizing what data is potentially available to them.

Taking a proactive approach to managing this data can turn the situation around. By shining the light on dark data, organizations can not only reduce liabilities and costs, but also give teams the resources they need to better access data and understand what’s worth saving and what’s not. And doing so will also improve data quality. It’s a no-brainer.

The Data Value Chain: Data Observability’s Missing Link

Databand
2022-05-18 13:18:01

Data observability is an exploding category. It seems like there is always news of another data observability tool receiving funding, an existing tool announcing expanded functionality, or a new product in the category being dreamt up. After a bit of poking around, you’ll notice that many of them claim to do the same thing: end-to-end data observability. But what does that really mean, and what is a data value chain?

For data analysts, end-to-end data observability feels like having monitoring capabilities for their warehouse tables  — and if they’re lucky, they have some monitoring for the pipelines that move the data to and from their warehouse as well.

The story is a lot more complicated for many other organizations that are more heavily skewed towards data engineering. For them, that isn’t end-to-end data observability. That’s “The End” data observability. Meaning: this level of observability only gives visibility into the very end of the data’s lifecycle. This is where the data value chain becomes an important concept.

For many data products, data quality is determined from the very beginning; when data is first extracted and enters your system. Therefore, shifting data observability left of the warehouse is the best way to move your data operations out of a reactive data quality management framework, to a proactive one.


What is the Data Value Chain?

When people think of data, they often think of it as a static object; a point on a chart, a number in a dashboard, or a value in a table. But the truth is data is constantly changing and transforming throughout its lifecycle. And that means what you define as “good data quality” is different for each stage of that lifecycle.

“Good” data quality in a warehouse might be defined by its uptime. Going to the preceding stage in the life cycle, that definition changes. Data quality might be defined by its freshness and format. Therefore, your data’s quality isn’t some static binary. It’s highly dependent on whether things went as expected in the preceding step of its lifecycle.

Shani Keynan, our Product Director, calls this concept the data value chain.

“From the time data is ingested, it’s moving and transforming. So, only looking at the data tables in your warehouse or your data’s source, or only looking at your data pipelines, it just doesn’t make a lot of sense. Looking only at one of those, you don’t have any context.

You need to look at the data’s entire journey. The thing is, when you’re a data-intensive company who’s using lots of external APIs and data sources, that’s a large part of the journey. The more external sources you have, the more vulnerable you are to changes you can’t predict or control. Covering the hard ground first, at the data’s extraction, makes it easier to catch and resolve problems faster since everything downstream depends on those deliveries.”

The question of whether data will drive value for your business is defined by a series of if-then statements (sketched in code after the list):

  1. If data has been ingested correctly from our data sources, then our data will be delivered to our lake as expected.
  2. If data is delivered & grouped in our lake as expected, then our data will be able to be aggregated & delivered to our data warehouse as expected.
  3. If data is aggregated & delivered to our data warehouse as expected, then the data in our warehouse can be transformed.
  4. If data in our warehouse can be transformed correctly, then our data will be able to be queried and will provide value for the business.
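A toy sketch of this chain as gated stages: each stage runs only if the checkpoint guarding it passes, so a failure upstream stops bad data from propagating. The stage names and checks are invented for illustration.

```python
# A toy sketch of the if-then value chain: each stage runs only when the
# checkpoint guarding it passes. Stage names and checks are illustrative.
from typing import Callable


def run_value_chain(stages: list[tuple[str, Callable[[], bool]]]) -> None:
    for name, checkpoint in stages:
        if not checkpoint():
            print(f"Checkpoint failed at '{name}'; halting downstream stages.")
            return
        print(f"'{name}' passed; continuing.")


run_value_chain([
    ("ingestion from sources", lambda: True),
    ("delivery to the lake", lambda: True),
    ("aggregation into the warehouse", lambda: False),  # simulate a failure here
    ("transformation and querying", lambda: True),      # never reached
])
```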

Let us be clear: this is an oversimplification of the data’s life cycle. That said, it illustrates how having observability only for the tables in your warehouse & the downstream pipelines leaves you in a position of blind faith.

In the ideal world, you would be able to set up monitoring capabilities & data health checkpoints everywhere in your system. This is no small project for most data-intensive organizations; some would even argue it’s impractical.

Realistically, one of the best places to start your observability initiative is at the beginning of the data value chain; at the data extraction layer.

Data Value Chain + Shift-left Data Observability

If you are one of these data-driven organizations, how do you set your data team up for success?

While it’s important to have observability of the critical “checkpoints” within your system, the most important checkpoint you can have is at the data collection process. There are two reasons for that:

#1 – Ingesting data from external sources is one of the most vulnerable stages in your data model.

As a data engineer, you have some degree of control over your data & your architecture. But what you don’t control is your external data sources. When you have a data product that depends on external data arriving on time in order to function, a late or broken delivery is an extremely painful experience.

This is best highlighted in an example. Let’s say you are running a large real estate platform called Willow. Willow is a marketplace where users can search for homes and apartments to buy & rent across the United States.

Willow’s goal is to give users all the information they need to make a buying decision; things like listing price, walkability scores, square footage, traffic scores, crime & safety ratings, school system ratings, etc.

In order to calculate “Traffic Score” for just one state in the US, Willow might need to ingest data from 3 external data sources. There are 50 states, so that means you suddenly have 150 external data sources you need to manage. And that’s just for one of your metrics.

Here’s where the pain comes in: You don’t control these sources. You don’t get a say whether they decide to change their API to better fit their data model. You don’t get to decide whether they drop a column from your dataset. You can’t control if they miss one of their data deliveries and leave you hanging.

All of these factors put your carefully crafted data model at risk. All of them can break the downstream pipelines that follow strictly coded logic. And there’s really nothing you can do about it except catch it as early as you can.

Having data observability only in your data warehouse doesn’t do much to solve this problem. It might alert you that there is bad data in your warehouse, but by that point, it’s already too late.

This brings us to our next point…

#2 – It makes the most sense for your operational flow.

In many large data organizations, data in the warehouse is automatically consumed by business processes. If something breaks your data collection, bad data gets populated into your product dashboards and analytics, and the people being served that data have no way of knowing it’s no good.

This can lead to tangible losses. Imagine a problem in calculating a Comparative Analysis of home sale prices in an area: users may lose trust in your data and stop using your product.

In this situation, what does your operational flow for incident management look like?

You receive complaints from business stakeholders or customers, then have to invest a lot of engineering hours to perform root cause analysis, fix the issue, and backfill the data. All the while, consumer trust has gone down and SLAs have already been missed. DataOps is in a reactive position.

incident management

When you have data observability at your ingestion layer, there’s still a problem, but the way DataOps can handle it is very different:

  • You know that there will be a problem.     
  • You know exactly which data source is causing the problem.
  • You can project how this will affect downstream processes. You can make sure everyone downstream knows that there will be a problem so you can prevent the bad data from being used in the first place.
  • Most importantly, you can get started resolving the problem early & begin working on a way to prevent that from happening again.

You cannot achieve that level of prevention when your data observability starts at your warehouse.

Bottom Line: Time To Shift Left

DataOps is learning many of the same hard lessons DevOps already has. Just as application observability is most effective when shifted left, the same applies to data operations: it saves money, it saves time, and it saves headaches. If you’re ingesting data from many external data sources, your organization cannot afford to focus all its efforts on the warehouse. You need real end-to-end data observability. And luckily, there’s a great data observability platform made to do just that.

data observability



Data Replication: The Basics, Risks, and Best Practices

Databand
2022-04-27 13:20:30

Data-driven organizations are poised for success. They can make more efficient and accurate decisions and their employees are not impeded by organizational silos or lack of information. Data replication enables leveraging data to its full extent. But how can organizations maximize the potential of data replication and make sure it helps them meet their goals? Read on for all the answers.

What is Data Replication?

Data replication is the process of copying or replicating data from the main organizational server or cloud instance to other cloud or on-premises instances at different locations. Thanks to data replication, organizational users can access the data they need for their work quickly and easily, wherever they are in the world. In addition, data replication ensures organizations have backups of their data, which is essential in case of an outage or disaster. In other words, data replication creates data availability at low latency.

Data replication can take place either synchronously or asynchronously. Synchronous replication means data is copied to the main server and all replica servers at the same time. Asynchronous replication means data is first written to the main server and only then copied to the replica servers, often at scheduled intervals.
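
As a rough illustration of the difference (a toy in-memory sketch, not how a production database implements replication, which is handled by the database engine or replication tooling), a synchronous write updates the primary and the replica together, while an asynchronous write queues the change to be applied later:

```python
import queue

# Toy in-memory "servers" for illustration only.
primary: dict = {}
replica: dict = {}
pending_changes: queue.Queue = queue.Queue()


def write_synchronous(key, value):
    # Synchronous: primary and replica are updated together,
    # so the caller only returns once both copies match.
    primary[key] = value
    replica[key] = value


def write_asynchronous(key, value):
    # Asynchronous: the primary is updated immediately; the change is queued
    # and applied to the replica later, e.g. on a scheduled interval.
    primary[key] = value
    pending_changes.put((key, value))


def apply_pending_changes():
    # Run periodically by a background job or scheduler.
    while not pending_changes.empty():
        key, value = pending_changes.get()
        replica[key] = value
```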

Why Data Replication is Necessary

Data replication ensures that organizational data is always available to all stakeholders. By replicating data across instances, organizations can ensure:

Scalability

Data scalability is the ability to handle changing demands by continuously adapting resources. Replication of data across multiple servers builds scalability and ensures the availability of consistent data to all users at all times.

Disaster Protection

Electrical outages, cybersecurity attacks and natural disasters can cause systems and instances to crash and no longer be available. By replicating data across multiple instances, data is backed up and always accessible to any stakeholder. This ensures system robustness, organizational reliability and security.

Speed / Latency

Data that has to travel across the globe creates latency. This creates a poor user experience, which can be felt especially in real-time based applications like gaming or recommendation systems, or resource-heavy systems like design tools. By distributing the data globally it travels a shorter distance to the end user, which results in increased speed and performance.

Test System Performance

Distributing and synchronizing data across multiple test systems makes that data more accessible to them, which improves test system performance.

An Example of Data Replication

Organizations that have multiple branch offices across a number of continents can benefit from data replication. If organizational data only resides on servers in Europe, users in Asia, North America and South America will experience latency when attempting to read the data. But by replicating data across instances in San Francisco, São Paulo, New York, London, Berlin, Prague, Tel Aviv, Hyderabad, Singapore and Melbourne, for example, access times improve significantly for all users.

Data Replication Variations

Types of Data Replication

Replication systems vary. Therefore, it is important to distinguish which type is a good fit for your organizational infrastructure needs and business goals. There are three main types of data replication systems:

Transactional Replication

Transactional replication consists of databases being copied in their entirety from the primary server (the publisher) and sent to secondary servers (subscribers). Any data changes are consistently and continuously updated. Transactional consistency is ensured, meaning data is replicated in real time and sent from the primary server to secondary servers in the order in which changes occur. As a result, transactional replication makes it easy to track changes and any lost data. This type of replication is commonly used in server-to-server environments.

Snapshot Replication

In the snapshot replication type, a snapshot of the database is distributed from the primary server to the secondary servers. Instead of continuous updates, data is sent as it exists at the time of the snapshot. It is recommended to use this type of replication when there are not many data changes or at the initial synchronization between the publisher and subscriber.

Merge Replication

A merge replication consists of two databases being combined into a single database. As a result, any changes to data can be updated from the publisher to the subscribers. This is a complex type of replication since both parties (the primary server and the secondary servers) can make changes to the data. It is recommended to use this type of replication in a server-to-client environment.

Comparison Table: Transactional Replication vs. Snapshot replication vs. Merge Replication

Data Replication Table

Schemes of Replication

Replication schemes are the operations and tasks required to perform replication. There are three main replication schemes organizations can choose from:

Full Replication

Full replication copies the entire database to every site in the distributed system. This scheme improves data availability and accessibility through database redundancy. In addition, performance is improved because global distribution of data reduces latency and accelerates query execution. On the other hand, it is difficult to achieve concurrency, and update processes are slow.

Data Replication - Full

Partial Replication

In a partial replication scheme, some sections of the database are replicated across some or all of the sites. The description of these fragments can be found in the replication schema. Partial replication makes it possible to prioritize which data is important enough to replicate and to distribute resources according to where the data is actually needed.

Data Replication - Partial

No Replication

In this scheme, data is stored on one site only. This enables easily recovering data and achieving concurrency. On the other hand, it negatively impacts availability and performance.

No Data Replication

Techniques of Replication

Replicating data can take place through different techniques. These include:

Full-table Replication

In a full-table replication, all data is copied from the source to the destination. This includes new data, as well as existing data. It is recommended to use this technique if records are regularly deleted or if other techniques are technically impossible. On the other hand, this technique requires more processing and network resources and the cost is higher.

Key-based Replication

In key-based replication, only data that has been added since the previous update is copied, using a replication key. This technique is more efficient since fewer rows are copied. On the other hand, it cannot replicate data from a previous update that has since been hard-deleted.
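
Here is a minimal sketch of key-based replication, assuming a monotonically increasing `id` column acts as the replication key; the `orders` table and its columns are purely illustrative:

```python
import sqlite3

# Key-based replication sketch: copy only rows added since the last run, using a
# monotonically increasing "id" column as the replication key.

def replicate_new_rows(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    last_key = target.execute("SELECT COALESCE(MAX(id), 0) FROM orders").fetchone()[0]
    new_rows = source.execute(
        "SELECT id, customer, amount FROM orders WHERE id > ?", (last_key,)
    ).fetchall()
    target.executemany(
        "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)", new_rows
    )
    target.commit()
    # Note: rows hard-deleted on the source after the last run are never removed here.
    return len(new_rows)
```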

Log-based Replication

Log-based replication replicates changes to the database by reading the database log file. It applies only to database sources and has to be supported by the source database. This technique is recommended when the source database structure is static; otherwise it can become a very resource-intensive process.

Cloud Migration + Data Replication

When organizations digitally transform their infrastructure and migrate to the cloud, data can be replicated to cloud instances. By replicating data to the cloud, organizations can enjoy its benefits: scalability, global accessibility, data availability and easier maintenance. This means organizational users benefit from data that is more accessible, usable and reliable, which eliminates internal silos and increases business agility.

Data Risks in the Replication Process

When replicating data to the cloud, it is important to monitor the process. The growing complexity of data systems as well as the increased physical distance between servers within a system could pose some risks.

These risks include:

Inconsistency

Data schema and data profiling anomalies, like null counts, type changes and skew.

Data Loss

Data failing to be fully migrated from the sources to the target instances.

Delays

Data not being successfully migrated on time.

Data Replication Management + Observability

By implementing a management system to oversee and monitor the replication process, organizations can significantly reduce the risks involved in data replication. A data observability platform will ensure:

  • Data is successfully replicated to other instances, including cloud instances
  • Replication and migration pipelines are performing as expected
  • Alerts are raised on broken pipelines or irregular data volumes so they can be fixed
  • Data is delivered on time 
  • Delivered data is reliable, so organizational stakeholders can use it for analytics

Monitoring

By monitoring the data pipelines that take part in the replication process, organizations and their DataOps engineer can ensure the data propagated through the pipeline is accurate, complete and reliable. This ensures data replicated to all instances can be reliably used by stakeholders. An effective monitoring system will be:

  • Granular – specifically indicating where the issue is
  • Persistent – following lineage to understand where errors began
  • Automated – reducing manual errors and enabling the use of thresholds
  • Ubiquitous – covering the pipeline end-to-end
  • Timely – enabling catching errors on time before they have an impact

Learn more about data monitoring here.

Tracking

Tracking pipelines enables systematic troubleshooting, so that any errors are identified and fixed on time. This ensures users constantly benefit from updated, reliable and healthy data in their analyses. There are various types of metadata that can be tracked, like task duration, task status, when data was updated, and more. By tracking and alerting (see below) in case of anomalies, DataOps engineers ensure data health.

Alerting

Alerting on data and pipeline anomalies is an essential step that closes the observability loop. Alerting DataOps engineers gives them the opportunity to fix any data health issues that might affect data replication across various instances.

Within existing data systems, data engineers can trigger alerts for:

  • Missed data deliveries
  • Schema changes that are unexpected
  • SLA misses
  • Anomalies in column-level statistics like nulls and distributions
  • Irregular data volumes and sizes
  • Pipeline failures, inefficiencies, and errors

By proactively setting up alerts and monitoring them through dashboards and other tools of your choice (Slack, Pagerduty, etc.), organizations can truly maximize the potential of data replication for their business.
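
As a small illustration of that last step, here is a hedged sketch of a volume alert pushed to a Slack incoming webhook; the webhook URL, dataset name, and thresholds are placeholders you would replace with your own:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


def alert_if_volume_irregular(dataset: str, row_count: int, low: int, high: int) -> None:
    """Post a Slack message when a replicated dataset's volume falls outside the expected range."""
    if low <= row_count <= high:
        return
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Irregular volume for {dataset}: {row_count} rows (expected {low}-{high})."},
        timeout=10,
    )


# alert_if_volume_irregular("replica.orders", row_count=120, low=10_000, high=50_000)
```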

Conclusion

Data replication holds great promise for organizations. By replicating data to multiple instances, they can ensure data availability and improved performance, as well as internal “insurance” in case of a disaster. This page covers the basics for any business or data engineer getting started with data replication: the variations, schemes and techniques, as well as more advanced content for monitoring the process to gain observability and reduce the potential risk.

Wherever you are on your data replication journey, we recommend auditing your pipelines to ensure data health. If you need help finding and fixing data health issues fast, click here.

The Top Data Quality Metrics You Need to Know (With Examples)

Databand
2022-04-20 14:17:41

Data quality metrics can be a touchy subject, especially in the context of data observability.

A quick Google search will show that data quality metrics involve all sorts of categories. 

For example, completeness, consistency, conformity, accuracy, integrity, timeliness, continuity, availability, reliability, reproducibility, searchability, comparability, and probably ten other categories I forgot to mention all relate to data quality. 

So what are the right metrics to track? Well, we’re glad you asked. 🙂 

We’ve compiled a list of the top data quality metrics that you can use to measure the quality of the data in your environment. Plus, we’ve added a few screenshots that highlight each data quality metric you can view in Databand’s observability platform.

Take a look and let us know what other metrics you think we need to add!

Collection Data Quality Metrics

The Top 9 Data Quality Metrics

Metric 1: # of Nulls in Different Columns 

Who’s it for? 

  • Data engineers
  • Data analysts

How to track it? 

Calculate the number of nulls, non-null counts, and null percentages per column so users can set an alert on those metrics.
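
For illustration, a minimal pandas sketch of these per-column metrics (independent of any specific platform; the 5% threshold is an assumption) might look like this:

```python
import pandas as pd


def null_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column null count, non-null count, and null percentage."""
    return pd.DataFrame({
        "null_count": df.isna().sum(),
        "non_null_count": df.notna().sum(),
        "null_pct": (df.isna().mean() * 100).round(2),
    })


# metrics = null_metrics(batch_df)
# breached = metrics[metrics["null_pct"] > 5.0]   # hypothetical 5% alert threshold
```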

Why it’s important?

Since a null is the absence of value, you want to be aware of any nulls that pass through your data workflows. 

For example, downstream processes might be damaged if the data used is now “null” instead of actual data.

Dropped columns

The values of a column might be “dropped” by mistake when the data processes are not performing as expected. 

This might cause the entire column to disappear, which would make the issue easier to see. But sometimes, all of its values will be null.

Data drift

The data of a column might slowly drift into “nullness.” 

This is more difficult to detect than the above since the change is more gradual. Monitoring anomalies in the percentage of nulls across different columns should make it easier to see.

What’s it look like?

Data Quality Metrics Null Count

Metric 2: Frequency of Schema Changes

Who’s it for?

  • Data engineers
  • Data scientists
  • Data analysts

How to track it? 

Tracking all changes in the schema for all the datasets related to a certain job.
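
A simple way to picture this is a diff between two schema snapshots taken on consecutive runs; the snapshot format and example values below are hypothetical:

```python
def schema_diff(previous: dict, current: dict) -> dict:
    """Compare two {column_name: column_type} snapshots taken on consecutive runs."""
    return {
        "added_columns": sorted(set(current) - set(previous)),
        "removed_columns": sorted(set(previous) - set(current)),
        "type_changes": sorted(
            (col, previous[col], current[col])
            for col in set(previous) & set(current)
            if previous[col] != current[col]
        ),
    }


# schema_diff({"id": "int", "price": "float"},
#             {"id": "int", "price": "str", "city": "str"})
# -> {'added_columns': ['city'], 'removed_columns': [], 'type_changes': [('price', 'float', 'str')]}
```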

Why it’s important?

Schema changes are key signals of bad quality data. 

In a healthy situation, schema changes are communicated in advance and are not frequent since many processes rely on the number of columns and their type in each table to be stable. 

Frequent changes might indicate an unreliable data source and problematic DataOps practices, resulting in downstream data issues.

Examples of changes in the schema can be: 

  • Column type changes
  • New columns 
  • Removed columns

Go beyond having a good understanding of what changed in the schema and evaluate the effect this change will have on downstream pipelines and datasets.

What’s it look like?

Data Quality Metrics Schema change
Data Quality Metrics Alert

Metric 3: Data Lineage, Affected Processes Downstream

Who’s it for? 

  • Data engineers
  • Data analysts

How to track it? 

Track the data lineage to assets that appear downstream from a dataset with an issue. This includes datasets and pipelines that consume the upstream dataset’s data.
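
Conceptually, this boils down to walking the lineage graph downstream from the dataset with the issue. A toy sketch, with purely illustrative asset names:

```python
from collections import deque

# Toy lineage graph: each asset maps to the assets that consume it directly.
LINEAGE = {
    "raw.listings": ["staging.listings_clean"],
    "staging.listings_clean": ["marts.traffic_score", "marts.price_index"],
    "marts.traffic_score": ["dashboard.home_search"],
    "marts.price_index": [],
    "dashboard.home_search": [],
}


def downstream_assets(start: str) -> set:
    """Breadth-first walk to find every asset affected downstream of `start`."""
    affected, to_visit = set(), deque([start])
    while to_visit:
        for consumer in LINEAGE.get(to_visit.popleft(), []):
            if consumer not in affected:
                affected.add(consumer)
                to_visit.append(consumer)
    return affected


# downstream_assets("raw.listings") returns every dataset and dashboard the issue can reach.
```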

Why it’s important?

The more damaged data assets (datasets or pipelines) there are downstream, the bigger the issue’s impact. This metric helps the data engineer understand the severity of the issue and how quickly it should be fixed.

It is also an important metric for data analysts because most downstream datasets make up their company’s BI reports.

What’s it look like?

Data Quality Metrics Lineage

Metric 4: # of Pipeline Failures 

Who’s it for? 

  • Data engineers
  • Data executives

How to track it? 

Track the number of failed pipelines over time. 

Use tooling to understand why the pipeline failed: root cause analysis through the error widget and logs, plus the ability to dive into all the tasks the DAG contains.
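
For example, assuming an Airflow 2.x deployment with the stable REST API and basic authentication enabled (the host, credentials, and DAG id below are placeholders), the failure count for a single DAG can be pulled like this:

```python
import requests

AIRFLOW_URL = "http://localhost:8080"   # placeholder
AUTH = ("airflow", "airflow")           # placeholder basic-auth credentials


def failed_run_count(dag_id: str) -> int:
    """Count failed runs of one DAG via Airflow's stable REST API."""
    resp = requests.get(
        f"{AIRFLOW_URL}/api/v1/dags/{dag_id}/dagRuns",
        params={"state": "failed", "limit": 1},
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["total_entries"]


# failed_run_count("daily_listings_etl")  # chart this number over time, per pipeline
```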

Why it’s important?

The more pipelines fail, the more data health issues you’ll have.

Each pipeline failure causes issues like missing data operations, schema changes, and data freshness issues.

If you’re experiencing many failures, this indicates severe problems at the root that need to be addressed.

What’s it look like?

Data Quality Metrics Error widget, pipeline, tasks

Metric 5: Pipeline Duration

Who’s it for? 

  • Data engineers

How to track it? 

The team can track this with the Airflow syncer, which reports on the total duration of a DAG run, or by using our tracking context as part of the Databand SDK.

Why it’s important?

Pipelines that work in complex data processes are usually expected to have similar duration across different runs. 

In these complex environments, downstream pipelines depend on upstream pipelines processing the data within certain SLAs.

The effect of extreme changes in a pipeline’s duration can range from the processing of stale data to outright failure of downstream processes.
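
One simple way to flag such changes, shown here as a generic sketch rather than how any specific platform does it, is to compare the latest run against recent history:

```python
from statistics import mean, stdev


def duration_is_anomalous(history_seconds: list, current_seconds: float, z_threshold: float = 3.0) -> bool:
    """Flag a run whose duration deviates sharply from recent runs of the same pipeline."""
    if len(history_seconds) < 5:          # not enough history to judge
        return False
    mu, sigma = mean(history_seconds), stdev(history_seconds)
    if sigma == 0:
        return current_seconds != mu
    return abs(current_seconds - mu) / sigma > z_threshold


# duration_is_anomalous([610, 595, 630, 600, 620], current_seconds=1900)  -> True
```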

What’s it look like?

Data Quality Metrics Pipeline duration

Metric 6: Missing Data Operations

Who’s it for? 

  • Data engineers
  • Data scientists
  • Data analysts
  • Data executives

How to track it? 

Tracking all the operations related to a particular dataset.

A data operation is a combination of a task in a specific pipeline that reads or writes to a table. 

Why it’s important?

When a certain data operation is missing, a chain of issues is triggered in your data stack. It can cause pipeline failures, schema changes, and delays.

Also, the downstream consumers of this data will be affected by the data that didn’t arrive.  

A few examples include: 

  • The data analyst who is using this data for analysis 
  • The ML models used by the data scientist
  • The data engineers in charge of the data.

What’s it look like?

Data Quality Metrics Missing dataset
Data Quality Metrics dbnd alert

Metric 7: Record Count in a Run

Who’s it for? 

Data engineers, data analysts

How to track it? 

Track the number of rows written to a dataset.

Why it’s important?

A sudden change in the expected number of table rows signals that too much (or too little) data is being written. 

Using anomaly detection on the number of rows written in a run is a good way of checking that nothing suspicious has happened.
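
A minimal sketch of such a check, comparing the latest run against the recent median with an assumed 50% tolerance:

```python
from statistics import median


def row_count_is_suspicious(recent_counts: list, current_count: int, tolerance: float = 0.5) -> bool:
    """Flag a run whose written row count drifts more than `tolerance` from the recent median."""
    if not recent_counts:
        return False
    baseline = median(recent_counts)
    if baseline == 0:
        return current_count != 0
    return abs(current_count - baseline) / baseline > tolerance


# row_count_is_suspicious([10_200, 9_800, 10_050], current_count=1_500)  -> True
```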

What’s it look like?

Data Quality Metrics Record count in a run

Metric 8: # of Tasks Read From Dataset

Who’s it for? 

Data engineer

How to track it? 

Track how many tasks read from a certain dataset. The more tasks read from it, the more central and important that dataset is. 

Why it’s important?

Understanding a dataset’s importance is crucial for impact analysis and for deciding how quickly an issue needs to be dealt with.

What’s it look like?

Data Quality Metrics - Tasks Read from Dataset

Metric 9: Data Freshness (SLA alert)

Who’s it for? 

Data Engineers, Data Scientists, Data Analysts

How to track it? 

Track the scheduled pipelines that are expected to write to a certain dataset.

Why it’s important?

Stale, out-of-date data can feed downstream reports incorrectly and cause wrong information to be consumed.

A good way of tracking data freshness is to monitor your SLAs and get notified of delays in the pipelines that should write to the dataset.
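
In its simplest form, a freshness check just compares the last successful write time against the SLA window; the timestamps and the 6-hour SLA below are only examples:

```python
from datetime import datetime, timedelta, timezone


def freshness_breached(last_write: datetime, sla: timedelta) -> bool:
    """True when the dataset has not been written to within its SLA window."""
    return datetime.now(timezone.utc) - last_write > sla


# Example with a hypothetical 6-hour SLA:
# freshness_breached(datetime(2022, 4, 20, 3, 0, tzinfo=timezone.utc), timedelta(hours=6))
```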

What’s it look like?

Data Quality Metrics SLA alert

Wrapping it up

And that’s a quick look at some of the top data quality metrics you need to know to deliver more trustworthy data to the business. 

Check out how you can build all these metrics in Databand today.

What is a Data Catalog? Overview and Top Tools to Know

Databand
2022-04-14 12:01:00

Intro to Data Catalogs

A data catalog is an inventory of all of an organization’s data assets. A data catalog includes assets like machine learning models, structured data, unstructured data, data reports, and more. By leveraging data management tools, data analysts, data scientists, and other data users can search through the catalog, find the organizational data they need, and access it.

Governance of data assets in a data catalog is enabled through metadata. The metadata is used for mapping, describing, tagging, and organizing the data assets. As a result, it can be leveraged to enable data consumers to efficiently search through assets and get information on how to use the data. Metadata can also be used for augmenting data management, by enabling onboarding automation, anomaly alerts, auto-scaling, and more.

In addition to indexing the assets, a data catalog usually includes data access and data searching capabilities, as well as tools for enriching the metadata, both manually and automatically. It also provides capabilities for ensuring compliance with privacy regulations and security standards.

In modern organizations, data catalogs have become essential for leveraging the large amounts of data generated. Efficient data analysis and consumption can help organizations make better decisions, so they can optimize operations, build better models, increase sales, and more.

Data Catalog Benefits (Why Do You Need a Data Catalog?)

A data catalog provides multiple benefits to data professionals, business analysts, and organizations. These include:

User Autonomy

Data professionals and other data consumers can find data, evaluate it and understand how to use it – all on their own. With a data catalog, they no longer have to rely on IT or other professional personnel. Instead, they can immediately search for the data they need and use it. This speed and independence enable injecting data into more business operations. It also improves employee morale.

Improved Data Context and Quality

The metadata and comments on the data from other data citizens can help data consumers better understand how to use it. This additional information creates context, improves data quality, and encourages data usage, innovation, and new business ideas.

Organizational Efficiency

Accessible data reduces operational friction and bottlenecks, like back and forth emails, which optimizes the use of organizational resources. Available data also accelerates internal processes. When data consumers get the data and understand how to use it faster, data analysis and implementation take place faster as well, benefiting the business.

Compliance and Security 

Data catalogs ensure data assets comply with privacy standards and security regulations, reducing the risk of data breaches, cyberattacks, and legal fiascos.

New Business Opportunities

By giving data citizens new information they can incorporate into their work and decision-making, a data catalog helps them find new ways to answer work challenges and achieve their business goals. This can open up new business opportunities across all departments.

Better Decision Making

Lack of data visibility makes organizations rely on tribal knowledge, rely on data they are already familiar with, or recreate assets that already exist. This creates organizational data silos, which impede productivity. Enabling data access to everyone improves the ability to find and use data consistently and continuously across the organization.

What Does a Data Catalog Contain?

Different data catalogs offer somewhat different features. However, to enable data governance and advanced analysis, they should all provide the following to data consumers:

Metadata

Technical Metadata

The data that describes the structure of the objects, like tables, schemas, columns, rows, file names, etc.

Business Metadata

Data about the business value of the data, like its purpose, compliance info, rating, classification, etc.

Process Metadata

Data about the asset creation process and lineage, like who changed it and when, permissions, latest update time, etc.
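
One illustrative way (not a prescription of any particular catalog’s data model) to picture how these three metadata layers hang together for a single cataloged asset:

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Illustrative shape of one cataloged asset and its three metadata layers."""
    # Technical metadata: the structure of the object itself
    name: str
    schema: dict                                # column -> type
    # Business metadata: what the data is for and how it is classified
    purpose: str = ""
    classification: str = "internal"
    # Process metadata: how the asset was produced and changed
    owner: str = ""
    last_updated_by: str = ""
    upstream_assets: list = field(default_factory=list)


# entry = CatalogEntry(name="marts.price_index",
#                      schema={"region": "str", "median_price": "float"},
#                      purpose="Monthly pricing report", owner="data-platform")
```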

Search Capabilities

Searching, browsing, and filtering options to enable data consumers to easily find the relevant data assets.

Metadata Enrichment

The ability to automatically enrich metadata through mappings and connections, as well as letting data citizens manually contribute to the metadata.

Compliance Capabilities

Embedded capabilities that ensure data can be trusted and no sensitive data is exposed. This is important for complying with regulations, standards, and policies. 

Asset Connectivity

The ability to connect to and automatically map all types of data sources your organization uses, wherever they reside.

In addition, technologically advanced and enterprise-grade data catalogs implement AI and machine learning capabilities.

Data Catalog Use Cases

Data catalogs can and should be consumed by all people in the organization. Some popular use cases include:

  • Optimizing the data pipeline
  • Data lake modernization
  • Self-service analytics
  • Cloud spend management
  • Advanced analytics
  • Reducing fraud risk
  • Compliance audits
  • And more

Who Uses a Data Catalog?

A data catalog can be used by data-savvy citizens, like data analysts, data scientists and data engineers. But all business employees – product, marketing, sales, customer success, etc – can work with data and benefit from a data catalog. Data catalogs are managed by data stewards.

Top 10 Data Catalog Tools

Here are the top 10 data catalog tools according to G2, as of Q1 2022:

1. AWS

  • Product Name: AWS Glue
  • Product Description: AWS Glue is a serverless data integration service for discovering, preparing, and combining data for analytics, machine learning and application development. Data engineers and ETL developers can visually create, run, and monitor ETL workflows. Data analysts and data scientists can enrich, clean, and normalize data without writing code. Application developers can use familiar Structured Query Language (SQL) to combine and replicate data across different data stores.

2. Aginity

  • Product Name: Aginity
  • Product Description: Aginity provides a SQL coding solution for data analysts, data engineers, and data scientists so they can find, manage, govern, share and re-use SQL rather than recode it.

3. Alation

  • Product Name: Alation Data Catalog
  • Product Description: Alation’s data catalog indexes a wide variety of data sources, including relational databases, cloud data lakes, and file systems using machine learning. Alation enables company-wide access to data and also surfaces recommendations, flags, and policies as data consumers query in a built-in SQL editor or search using natural language. Alation connects to a wide range of popular data sources and BI tools through APIs and an Open Connector SDK to streamline analytics.

4. Collibra

  • Product Name: Collibra Data Catalog
  • Product Description: Collibra ensures teams can quickly find, understand and access data across sources, business applications, BI, and data science tools in one central location. Features include out-of-the-box integrations for common data sources, business applications, BI and data science tools; machine learning-powered automation capabilities; automated relationship mapping; and data governance and privacy capabilities.

5. IBM

  • Product Name: IBM Watson Knowledge Catalog
  • Product Description:  A data catalog tool based on self-service discovery of data, models and more. The cloud-based enterprise metadata repository activates information for AI, machine learning (ML), and deep learning. IBM’s data catalog enables stakeholders to access, curate, categorize and share data, knowledge assets and their relationships, wherever they reside.

6. Appen

  • Product Name: Appen
  • Product Description: Appen provides a licensable data annotation platform for training data use cases in computer vision and natural language processing. In order to create training data, Appen collects and labels images, text, speech, audio, video, and other data. Its Smart Labeling and Pre-Labeling features, which use machine learning, ease human annotation.

7. Denodo

  • Product Name: Denodo
  • Product Description: Denodo provides data virtualization that enables access to the cloud, big data, and unstructured data sources in their original repositories. Denodo enables the building of customized data models for customers and supports multiple viewing formats.

8. Oracle

  • Product Name: Oracle Enterprise Metadata Management 
  • Product Description: Oracle Enterprise Metadata Management harvests metadata from Oracle and third-party data integrations, business intelligence, ETL, big data, database, and data warehousing technologies. It enables business reporting, versioning, and comparison of metadata models, metadata search and browsing, and data lineage and impact analysis reports.

9. Unifi

  • Product Name: Unifi Data Catalog
  • Product Description: A standalone Data Catalog with intuitive natural language search powered by AI, collaboration capabilities for crowd-sourced data quality, views of trusted data, and all fully governed by IT. The Unifi Data Catalog offers data source cataloging, search and discovery capabilities throughout all data locations and structures, auto-generated recommendations to view and explore data sets and similar data sets, integration to catalog Tableau metadata, and the ability to deconstruct TWBX files and see the full lineage of a data source to see how data sets were transformed.

10. BMC

  • Product Name: Catalog Manager for IMS
  • Product Description: A system database that stores metadata about databases and applications. Catalog Manager for IMS enables viewing IMS catalog content, reporting on the control block information in the IMS catalog, and creating jobs to do DBDGENs, PSBGENs, and ACBGENs to populate the catalog.

Data Lakes and Data Catalogs

A data catalog can organize and govern data that reside in repositories, data lakes, data warehouses, or other locations. A data catalog can help organize the unstructured data in the data lake, preventing it from turning into a “data swamp”. As a result, data scientists and data analysts can easily pull data from the lake, evaluate it and use it.

A Data Catalog and Databand

Databand is a proactive observability platform for monitoring and controlling data quality, as early as ingestion. By integrating Databand with your data catalog, you can gain extended lineage, and visualize and observe the data from its source and as it flows through the pipelines all the way to the assets the data catalog maps and governs. As a result, data scientists, engineers and other data professionals can see and understand the complete flow of data, end-to-end.

In addition, by integrating Databand with your data catalog, you can get proactive alerts any time your data quality is affected to increase governance and robustness. This is enabled through Databand’s data quality identification capabilities, combined with how data catalogs map assets to owners. Databand will communicate any data quality issues to the relevant data owners.