
Why data engineers need a single pane of glass for data observability

Databand
2022-03-11 15:13:43

Data engineers manage data from multiple sources and throughout pipelines. But what happens when a data deficiency occurs? Data observability provides engineers with the information and recommendations they need to fix data issues without combing through huge piles of data. Read on to learn what data observability is and the best way to implement it.

If you’re interested in learning more, you can listen to the podcast this blog post is based on below or here.

What is Data Observability?

In today’s world, we have the capacity to track almost any piece of data. But finding the relevant information in such huge volumes of data is not always easy. Data observability is the set of techniques and methodologies that brings the right, relevant information about data to data engineers at the right time, so they can understand problems and solve them.

Data observability provides data engineers with metrics and recommendations that help them understand how the system is operating. Through observability, data engineers can better set up systems and pipelines, observe the data as it flows through the pipeline, and investigate how it affects data that is already in their warehouse. In other words, data observability makes it easier for engineers to access their data and act upon any issues that occur.

With data observability, data engineers can answer questions like:

  • Are my pipelines running with the correct data?
  • What happens to the data as it flows through the pipelines?
  • What does my data look like once it’s in the warehouse, data lake, or lakehouse?

Why We Need Data Observability

Achieving observability is never easy, and ingesting data from multiple sources makes it even harder. Enterprises often work with hundreds of sources, and even nimble startups rely on a considerable number of data sources for their products. Yet today’s data engineering teams aren’t equipped with the tools and resources to manage all that complexity.

As a result, engineers find it difficult to ensure the reliability and quality of the data coming in and flowing through the pipelines. Schema changes, missing data, null records, failed pipelines, and more – all impact how the business can use data. If engineers can’t identify and fix data deficiencies before they have a business impact, the business can’t rely on the data.

Achieving Data Observability with a Single Pane of Glass

The data ecosystem is fairly new and constantly changing. New open source and commercial solutions emerge all the time. As a result, the modern data stack is made up of multiple point solutions for data engineers: ETL tools, operational analytics platforms, data warehouses, transformation tools like dbt, extraction and loading tools, and more. This fragmentation makes it hard for organizations to manage and monitor their data pipelines.

A recent customer in the cryptocurrency industry put it this way:

“We spend a lot of time fixing operational issues due to fragmentation of our data stack. Tracking data quality, lineage, and schema changes becomes a nightmare.”

The one solution missing from this stack is a single overarching operating system for orchestrating, integrating, and monitoring the stack – i.e., a single tool for data observability. A single pane of glass for observability could enable engineers to look at various sources of data in a single place and see what has changed. They could identify changed schemas or faulty columns, and then build automated checks to ensure those errors wouldn’t recur.
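
As a rough illustration of what such an automated check could look like, here is a minimal sketch in Python. The expected schema, column names, and file path are assumptions made up for the example, not part of any specific product:

    import pandas as pd

    # Hypothetical expected schema for an incoming batch: column name -> dtype.
    EXPECTED_SCHEMA = {
        "order_id": "int64",
        "customer_id": "int64",
        "amount": "float64",
    }

    def check_schema(df: pd.DataFrame) -> list:
        """Return a list of schema problems found in the incoming batch."""
        problems = []
        for column, dtype in EXPECTED_SCHEMA.items():
            if column not in df.columns:
                problems.append(f"missing column: {column}")
            elif str(df[column].dtype) != dtype:
                problems.append(f"{column}: expected {dtype}, got {df[column].dtype}")
        for column in df.columns:
            if column not in EXPECTED_SCHEMA:
                problems.append(f"unexpected column: {column}")
        return problems

    # Usage in a pipeline step: fail fast instead of loading a bad batch.
    batch = pd.read_parquet("orders_batch.parquet")  # hypothetical file
    issues = check_schema(batch)
    if issues:
        raise ValueError(f"Schema check failed: {issues}")

A check like this runs on every batch, so a schema change at the source is caught at the point of entry instead of surfacing later in the warehouse.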

For engineers, this is a huge time saver. For organizations, this means they can use their data for making decisions.

As we see the data ecosystem flow from fragmentation to consolidation, here are a few features a data observability system should provide data engineers with:

  • Visualization – enabling data engineers to see data reads, writes, and lineage throughout the pipeline, and the impact of new data on warehouse data
  • Supporting all data sources – showing data from all sources, and showing it as early as ingestion
  • Supporting all environments – observing all environments, pre-prod and prod, existing and new
  • Alerts – notifying data engineers about anomalies, missed data deliveries, irregular volumes, pipeline failures, or schema changes, and providing recommendations for fixing issues (see the sketch after this list)
  • Continuous testing – running through data sources, tables, and pipelines multiple times a day, and even in real time if your business case requires it (like in healthcare or gaming)
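
To make the alerting idea concrete, here is a minimal sketch of one such check: flagging an irregular daily volume by comparing today’s row count against a recent baseline. This is an illustration, not Databand’s implementation; the threshold, history window, and table name are assumptions:

    import statistics

    def volume_is_anomalous(history, today_count, z_threshold=3.0):
        """Flag today's row count if it deviates strongly from the recent baseline."""
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            return today_count != mean
        return abs(today_count - mean) / stdev > z_threshold

    # Example: the last seven daily row counts for a table, then today's load.
    history = [10_120, 9_980, 10_340, 10_055, 10_210, 9_895, 10_100]
    if volume_is_anomalous(history, today_count=1_250):
        print("ALERT: irregular volume for table 'orders' - investigate before it reaches downstream users")

In practice the alert would go to a messaging or incident tool rather than stdout, but the comparison against a learned baseline is the core of volume anomaly detection.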

Databand provides unified visibility for data engineers across all data sources. Learn more here.

How to ensure data quality, value, and reliability

Databand
2022-02-23 09:46:43

The quality of data downstream relies directly on data quality in the first mile. Accurate and reliable data, starting as early as ingestion, ensures that the data used downstream for analytics, visualization, and data science is of high value.

For a business, this makes all the difference between benefiting from the data and having it play second fiddle when making decisions. In this blog post, we describe the importance of data quality, how to audit and monitor your data, and how to get your leadership, colleagues, and board on board.

Topics covered:

  • Proactive Data Observability
  • Auditing Data for Quality
  • Data Quality or Data Value?
  • How to Approach the C-level and the Board
  • How to Train Internally
  • The Curse of the “Other”
  • Best Practices for Getting Started: Ensuring Data Quality Across the Enterprise

Proactive Data Observability

Managing data is like running a marathon. Many factors determine the end result, and it is a long process. However, if a runner trips and hurts her ankle in the first mile, she will not successfully complete the marathon. Similarly, if data isn’t monitored as early as ingestion, the rest of the pipeline will be negatively impacted.

How can we ensure data governance during this first mile of the data journey?

Data enters the pipeline from various sources: external APIs, data drops from outside providers, pulling from a database, etc. Monitoring data at the ingestion points ensures data engineers can gain proactive observability of the data coming in.

This enables them to wrangle and fix data to assure the process is healthy and reliable from the get-go.

By gaining proactive observability of data pipelines, data engineers can:

  • Trust the data
  • Easily identify breaking points
  • Quickly fix issues before they arrive at the warehouse or dashboard

Auditing Data for Quality

Data engineers who want to review their pipeline or audit and monitor an external data source can use the following questions during their evaluation:

  1. What’s the coverage scope?
  2. How is the data being tracked?
  3. Is there a master data reference that includes requirements and metadata?
  4. Is the customer defined in the right way?
  5. Is there a common hierarchy?
  6. Do the taxonomies leverage the business requirements?
  7. Are geographies correctly set?
  8. Are there any duplicates?
  9. Was the data searched before creating new entities?
  10. Is the data structured to enable seamless integrations and interoperability?
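
A few of these questions lend themselves to automated checks. The sketch below is illustrative only; the columns (customer_id, country, email), the duplicate key, and the allowed geographies are assumptions, not a prescribed standard:

    import pandas as pd

    ALLOWED_GEOGRAPHIES = {"US", "CA", "GB", "DE", "FR"}

    def audit_source(df: pd.DataFrame) -> dict:
        """Answer a handful of the audit questions above with automated checks."""
        return {
            # 8. Are there any duplicates?
            "duplicate_customers": int(df.duplicated(subset=["customer_id"]).sum()),
            # 1. What's the coverage scope? (here: share of rows with a usable email)
            "email_coverage": float(df["email"].notna().mean()),
            # 7. Are geographies correctly set?
            "invalid_geographies": int((~df["country"].isin(ALLOWED_GEOGRAPHIES)).sum()),
        }

    # Example usage against a small extracted customer table.
    customers = pd.DataFrame({
        "customer_id": [1, 2, 2, 3],
        "country": ["US", "CA", "CA", "XX"],
        "email": ["a@example.com", None, "b@example.com", "c@example.com"],
    })
    print(audit_source(customers))  # {'duplicate_customers': 1, 'email_coverage': 0.75, 'invalid_geographies': 1}

Questions about master data references, hierarchies, and taxonomies still need human review, but checks like these can run automatically on every delivery.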

Now that we’ve covered how data engineers can approach data quality, let’s see how to get buy-in from additional stakeholders in the enterprise.

Data Quality or Data Value?

Data engineers often talk about the quality of data. However, by changing the conversation to the value of the data, additional stakeholders in the organization can be encouraged to take a more significant part in the data process. This is important for getting attention, resources, and ongoing assistance.

To do so, we recommend talking about how the data aligns with business objectives. Otherwise, external stakeholders might think the conversation revolves only around cleaning up data.

4 Criteria for Determining Data Value – for Engineers and the Business:

  • Relevancy – Does the data meet the business objective?
  • Coverage – Does the data cover the entire market, enabling the enterprise to put it into play?
  • Structure – Is the data structured so the enterprise can use it?
  • Accuracy – Is the data complete and correct?

How to Approach the C-level and the Board

By shifting the conversation to the value of the data rather than its quality, the C-level and the board can be encouraged to invest more resources into the data pipeline. Here’s how to approach them:

  1. Begin with the reasons why managing data is of strategic importance to your enterprise. Show how data can help execute strategic intentions.
  2. Explain how managing and analyzing data can help the company get to where it needs to go. Show how data can grow, improve, and protect the business. You can weave in the four criteria from before to emphasize your points.
  3. Connect the data to specific departments. Show how data can help improve operational efficiency, grow sales and mitigate risk. No other department can claim to help grow, improve and protect all departments to the same extent that data engineering can.
  4. Do not focus on the process and the technology – otherwise, you will have a very small audience.

How to Train Internally

In addition to the company’s leadership, it’s also important to get the rest of the company on board. This will help with data analysis and monitoring. Data engineers often need the company’s employees to participate in the ongoing effort of maintaining data. For example, salespeople are required to fill out multiple fields in a CRM when adding a new opportunity.

We recommend investing time in people management, i.e., training and ensuring everyone is on the same page regarding the importance of data quality. For example, explaining how identifying discrepancies accurately can help discover a business anomaly (rather than a data anomaly, which could happen if people don’t consistently and comprehensively update data).

The Curse of the “Other”

Data value auditing is crucial because data quality directly impacts the ability to make decisions on top of the data. If you need an example to convince employees to participate in data management, remind them of “the curse of the ‘other’.”

When business units like marketing, product, and sales monitor dashboards, and a big slice is titled “other”, they do not have all the data they need and their decision-making is impaired. This is the result of a lack of data management and data governance.

Best Practices for Getting Started: Ensuring Data Quality Across the Enterprise

How can data engineers turn data quality from an abstract theory into practice? Let’s tie up everything we’ve covered into an actionable plan.

Step 1 – Audit the Data Situation

First, assess which domains should be covered and how well they are being managed. This includes data types like:

  • Relationship data: with customers, vendors, partners, prospects, citizens, patients, and clients
  • Brand data: products, services, offerings, banners, etc.

Identify the mistakes at the different pipeline stages, starting from ingestion.

Step 2 – Showcase the Data Pipeline

Present the data situation to the various stakeholders. Show how the data is managed from the entry point to the end product. Then, explain how the current data value is impacting their decisions. Present the error points and suggest ways to fix them.

Step 3 – Prioritize Issues to Fix

Build a prioritized plan for driving change. Determine which issues to fix first. The plan should cover identifying sources and how they send data, internal data management, and employee training. Get buy-in for the plan, and proceed to execute it.

Conclusion

Ensuring data quality is the responsibility of data engineers and the entire organization. Monitoring data quality starts at the source. However, by getting buy-in from employees and management, data engineers can ensure they will get the resources and attention needed to monitor and fix data issues throughout the pipeline, and help the business grow.
To try out Databand, the observability platform for data quality and value, click here.

Ensuring data quality in healthcare: challenges and best practices

Databand
2022-02-11 14:50:11

The healthcare industry is very data-intensive. Multiple actors and organizations are transmitting large amounts of sensitive information. Data engineers in healthcare are tasked with ensuring data quality and reliability. This blog provides insights into how data engineers can proactively ensure data quality and prevent common errors by building the right data infrastructure and monitoring as early as ingestion.

This blog post is based on the podcast episode “Proactive Data Quality for Data-Intensive Organizations” with Johannes Leppae, Sr. Data Engineer at Komodo Health, which you can listen to below or here.

The Role of Data in Healthcare

The healthcare industry is made up of multiple institutions, service providers, and professionals. These include suppliers, doctors, hospitals, healthcare insurance companies, biopharma companies, laboratories, pharmacies, caregivers, and more. Each of these players creates, consumes, and relies on data for their operations.

High-quality and accurate data is essential for providing quality healthcare at low costs. For example, when running clinical trials, data is required to analyze patient populations, profile sites of care, alert when intervention is needed, and monitor the patient journey (among other needs).

Quality data will ensure a clinical trial is successful, resulting in better and faster patient treatment. However, erroneous or incomplete data could yield biased or noisy results, which could have severe consequences for patients.

Data Quality Challenges in Healthcare

Data engineers in healthcare need to reliably and seamlessly link together different types of sources and data. Then, they need to analyze the data to ensure it is complete and comprehensive so the downstream users have complete visibility.

However, the complexity of the healthcare system and the sensitivity of its data pose several data quality challenges for data engineers:

  • Fragmentation – Data is divided between many data assets, each containing a small piece of information.
  • Inconsistency – Data is created differently at each source. This includes variance between interfaces, filetypes, encryptions, and more.
  • Maintaining privacy – In many cases, like clinical trials, data needs to be de-identified to protect patients and ensure results are not biased.
  • Source orchestration – Ingesting data from multiple sources creates a lot of overhead when monitoring data.
  • Domain knowledge – Processing and managing healthcare data requires industry-specific knowledge since the data is often subject to medical business logic.

Ensuring Data Quality as Early as Ingestion

To overcome these challenges, data engineers need to find methods for monitoring errors. Data engineers can ensure that any issues are captured early by getting the data ready at the ingestion point. This prevents corrupt data from reaching downstream users, assures regulation compliance, and ensures data arrives on time. Early detection also saves data engineers from having to rerun pipelines when issues are found.

How big is the detection difference? Early detection enables identifying issues within hours. Later in the pipeline, the same issue could take days to detect.

One recommended way to ensure and monitor data quality is through structure and automation. The ingestion pipeline includes the following steps (among others):

  • Extraction of data files from external sources
  • Consolidating any variations
  • Pipeline orchestration
  • Raw data ingestion
  • Unifying file formats
  • Validation

To enable automation and scalability, it is recommended to create a unified structure across all pipelines and enforce systematic conventions for each stage.

For example, each delivery can carry metadata such as source identification, environment, and data stream. These conventions are then checked in the validation step before the data files move downstream.
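
As a rough sketch of such a validation step, the check below enforces an assumed filename convention of the form <source>_<environment>_<stream>_<YYYYMMDD>.csv; the exact convention is an illustration, not a standard:

    import re
    from pathlib import Path

    # Assumed convention: <source>_<environment>_<stream>_<YYYYMMDD>.csv
    FILENAME_PATTERN = re.compile(
        r"^(?P<source>[a-z0-9]+)_(?P<env>prod|preprod)_(?P<stream>[a-z0-9]+)_(?P<date>\d{8})\.csv$"
    )

    def validate_delivery(path: Path) -> dict:
        """Check that a delivered file follows the agreed naming convention
        and return the metadata (source, environment, stream, date) it carries."""
        match = FILENAME_PATTERN.match(path.name)
        if not match:
            raise ValueError(f"{path.name} does not follow the agreed convention")
        return match.groupdict()

    # A conforming file passes and yields its metadata; anything else fails early.
    print(validate_delivery(Path("claims_prod_pharmacy_20220211.csv")))

Because every pipeline shares the same convention, the same validation code can be reused across all of them, which is what makes the approach scale.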

How to Deal with Data Quality Issues

The challenges of data-intensive ingestion sometimes require creative solutions. In the podcast this blog post is based on, Johannes describes the following scenario his data engineering team deals with constantly.

A common delivery issue in healthcare is late data deliveries. Komodo Health’s systems had logic that matched each file’s date with the execution date. However, since files were often sent late, the dates didn’t match and the pipeline wouldn’t find the file, forcing the team to rerun the pipeline manually. To overcome this, the data engineering team changed the logic so that the pipeline picked up files based on the file’s own timestamp rather than the execution date. Late deliveries were then captured automatically, without further manual intervention.
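
As a hedged illustration of that change (this is not Komodo Health’s actual code; the directory layout, filename convention, and lookback window are assumptions), the pipeline can scan for every file whose embedded date falls inside a delivery window instead of demanding an exact match with the execution date:

    from datetime import date, timedelta
    from pathlib import Path

    def files_in_window(delivery_dir: Path, run_date: date, lookback_days: int = 3):
        """Pick up every file whose embedded date falls within the delivery window,
        so late deliveries are captured without a manual rerun."""
        window_start = run_date - timedelta(days=lookback_days)
        picked = []
        for path in delivery_dir.glob("*.csv"):
            # Assumed filename convention: <stream>_<YYYY-MM-DD>.csv
            stamp = path.stem.rsplit("_", 1)[-1]
            try:
                file_date = date.fromisoformat(stamp)
            except ValueError:
                continue  # skip files that don't carry a parsable date
            if window_start <= file_date <= run_date:
                picked.append(path)
        return picked

The width of the window is a trade-off: wide enough to absorb typical lateness, narrow enough not to reprocess old deliveries.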

In some cases, however, fixing issues requires going back to the source and asking its team to fix the data. To minimize these cases and the friction they might cause, it’s recommended to create agreements when setting up the process, so everyone is on the same page. The agreement should cover expectations, delivery standards, and SLAs, among other things.

You can also make suggestions that will help with deliveries. For example, when deliveries have multiple files, ask the source to add a manifest file that states the number of files, the number of records for each file, and the last file being sent.
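
For instance, a minimal reconciliation against such a manifest might look like the sketch below; the JSON layout is an assumption, and the point is simply comparing the declared file and record counts with what actually arrived:

    import json
    from pathlib import Path

    def verify_manifest(delivery_dir: Path) -> list:
        """Compare a delivery against its manifest and report any mismatches."""
        manifest = json.loads((delivery_dir / "manifest.json").read_text())
        problems = []
        for entry in manifest["files"]:  # e.g. {"name": "claims_1.csv", "records": 5000}
            path = delivery_dir / entry["name"]
            if not path.exists():
                problems.append(f"missing file: {entry['name']}")
                continue
            records = sum(1 for _ in path.open()) - 1  # subtract the header line
            if records != entry["records"]:
                problems.append(
                    f"{entry['name']}: manifest says {entry['records']} records, found {records}"
                )
        return problems

If the manifest and the delivery disagree, the batch can be held back and the source notified before anything moves downstream.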

Catching issues and bad batches of data on time is very important, since they can significantly impact downstream users. Caution is especially important in healthcare, since analyses and life-and-death decisions are made based on the data.

Choosing the Right Tools for Healthcare Data Engineering

Data engineers in healthcare face multiple challenges and require tools to assist them. While some prefer homegrown tools that support flexibility, buying a tool can relieve some of the effort and free engineers up for dealing with data quality issues.

When choosing a tool, it’s recommended to:

  1. Determine non-negotiables – features and capabilities the tool has to support.
  2. Decide on nice-to-haves – abilities that could help and make your life easier.
  3. Understand the roadmap – to see which features are expected to be added and determine how much influence you have over it.

Whichever tool you choose, make sure to see a demo of it. To see a demo of Databand, which enables data quality monitoring as early as ingestion, click here.
To learn more about data-intensive organizations and hear the entire episode this blog post was based on, visit our podcast, here.