How to ensure data quality, value, and reliability

2022-02-23 09:46:43

Downstream data quality depends directly on data quality in the first mile. Data that is accurate and reliable at ingestion makes the data used downstream for analytics, visualization, and data science far more valuable.

For a business, this makes all the difference between benefiting from the data and having it play second fiddle when making decisions. In this blog post, we describe the importance of data quality, how to audit and monitor your data, and how to get your leadership, colleagues, and board – on board.

Topics covered:

  • Proactive Data Observability
  • Auditing Data for Quality
  • Data Quality or Data Value?
  • How to Approach the C-level and the Board
  • How to Train Internally
  • The Curse of the “Other”
  • Best Practices for Getting Started: Ensuring Data Quality Across the Enterprise

Proactive Data Observability

Managing data is like running a marathon. Many factors determine the end result, and it is a long process. However, suppose a runner trips and hurts her ankle at that first mile. In that case, she will not successfully complete the marathon. Similarly, if data isn’t monitored as early as ingestion, the rest of the pipeline will be negatively impacted.

How can we ensure data governance during this first mile of the data journey?

Data enters the pipeline from various sources: external APIs, data drops from outside providers, database pulls, etc. Monitoring data at the ingestion points gives data engineers proactive observability of the data coming in.

This enables them to wrangle and fix data to ensure the process is healthy and reliable from the get-go.

By gaining proactive observability of data pipelines, data engineers can:

  • Trust the data
  • Easily identify breaking points
  • Quickly fix issues before they arrive at the warehouse or dashboard

Auditing Data for Quality

Data engineers who want to review their pipeline or audit and monitor an external data source can use the following questions during their evaluation:

  1. What’s the coverage scope?
  2. How is the data being tracked?
  3. Is there a master data reference that includes requirements and metadata?
  4. Is the customer defined in the right way?
  5. Is there a common hierarchy?
  6. Do the taxonomies leverage the business requirements?
  7. Are geographies correctly set?
  8. Are there any duplicates?
  9. Was the data searched before creating new entities?
  10. Is the data structured to enable seamless integrations and interoperability?
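
Some of these questions can be partly automated. As a minimal sketch for question 8, the function below groups records by a normalized key and reports the keys that appear more than once. The function name, the record shape, and the choice of key fields are assumptions made for illustration.

```python
from collections import defaultdict


def find_duplicates(records: list[dict], key_fields: tuple[str, ...]) -> dict[tuple, list[int]]:
    """Group record indexes by a normalized key; return only keys seen more than once."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        # Normalize whitespace and case so "Acme " and "acme" count as the same entity.
        key = tuple(str(record.get(field, "")).strip().lower() for field in key_fields)
        groups[key].append(index)
    return {key: indexes for key, indexes in groups.items() if len(indexes) > 1}
```

Real entity matching usually needs fuzzier rules than exact comparison of normalized strings, so treat this as a starting point for an audit rather than a complete deduplication step.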

Now that we’ve covered how data engineers can approach data quality, let’s see how to get buy-in from additional stakeholders in the enterprise.

Data Quality or Data Value?

Data engineers often talk about the quality of data. However, by changing the conversation to the value of the data, additional stakeholders in the organization could be encouraged to take a more significant part in the data process. This is important for getting attention, resources, and ongoing assistance.

To do so, we recommend talking about how the data aligns with business objectives. Otherwise, external stakeholders might think the conversation revolves only around cleaning up data.

4 Criteria for Determining Data Value – for Engineers and the Business:

  • Relevancy – Does the data meet the business objective?
  • Coverage – Does the data cover the entire market, enabling the enterprise to put it into play?
  • Structure – Is the data structured so the enterprise can use it?
  • Accuracy – Is the data complete and correct?

How to Approach the C-level and the Board

By shifting the conversation to the value of the data rather than its quality, the C-level and the board can be encouraged to invest more resources into the data pipeline. Here’s how to approach them:

  1. Begin with the reasons why managing data is of strategic importance to your enterprise. Show how data can help execute strategic intentions.
  2. Explain how managing and analyzing data can help the company get to where it needs to go. Show how data can grow, improve, and protect the business. You can weave in the four criteria from before to emphasize your points.
  3. Connect the data to specific departments. Show how data can help improve operational efficiency, grow sales and mitigate risk. No other department can claim to help grow, improve and protect all departments to the same extent that data engineering can.
  4. Do not focus on the process and the technology – otherwise, you will have a very small audience.

How to Train Internally

In addition to the company’s leadership, it’s also important to get employees across the company on board. This will help with data analysis and monitoring. Data engineers often need the company’s employees to participate in the ongoing effort of maintaining data. For example, salespeople are required to fill out multiple fields in a CRM when adding a new opportunity.

We recommend investing time in people management, i.e., training and ensuring everyone is on the same page regarding the importance of data quality. For example, explain how accurately identifying discrepancies can help discover a business anomaly rather than a data anomaly (the latter can happen when people don’t update data consistently and comprehensively).

The Curse of the “Other”

Data value auditing is crucial because it directly impacts the ability to make decisions based on the data. If you need an example to convince employees to participate in data management, remind them of “the curse of the ‘other’.”

When business units like marketing, product, and sales monitor dashboards, and a big slice is titled “other”, they do not have all the data they need and their decision-making is impaired. This is the result of a lack of data management and data governance.
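
One way to make the curse measurable is to track how much of a categorical field lands in the catch-all bucket. The sketch below assumes the values are plain strings, and the set of labels treated as “other” is a hypothetical choice for the example.

```python
from collections import Counter


def other_share(categories: list[str],
                other_labels: frozenset = frozenset({"other", "unknown", ""})) -> float:
    """Return the fraction of values that fall into a catch-all bucket."""
    if not categories:
        return 0.0
    # Normalize case and whitespace before counting.
    counts = Counter(value.strip().lower() for value in categories)
    return sum(counts[label] for label in other_labels) / len(categories)
```

A share that keeps growing from one reporting period to the next suggests that new values are arriving without being mapped to a category.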

Best Practices for Getting Started: Ensuring Data Quality Across the Enterprise

How can data engineers turn data quality from an abstract theory into practice? Let’s tie up everything we’ve covered into an actionable plan.

Step 1 – Audit the Data Situation

First, assess which domains should be covered and how well they are being managed. This includes data types like:

  • Relationship data: with customers, vendors, partners, prospects, citizens, patients, and clients
  • Brand data: products, services, offerings, banners, etc.

Identify the mistakes at the different pipeline stages, starting from ingestion.

Step 2 – Showcase the Data Pipeline

Present the data situation to the various stakeholders. Show how the data is managed from the entry point to the end product. Then, explain how the current data value is impacting their decisions. Present the error points and suggest ways to fix them.

Step 3 – Prioritize Issues to Fix

Build a prioritized plan for driving change. Determine which issues to fix first. Include identifying sources and how they send data, internal data management, and training employees. Get buy-in to the plan, and proceed to execute it.


Ensuring data quality is the responsibility of data engineers and the entire organization. Monitoring data quality starts at the source. However, by getting buy-in from employees and management, data engineers can ensure they will get the resources and attention needed to monitor and fix data issues throughout the pipeline, and help the business grow.

Ensuring data quality in healthcare: challenges and best practices

2022-02-11 14:50:11

The healthcare industry is very data-intensive. Multiple actors and organizations are transmitting large amounts of sensitive information. Data engineers in healthcare are tasked with ensuring data quality and reliability. This blog provides insights into how data engineers can proactively ensure data quality and prevent common errors by building the right data infrastructure and monitoring as early as ingestion.

This blog post is based on the podcast episode “Proactive Data Quality for Data-Intensive Organizations” with Johannes Leppae, Sr. Data Engineer at Komodo Health.

The Role of Data in Healthcare

The healthcare industry is made up of multiple institutions, service providers, and professionals. These include suppliers, doctors, hospitals, healthcare insurance companies, biopharma companies, laboratories, pharmacies, caregivers, and more. Each of these players creates, consumes, and relies on data for their operations.

High-quality and accurate data is essential for providing quality healthcare at low costs. For example, when running clinical trials, data is required to analyze patient populations, profile sites of care, alert when intervention is needed, and monitor the patient journey (among other needs).

Quality data will ensure a clinical trial is successful, resulting in better and faster patient treatment. However, erroneous or incomplete data could yield biased or noisy results, which could have severe consequences for patients.

Data Quality Challenges in Healthcare

Data engineers in healthcare need to reliably and seamlessly link together different types of sources and data. Then, they need to analyze the data to ensure it is complete and comprehensive so the downstream users have complete visibility.

However, the complexity of the healthcare system and the sensitivity of its data pose several data quality challenges for data engineers:

  • Fragmentation – Data is divided between many data assets, each containing a small piece of information.
  • Inconsistency – Data is created differently at each source. This includes variance between interfaces, filetypes, encryptions, and more.
  • Maintaining privacy – In many cases, like clinical trials, data needs to be de-identified to protect patients and ensure results are not biased.
  • Source orchestration – Ingesting data from multiple sources creates a lot of overhead when monitoring data.
  • Domain knowledge – Processing and managing healthcare data requires industry-specific knowledge since the data is often subject to medical business logic.

Ensuring Data Quality as Early as Ingestion

To overcome these challenges, data engineers need methods for monitoring for errors. By getting the data ready at the ingestion point, data engineers can ensure that any issues are captured early. This prevents corrupt data from reaching downstream users, helps assure regulatory compliance, and ensures data arrives on time. Early detection also saves data engineers from having to rerun pipelines when issues are found.

How big is the detection difference? Early detection enables identifying issues within hours. Later in the pipeline, the same issue could take days to detect.

One recommended way to ensure and monitor data quality is through structure and automation. The ingestion pipeline includes the following steps (among others):

  • Extraction of data files from external sources
  • Consolidating any variations
  • Pipeline orchestration
  • Raw data ingestion
  • Unifying file formats
  • Validation

To enable automation and scalability, it is recommended to create a unified structure across all pipelines and enforce systematic conventions for each stage.

For example, collect metadata such as source identification, environment, data stream, and more. The validation step then checks these conventions before the data files move downstream.
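
As a rough illustration of such a validation step, the sketch below checks that an incoming file carries the agreed metadata before it moves downstream. The field names (`source_id`, `environment`, `data_stream`) and the allowed environment values are assumptions made for the example, not conventions described in the podcast.

```python
# Hypothetical convention: every delivered file must declare these metadata fields.
REQUIRED_FIELDS = ("source_id", "environment", "data_stream")
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}  # assumed values


def validate_metadata(metadata: dict) -> list[str]:
    """Return a list of convention violations; an empty list means the file may proceed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not metadata.get(field):
            problems.append(f"missing field: {field}")
    environment = metadata.get("environment")
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {environment}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a single run report every violation in a delivery.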

How to Deal with Data Quality Issues

The challenges of data-intensive ingestion sometimes require finding creative solutions. In the podcast this blog post is based on, Johannes describes the following scenario his data engineering team deals with constantly.

Late data deliveries are a common issue in healthcare. Komodo Health’s systems had defined logic that matched the file’s date with the execution date. However, since files were often sent late, the dates didn’t match, and the pipeline wouldn’t find the file. This required the team to rerun the pipeline manually. To overcome this issue, the data engineering team changed the logic so that the pipeline picked up all files within the file’s timestamp. Late deliveries were then captured automatically, without manual intervention.
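
The post does not spell out the exact logic Komodo Health adopted. One plausible reading is a lookback window: instead of requiring the file date to equal the execution date, the pipeline selects every unprocessed file dated within a few days of the run. The sketch below illustrates that idea only; the function name, the three-day default, and the `processed` set are hypothetical.

```python
from datetime import date, timedelta


def select_files(files: dict[str, date], execution_date: date,
                 processed: set[str], lookback_days: int = 3) -> list[str]:
    """Pick unprocessed files dated within the lookback window ending on the execution date."""
    earliest = execution_date - timedelta(days=lookback_days)
    return sorted(
        name
        for name, file_date in files.items()
        if earliest <= file_date <= execution_date and name not in processed
    )
```

Tracking processed file names matters here: without it, a window-based match would ingest the same file again on every run that falls inside the window.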

In some cases, however, fixing issues requires going back to the source and asking the source’s data engineering team to fix it. To minimize these cases and the friction they might cause, it’s recommended to create agreements to ensure everyone is on the same page when setting up the process. The agreement should include expectations, delivery standards, and SLAs, among others.

You can also make suggestions that will help with deliveries. For example, when deliveries have multiple files, ask the source to add a manifest file that states the number of files, the number of records for each file, and the last file being sent.
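
A manifest like this makes delivery checks easy to automate. The sketch below assumes a hypothetical manifest shape, `{"files": {file_name: record_count}}`, and compares it with what actually arrived. The last-file marker mentioned above is left out for brevity.

```python
def check_delivery(manifest: dict, received: dict[str, int]) -> list[str]:
    """Compare received files (name -> record count) with the manifest's expectations."""
    problems = []
    expected = manifest["files"]  # assumed shape: {"files": {name: record_count}}
    if len(received) != len(expected):
        problems.append(f"expected {len(expected)} files, received {len(received)}")
    for name, count in expected.items():
        if name not in received:
            problems.append(f"missing file: {name}")
        elif received[name] != count:
            problems.append(f"{name}: expected {count} records, got {received[name]}")
    return problems
```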

Catching issues and bad batches of data on time is very important since they could significantly impact downstream users. It is especially important to be cautious in healthcare since analyses and life-and-death decisions are being made based on the data.

Choosing the Right Tools for Healthcare Data Engineering

Data engineers in healthcare face multiple challenges and require tools to assist them. While some prefer homegrown tools that support flexibility, buying a tool can relieve some of the effort and free engineers up for dealing with data quality issues.

When choosing a tool, it’s recommended to:

  1. Determine non-negotiables – features and capabilities the tool has to support.
  2. Decide on nice-to-haves – abilities that could help and make your life easier.
  3. Understand the roadmap – to see which features are expected to be added and determine how much influence you have over it.

Whichever tool you choose, make sure to see a demo of it.

Monitoring for data quality issues as early as ingestion: here’s why

2022-01-24 15:17:13

Maintaining data quality is challenging. Data is often unreliable and inconsistent, especially when it flows from multiple data sources. To deal with quality issues and prevent them from impacting your product and decision-making, you need to monitor your data flows. Monitoring helps identify schema changes, discover missing data, catch unusual levels of null records, fix failed pipelines, and more. In this blog post, we will explain why we recommend monitoring data starting at the source (ingestion), list three potential use cases for ingestion monitoring, and finish off with three best practices for data engineers to get started.

This blog post is based on the first episode of our podcast, “Why Data Quality Begins at the Source.”

Where Should You Monitor Data Quality?

Data quality monitoring is a fairly new practice, and different tools offer different monitoring capabilities across the data pipeline. While some tools monitor data quality where the data rests, i.e., at the data warehouse, at Databand we think it’s critical to monitor data quality as early as ingestion, not just at the warehouse level. Let’s look at a few reasons why.

4 Reasons to Monitor Data Quality Early in the Data Pipeline

1. Higher Probability of Identifying Issues

Erroneous or abnormal data affects all the other data and analytics downstream. Once corrupt data has been ingested and flows to the data lake/warehouse, it might already be mixed up with healthy data and used in analyses. This makes it much more difficult to identify the errors and their source, because the dirty data can be “washed out” in the higher volumes of data that sit at rest.

In fact, the ability to identify issues is based on an engineer or analyst knowing what the expected data results should be, recognizing a problematic anomaly, and diagnosing that this anomaly is the result of data and not a business change. When the corrupt data is such a small percentage of the entire data lake, this becomes even harder.

Why take the chance of overlooking errors and problems that could impact the product, your users, and decision making? By monitoring early in the pipeline, many issues can be avoided because you are monitoring a more targeted sample of data, and are therefore able to create a more precise baseline for spotting data that looks unusual.
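
To make the idea of a per-source baseline concrete, here is a minimal sketch that compares the latest null rate of one source with that source’s own history. The three-standard-deviation threshold and the use of null rate as the metric are assumptions for the example; any per-batch metric could be used.

```python
from statistics import mean, stdev


def is_unusual(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag the latest value if it sits more than `threshold` standard deviations from the history."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return latest != baseline
    return abs(latest - baseline) / spread > threshold
```

Because the history comes from a single source, a jump that would vanish in warehouse-wide averages still stands out.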

2. Creating Confidence in the Warehouse

Analysts and additional stakeholders rely on the warehouse to make a wide variety of business and product decisions. Trusting warehouse data is essential for business agility and for making the right decisions. If data in the warehouse is “known” to have issues, stakeholders will not use it or trust it. This means the organization is not leveraging data to the full extent.

If the data warehouse is the heart of the customer-facing product, i.e., the product relies almost entirely on data, then corrupt data could jeopardize the entire product’s adoption in the market.

By assuring data quality before the data arrives at the warehouse and the main analytical system, teams can improve confidence in that “trusted layer.”

3. Ability to Fix Issues Faster

By identifying data issues faster, data engineers have more time to react. They can identify causality and lineage, and fix the data or source to prevent any harmful impact that corrupt data could have. Trying to identify and fix full-blown issues in the product or after decision-making is much harder to do.

4. Enabling Data Source Governance

By monitoring and analyzing data at its point of inception, data engineers can identify a malfunctioning source and act to fix it. This provides better governance over sources, both in real time and in the long run.

When Should You Monitor Data Quality from Ingestion?

We recommend monitoring data quality across the pipeline, from ingestion to data at rest. However, you need to start somewhere. Here are the top three use cases for prioritizing monitoring at ingestion:

  • Frequent Source Changes – When your business relies on data sources or APIs where data structure frequently changes, it is recommended to continuously monitor them. For example, a transportation application that pulls location data, user tracking information, etc. from constantly changing APIs.
  • Multiple External Data Sources – When your business’s output depends on analyzing data from dozens or hundreds of sources. For example, a real-estate app that provides listings based on data from offices, municipalities, schools, etc.
  • Data-Driven Products – When your product is based on data and each data source has a direct impact on the product. For example, navigation applications that pull data about roads, weather, transportation, etc.

Getting Started with Data Quality Monitoring

As mentioned before, data quality monitoring is a relatively new practice. Therefore, it makes sense to implement it gradually. Here are three recommended best practices:

1. Determine Quality Layers

Data changes across the pipeline, and so does its quality. Divide your data pipeline into various steps, e.g., the warehouse layer, the transformation layer, and the ingestion layer. Understand that data quality means different things at each of these stages and prioritize the layers that have the most impact on your business.

2. Monitor Different Quality Depths

When monitoring data, there are different quality aspects to review. Start with reviewing metadata and ensuring that the data structure is correct and that all the data arrived. Once metadata has been verified, move on to address explicit business-related aspects of the data, which relate to domain knowledge.
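
The two depths can be sketched as a check that runs structural tests first and applies business rules only when the structure holds. Everything here is illustrative: the function name, the row format, and the idea of passing business rules as named callables are assumptions for the example.

```python
from typing import Callable


def check_batch(rows: list[dict], expected_columns: set[str], expected_rows: int,
                business_rules: dict[str, Callable[[dict], bool]]) -> list[str]:
    """Run structural (metadata) checks first; run business rules only if the structure holds."""
    problems = []
    # Depth 1: did all the data arrive, and does it have the expected shape?
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(rows)}")
    for index, row in enumerate(rows):
        if set(row) != expected_columns:
            problems.append(f"row {index}: unexpected columns {sorted(row)}")
    if problems:
        return problems  # business rules are not meaningful on a broken structure
    # Depth 2: domain-specific rules supplied by people who know the business.
    for index, row in enumerate(rows):
        for name, rule in business_rules.items():
            if not rule(row):
                problems.append(f"row {index}: failed {name}")
    return problems
```

Stopping after the structural checks keeps the report short and avoids errors from rules that assume columns which never arrived.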

3. See Demos of Different Data Monitoring Tools

Once you’ve mapped out your priorities and pain points, it’s time to find a tool that can automate this process for you. Don’t hesitate to see demos of different tools and ask the hard questions about data quality assurance.

The ideal DataOps org structure

2021-08-27 15:13:37

The ideal data operations (DataOps) org structure

The systems an organization builds tend to mirror its internal communication structure. That’s what Melvin Conway taught us, and it applies to data engineering. If you don’t have a clearly defined data operations or “DataOps” team, your company’s data outputs will be just as messy as its inputs.

For this reason, you probably need a data operations team, and you need one organized correctly.

So first let’s back up—what is data operations?

Data operations is the process of assembling the infrastructure to generate and process data, as well as maintain it. It’s also the name of the team that does (or should do) this work—data operations, or DataOps. What does DataOps do? Well, if your company maintains data pipelines, launching one team under this moniker to manage those pipelines can bring an element of organization and control that’s otherwise lacking.

DataOps isn’t just for companies that sell their data, either. Recent history has proven you need a data operations team no matter the provenance or use of that data. Internal customer or external customer, it’s all the same. You need one team to build (or let’s be real, inherit and then rebuild) the pipelines. They should be the same people (or, for many organizations, person) who implement observability and tracking tools and monitor the data quality across its four attributes.

And of course, the people who built the pipeline should be the same people who get the dreaded PagerDuty alert when a dashboard is down, not because it’s punitive, but because it’s educational. When they have skin in the game, people build differently. It’s a good incentive and allows for better problem solving and speedier resolution.

Last but not least, that data operations team needs a mission—one that transcends simply “moving the data” from point A to point B. And that is why the “operations” part of their title is so important.

Data operations vs data management—what’s the difference?

Data operations is building resilient processes to move data for its intended purpose. All data should move for a reason. Often, that reason is revenue. If your data operations team can’t trace a clear line from that end objective, like the sales teams having better forecasts and making more money, to their pipeline management activities, you have a problem.

Without operations, problems will emerge as you scale:

  • Data duplication
  • Troubled collaboration
  • Waiting for data
  • Band-aids that will scar
  • Discovery issues
  • Disconnected tools
  • Logging inconsistencies
  • Lack of process
  • Lack of ownership & SLAs

If there’s a disconnect, you’re simply practicing plain old data management. Data management is the rote maintenance aspect of data operations. Which, while crucial, is not strategic. When you’re in maintenance mode you’re hunting down the reason for a missing column or pipeline failure and patching it up, but you don’t have time to plan and improve.

Your work becomes true “operations” when you transform trouble tickets into repeatable fixes. Like, for example, you find a transformation error coming from a partner, and you email them to get it fixed before it hits your pipeline. Or you implement an “alerts” banner on your executives’ dashboard that tells them when something is wrong so they know to wait for the refresh. Data operations, just like developer operations, aims to put repeatable, testable, explainable, intuitive systems in place that ultimately reduce effort for all.

That’s data operations vs data management. And so the question then becomes, how should that data operations team be structured?

Organizing principles for a high-performing data operations team structure

So let’s return to where we began—talking about how your system outputs reflect your organizational structure. If your data operations team is an “operations” team in name only, and mostly only maintains, you’ll probably receive a forever ballooning backlog of requests. You’ll rarely have time to come up for air to make long-term maintenance changes, like switching out a system or adjusting a process. You’re stuck in Jira or ServiceNow response hell. 

If, on the other hand, you’ve founded (or relaunched) your data operations team with strong principles and structure, you produce data that reflects your high-quality internal structure. Good data operations team structures produce good data.

Principle 1: Organize in full-stack functional work groups

Gather a data engineer, a data scientist, and an analyst into a group or “pod” and have them address together the things they might otherwise have addressed separately. Invariably, these three perspectives lead to better decisions, less fence-tossing, and more foresight. For instance, rather than the data scientist writing a notebook that doesn’t make sense and passing it to the engineer only to create a back-and-forth loop, they and the analyst can talk through what they need and the engineer can explain how it should be done.

Lots of data operations teams already work this way. “Teams should aim to be staffed as ‘full-stack,’ so the necessary data engineering talent is available to take a long view of the data’s whole life cycle,” say Krishna Puttaswamy and Suresh Srinivas at Uber. And at the travel site Agoda, the engineering team uses pods for the same reason.

Principle 2: Publish an org chart for your data operations team structure

Do this even if you’re just one person. Each role is a “hat” that somebody must wear. To have a high-functioning data operations team, it helps to know which hat is where, and who’s the data owner for what. You also need to reduce each individual’s span of control to a manageable level. Maybe drawing it out like this helps you make the case for hiring.

What is data operations team management? A layer of coordination on top of your pod structures that plays the role of servant leader. Managers project manage, coach, and unblock. Ideally, they are the most knowledgeable people on the team.

We’ve come up with our own ideal structure, pictured, though it’s a work in progress. What’s important to note is there’s one single person leading with a vision for the data (the VP). Below them are multiple leaders guiding various data disciplines towards that vision (the Directors), and below them, interdisciplinary teams who ensure data org and data features work together. (Credit to our Data Solution Architect, Michael Harper, for these ideas.)

[Figure: data operations org structure chart]

Principle 3: Publish a guiding document with a DataOps North Star metric

Picking a North Star metric helps everyone involved understand what they’re supposed to optimize for. Without such an agreement, you get disputes. Maybe your internal data “customers” complain that the data is slow. But the reason it’s slow is that you know their unstated desire is to optimize for quality first.

Common DataOps North Stars: Data quality, automation (repeatable processes), and process decentralization (aka end-user self-sufficiency).

Once you have a North Star, you can also decide on sub-metrics or sub-principles that point to that North Star, which is almost always a lagging indicator. 

Principle 4: Build in some cross-functional toe-stepping

Organize the team so different groups within it must frequently interact and ask other groups for things. These interactions can prove priceless. “Where the data scientists and engineers learn about how each other work, these teams are moving faster and producing more,” says Amir Arad, Senior Engineering Manager at Agoda. 

Amir says he finds one of the hidden values to a little cross-functional redundancy is you get people asking questions nobody on that team had thought to ask. 

“The engineering knowledge gap is actually kinda cool. It can lead to them asking us to simplify,” says Amir. “They might say, ‘But why can’t we do that?’ And sometimes, we go back and realize we don’t need that code or don’t need that server. Sometimes non-experts bring new things to the table.”

Principle 5: Build for self-service

Just as with DevOps, the best data operations teams are invisible, and constantly working to make themselves redundant. Rather than play the hero who likes to swoop in to save everybody, but ultimately makes the system fragile, play the servant leader. Aim to, as Lao Tzu put it, lead people to the solution in a way that gets them thinking, “We did it ourselves.” 

Treat your data operations team like a product team. Study your customer. Keep a backlog of fixes. Aim to make the tool useful enough that the data is actually used. 

Principle 6: Build in full data observability from day one

There is no such thing as “too early” for data monitoring and observability. The analogy that’s often used to excuse putting off monitoring is, “We’re building the plane while in flight.” Think about that visual. Doesn’t that tell you everything you need to know about your long-term survival? A much better analogy is plain old architecture. The longer you wait to assemble a foundation, the more costly it is to put in, and the more problems the lack of one creates.

Principle 7: Secure executive buy-in for long-term thinking

The decisions you make now with your data infrastructure will, as General Maximus put it, “Echo in eternity.” Today’s growth hack is tomorrow’s gargantuan, data-transforming internal system chaos nightmare. You need to secure executive support to make inconvenient but correct decisions, like telling everyone they need to pause the requests because you need a quarter to fix things.

Principle 8: Use the “CASE” method (with attribution)

CASE stands for “copy and steal everything,” a tongue-in-cheek way of saying, don’t build everything from scratch. There are so many useful microservices and open-source offerings today. Stand on the shoulders of giants and focus on building the 40% of your pipeline that actually needs to be custom, and doing it well.

If you do nothing else today, do this

Go have a look at the tickets in your backlog. How often are you reacting to rather than preempting problems? How many of the problems you’ve addressed had a clearly identifiable root cause? How many were you able to fix permanently? The more you preempt, the more you resemble a true data operations team. And, the more helpful you’ll find a data observability tool. Full visibility can help you make the transition from simply maintaining to actively improving. 

Teams that actively improve their structure actively improve their data. Internal harmony leads to external harmony, in a connection that’d make Melvin Conway proud.

What is Good Data Quality for Data Engineers?

2021-03-02 19:47:21

In theory, data quality is everyone’s problem. When it’s poor, it degrades marketing, product, customer success, and brand perception: everything. In theory, everyone should work together to fix it. But that’s in theory.

In reality, you need someone to take ownership of the problem, investigate it, and tell others what to do. That’s where data engineers come in.

In this guide, the Databand team has compiled a resource for grappling with data quality issues within and around your pipeline – not in theory, but in practice. And that starts with a discussion of what exactly constitutes data quality for data engineers.

Data quality challenges for data engineers

Their perennial challenge? That everyone involved in using the data has a different understanding of what “data” means. And it’s not really their fault.

The further someone is from the source of that data and the data pipelines that carry it, the more they tend to engage in magical thinking about how it can be used, if only for a lack of awareness. According to one data engineer we talked to when researching this guide, “Business leaders are always asking, ‘Hey, can we look at sales across this product category?’ when on the backend, it’s virtually impossible with the current architecture.”

The importance of observability

Similarly, businesses rely on the data from pipelines they can’t fully observe. Without accurate benchmarks or a seasoned professional who can sense that output values are off, you can be data-driven right off a cliff.

What are the four characteristics of data quality?

While academic conceptions of data quality provide an interesting foundation, we’ve found that for data engineers, it’s different. After diagnosing pipeline data quality issues for dozens of high-volume organizations over the last few years, we’ve concluded that engineers need a simpler and more credible map. Only with that map can you begin to conceptualize systems that will keep your data in proper order.

We’ve condensed the typical 6-7 data quality dimensions (you will find hundreds of variants online) into just four:

  • Fitness
  • Lineage
  • Governance
  • Stability

We also prefer the term “data health” to “data quality,” because it suggests it’s an ongoing system that must be managed. Without checkups, pipelines can grow sick and stop working.

Dimension 1: Fitness

Is this data fit for its intended use?

The operative word here is “intended.” No two companies’ uses are identical, so fitness is always in the eye of the beholder. To test fitness, take a random sample of records and test how they perform for your intended use.

Within fitness, look at:

  • Accuracy—does the data reflect reality? (Within reason. As they say, all models are wrong. Some are useful.)
  • Integrity—does the fitness remain high through the data’s lifecycle? (It’s a simple equation: Integrity = quality / time)
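The sampling test described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names, the record fields ("amount", "currency"), and the revenue-report rule are all hypothetical stand-ins for whatever “intended use” means in your organization.

```python
import random

def sample_fitness(records, is_fit, sample_size=100, seed=0):
    """Return the share of a random sample of records that pass is_fit."""
    rng = random.Random(seed)  # fixed seed so repeated runs are comparable
    sample = rng.sample(records, min(sample_size, len(records)))
    if not sample:
        return 0.0
    passed = sum(1 for record in sample if is_fit(record))
    return passed / len(sample)

# Hypothetical intended use: a revenue report needs a positive amount and a currency.
def fit_for_revenue_report(record):
    amount = record.get("amount")
    return isinstance(amount, (int, float)) and amount > 0 and bool(record.get("currency"))

records = [
    {"amount": 19.99, "currency": "USD"},
    {"amount": None, "currency": "USD"},   # missing amount: unfit
    {"amount": 5.00, "currency": ""},      # missing currency: unfit
    {"amount": 42.50, "currency": "EUR"},
]
score = sample_fitness(records, fit_for_revenue_report, sample_size=4)
print(f"fitness: {score:.0%}")
```

Running the same check at several points in the data’s lifecycle gives a rough view of integrity: if the score drops between ingestion and delivery, fitness is not holding up over time.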

Dimension 2: Lineage

Where did this data come from? When? Where did it change? Is it where it needs to be?

Lineage is your timeline. It helps you understand whether your data health problem starts with your provider. If it’s fit when it enters your pipeline and unfit when it exits, that’s useful information.

Within lineage, look at:

  • Source—is my data source provider behaving well? E.g. Did Facebook change an API?
  • Origin—where did the data already in my database come from? E.g. Perhaps you’re not sure who put it there.

Dimension 3: Governance

Can you control it?

These are the levers you can pull to move, restrict, or otherwise control what happens to your data. It’s the procedural stuff, like loads and transformations, but also security and access.

Within governance, look at:

  • Data controls—how do we identify which data should be governed and which should be open? What should be available to data scientists and users? What shouldn’t?
  • Data privacy—where is there currently personally identifiable info (PII)? Can we automatically redact PII like phone numbers? Can we ensure that a pipeline that accidentally contains PII fails or is killed?
  • Regulation—can we track regulatory requirements, ensure we’re compliant, and prove we’re compliant if a regulator wants to know? (Under GDPR, CCPA, NY SHIELD, etc.)
  • Security—who has access to the data? Can I control it? With enough granularity?
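As one illustration of the data privacy questions above, the sketch below redacts phone numbers and fails loudly if any remain. It rests on assumptions: the regular expression is a deliberately loose pattern for one phone number format, and the names PIIDetected, redact_phones, and assert_no_pii are hypothetical. Real PII detection would need to cover far more formats and data types.

```python
import re

# Assumption: a loose pattern for numbers like 555-123-4567, for illustration only.
PHONE_PATTERN = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")

class PIIDetected(Exception):
    """Raised so the pipeline fails instead of loading PII downstream."""

def redact_phones(text):
    """Replace anything that looks like a phone number with a placeholder."""
    return PHONE_PATTERN.sub("[REDACTED]", text)

def assert_no_pii(rows, fields):
    """Raise PIIDetected if any listed field still contains a phone number."""
    for index, row in enumerate(rows):
        for field in fields:
            value = str(row.get(field, ""))
            if PHONE_PATTERN.search(value):
                raise PIIDetected(f"row {index}, field {field!r} contains a phone number")

rows = [{"note": "Call me at 555-123-4567"}, {"note": "No contact info"}]
cleaned = [{**row, "note": redact_phones(row["note"])} for row in rows]
assert_no_pii(cleaned, ["note"])  # passes; the unredacted rows would raise
```

Raising an exception rather than logging a warning is the design choice that matters here: it turns accidental PII into a failed run instead of a silent leak.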

Dimension 4: Stability

Is the data complete and available in the right frequency?

Your data may be fit, meaning your downstream systems function, but is it as accurate as it could be, and is that consistently the case? If your data is fit, but the accuracy varies widely, or it’s only available in monthly batch updates and you need it hourly, it’s not stable.

Stability is one of the biggest areas where data observability tools can help. Pipelines are often a black box unless you can monitor what happens inside and get alerts.

To check stability, check against a benchmark dataset.

Within stability, look at:

  • Consistency—does the data going in match the data going out? If it appears in multiple places, does it mean the same thing? Are weird transformations happening at predictable points in the pipeline?
  • Dependability—the data is present when needed. E.g. If I build a dashboard, it behaves properly and I don’t get calls from leadership.
  • Timeliness—is it on time? E.g. If you pay NASDAQ for daily data, are they providing fresh data on a daily basis? Or is it an internal issue?
  • Bias—is there bias in the data? Is it representative of reality? Take, for example, seasonality in the data. If you train a model for predicting consumer buying behavior and you use a dataset from November to December, you’re going to have unrealistically high sales predictions.
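Two of the checks above, consistency and timeliness, lend themselves to simple automated tests. The sketch below is a rough approximation under stated assumptions: row counts are only a proxy for “the data going in matches the data going out,” and the function names, tolerance, and timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def check_consistency(rows_in, rows_out, tolerance=0.0):
    """True when the output row count is within tolerance of the input count."""
    if rows_in == 0:
        return rows_out == 0
    return abs(rows_in - rows_out) / rows_in <= tolerance

def check_timeliness(last_updated, max_age, now=None):
    """True when the newest data is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= max_age

# Hypothetical daily feed: data older than one day counts as late.
now = datetime(2021, 3, 2, 12, 0, tzinfo=timezone.utc)
fresh = check_timeliness(datetime(2021, 3, 2, 6, 0, tzinfo=timezone.utc),
                         timedelta(days=1), now=now)
stale = check_timeliness(datetime(2021, 2, 27, 6, 0, tzinfo=timezone.utc),
                         timedelta(days=1), now=now)
print(fresh, stale)
print(check_consistency(1000, 990, tolerance=0.05))
```

A timeliness failure alone does not say whether the provider or an internal job is at fault; pairing it with lineage information is what answers that question.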

Now, bias of this sort isn’t completely imperceptible—some observability platforms (Databand being one of them) have anomaly detection for this reason. When you have seasonality in your data, you have seasonality in your data requirements, and thus seasonality in your data pipeline behavior. You should be able to automatically account for that.

Quality data is balanced data

Good data quality for data engineers is when you have a data pipeline set up to ensure all four data quality dimensions: fitness, lineage, governance, and stability. But you must address all four.

As a data engineer, you cannot tackle one dimension of data quality without tackling all. That may seem rather inconvenient given that most engineers are inheriting data pipelines rather than building them from scratch. But such is the reality.

If you optimize for one dimension—say, stability—you may be loading data that hasn’t yet been properly transformed, and fitness can suffer. The data quality dimensions exist in equilibrium.

How to balance all four dimensions of data quality

[Graphic: the four dimensions of data quality: 1. Fitness, 2. Lineage, 3. Governance, 4. Stability]

To achieve a proper balance for data health, you need:

Data quality controls

What systems do you have for manipulating, protecting, and governing your data? With high-volume pipelines, it is not enough to trust and verify.

Data quality testing

What systems do you have for measuring fitness, lineage, governance, and stability? Things will break. You must know where, and why.

Systems to identify data quality issues

If issues do occur—if a pipeline fails to run, or the result is aberrant—do you have anomaly detection to alert you? Or if PII makes it into a pipeline, does the pipeline auto-fail to protect you from violating regulation?
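To make the idea of anomaly detection concrete, here is a minimal sketch that flags a run whose row count strays too far from recent history. It is a simple z-score test, not how any particular observability product works, and the threshold, function name, and sample counts are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, threshold=3.0):
    """Flag latest if it is more than threshold standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

# Hypothetical row counts from the last seven daily runs.
daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_940, 10_075, 10_010]
print(is_anomalous(daily_row_counts, 10_100))  # a typical day
print(is_anomalous(daily_row_counts, 0))       # a run that loaded nothing
```

A check this simple will misfire on seasonal data, since a holiday spike looks like an anomaly against an ordinary week. Comparing against the same period in earlier cycles is one way to account for that.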

In short, you need a high level of data observability, paired with the ability to act continuously.

Common data pipeline data quality issues

As a final thought, when you’re diagnosing your data pipeline issues, it’s important to draw a distinction between a problem and its root cause. Your pipeline may have failed to complete. The proximal cause could have been an error in a Spark job. But the root cause? A corruption in the dataset. If you aren’t addressing issues in the dataset, you’ll be forever addressing issues.

Examples of common data pipeline quality issues:

  • Non-Unicode characters
  • Unexpected transforms
  • Mismatched data in a migration or replication process
  • Pipelines missing their SLA, or running late
  • Pipelines that are too resource-intensive or costly
  • Finding the root cause of issues
  • Error in a Spark job, corruption in a data set
  • A big change in your data volume or sizes

The more detail you get from your monitoring tool, the better. It’s common to discover proximal causes quickly, but then take days to discover the root cause through a taxing, manual investigation. Sometimes, your pipeline workflow management tool tells you everything is okay, but a quick glance at the output reveals that nothing is okay, because the values are all blank. For instance, Airflow may tell you the pipeline succeeded, but no data actually passed through. Your code ran fine and Airflow gives you a green light, but on the data level, the output is entirely unfit.
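One way to catch the “green light, blank output” case is to validate the data itself as the last step of a task, so that an empty or mostly blank result fails the run. The sketch below is plain Python rather than any Airflow API; the function name, field names, and the 10% blank threshold are hypothetical.

```python
def validate_output(rows, required_fields, max_blank_ratio=0.1):
    """Raise ValueError if the output is empty or a required field is mostly blank."""
    if not rows:
        raise ValueError("pipeline produced no rows")
    for field in required_fields:
        blanks = sum(1 for row in rows if row.get(field) in (None, ""))
        ratio = blanks / len(rows)
        if ratio > max_blank_ratio:
            raise ValueError(f"{field!r} is blank in {ratio:.0%} of rows")
    return True

good = [{"price": 10.5}, {"price": 11.0}]
blank = [{"price": None}, {"price": ""}]

print(validate_output(good, ["price"]))
try:
    validate_output(blank, ["price"])
except ValueError as error:
    print(f"run should fail: {error}")
```

Because the function raises instead of returning a status, calling it inside a scheduled task would surface the problem as a failed run rather than a silent success.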

It takes constant checkups, and the ability to peer deeply into your pipeline, to find the right balance of fitness, lineage, governance, and stability that produces high-quality data. And high-quality data is how you support an organization in practice, not just in theory.
