What is Data Governance and Where Observability Fits In

Databand
July 25, 2022

Data is the most valuable asset for most businesses today. Or at least it has the potential to be. But to realize the full value, organizations must manage their data correctly. This management covers everything from how it’s collected to how it’s maintained and analyzed. And a big component of that is data governance.

Data governance refers to the policies, processes, roles, and technology that businesses use to ensure data availability, usability, integrity, and security. This article will explore everything you need to know about data governance, including:

  • What is it?
  • What’s the difference between data governance vs. data management?
  • Why is it important?
  • What are the components of effective data governance?
  • What are the key roles involved in it?
  • What are its best practices?

What is Data Governance?

Data governance is a core component of any big data management strategy that organizations introduce to drive insights. Effective data governance ensures quality and consistency in the data used to power critical business decisions.

At a high level, it can refer to data roles and responsibilities, data accessibility, data policies and processes, data creation procedures, data flows, and more. Digging deeper, it defines the architecture for decision-making and access rights around data, answering questions like:

  • How do we define data?
  • Where does data come from?
  • How do we confirm the quality of data?
  • How do we use data?
  • Where do we store data?
  • How do we protect data?
  • How do we organize data?
  • How do we connect data across systems?
  • How do we maintain a current inventory of our data?
  • How accurate does that data inventory need to be?

Software for data governance can either be purpose-built or baked into applications that make up the modern data stack.

What’s the Difference Between Data Governance vs. Data Management?

Data governance and data management are often used interchangeably; however, the two terms refer to different practices.

Data governance sets the strategy by introducing policies and procedures throughout the data lifecycle. Data management, meanwhile, is the practice of enforcing those policies and procedures so that the data is ready for use.

In short, data governance is the cornerstone of all data management initiatives.

Why is Data Governance Important?

In today’s data-driven world, organizations need effective data governance to be able to trust in the quality and consistency of their data. 

A strong approach to data governance benefits the entire organization by giving individuals a clear way to access data, shared terminology to discuss data, and a standard way to understand data and make it meaningful.

Some of the key benefits of data governance include:

  • Introducing a clear data quality framework to bring together data and create a shared understanding for better insights and decisions
  • Improving consistency of data across systems and processes, for efficient data integration
  • Clearly defining policies and procedures around data-related activities to ensure standardization across the entire organization
  • Outlining roles and responsibilities in terms of data management and data access for clarity among stakeholders
  • Improving compliance by allowing for faster response and resolution to data incidents

On the flip side, poor data governance can hamper regulatory compliance initiatives, which can create problems for companies when it comes to satisfying new data privacy and protection laws.

What are the Components of Effective Data Governance?

For data governance to be effective, it must encompass several key components that support the follow-on data management activities. These components include:

Data Standards

A data governance program should set explicit data standards for consistency across the entire organization. These standards should assess and verify data quality and should be transparent to everyone in the company. As a result, they should help teams better comprehend and use data.

Data standards should also allow any third-party auditors to easily see how the organization handles sensitive data, how that data gets used, and why it gets used in that way. This transparency is essential for compliance, especially in the case of a data breach.

Data Integration

Data integration brings together data from diverse sources to make data more readily available and power deeper insights. Good data governance requires a complete understanding of how data gets integrated across systems and processes. Specifically, the data governance program should define the tools, policies, and procedures used to pass data across systems and combine information.

As a best practice, these data integration guidelines should be clear and easy to follow to ensure every new system adheres to them. Additionally, the team responsible for data governance should assist in reviewing these guidelines during any new technology implementations.

Data Security

Protecting the security of data is essential, as any unauthorized access to data or even loss of data can pose serious risks – from dangers to the subjects of data to financial loss to reputational damage. A data governance framework outlines a variety of elements related to data security, including where data is stored, how it’s accessed, and what level of availability it has.

Specifically, it should detail defenses like authentication tools and encryption algorithms that need to be implemented to protect the data network. Then, any teams working on data governance should partner closely with IT security to ensure adequate protection measures are in place based on those guidelines.
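
As one concrete (and deliberately simplified) illustration of such a defense, here is a minimal sketch of field-level encryption using the cryptography package’s Fernet recipe. The field value is made up, and in practice the key would come from a secrets manager and the exact tooling would follow whatever your security team mandates.

```python
from cryptography.fernet import Fernet  # third-party package: cryptography

# Hypothetical setup: in practice the key comes from a secrets manager,
# never from source code or the dataset itself.
key = Fernet.generate_key()
cipher = Fernet(key)

# Encrypt a sensitive field before it lands in storage...
token = cipher.encrypt(b"customer@example.com")

# ...and decrypt it only along authorized, audited access paths.
assert cipher.decrypt(token) == b"customer@example.com"
```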

Data Lifecycle Management

Understanding the organization’s data lifecycle means knowing where data resides at any given time as it moves through systems until it eventually gets discarded. Good data governance allows you to quickly discover and isolate data at any point in the lifecycle.

This concept, also known as data lineage, allows analysts to trace data back to its source to confirm trustworthiness.
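
As a rough sketch of the idea (the dataset names and the shape of the lineage map are hypothetical, not any particular tool’s model), lineage can be thought of as an upstream graph that an analyst walks to trace a dataset back to its sources:

```python
# Hypothetical lineage map: each dataset points to the datasets it was derived from.
UPSTREAM = {
    "revenue_dashboard": ["orders_cleaned", "customers_cleaned"],
    "orders_cleaned": ["orders_raw"],
    "customers_cleaned": ["crm_export"],
    "orders_raw": [],      # source system
    "crm_export": [],      # source system
}

def trace_to_sources(dataset, lineage=UPSTREAM):
    """Walk the lineage graph upstream and return the original source datasets."""
    parents = lineage.get(dataset, [])
    if not parents:
        return {dataset}
    sources = set()
    for parent in parents:
        sources |= trace_to_sources(parent, lineage)
    return sources

print(trace_to_sources("revenue_dashboard"))  # {'orders_raw', 'crm_export'}
```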

Data Observability

Data observability allows you to understand the health and state of data in your system to identify and resolve issues in near real-time. It includes a variety of activities that go beyond just describing the problem, providing context to also resolve the problem and work to prevent it from recurring.

Data governance helps set the framework for data observability, setting guidelines for what to monitor, when to monitor it, and what thresholds should set off alerts when something isn’t right. A good data observability platform can handle these activities, making it important to choose a platform that can meet the requirements for identifying, troubleshooting, and resolving problems outlined in your strategy.
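
To make that concrete, here is a minimal sketch of what governance-defined monitoring thresholds and alerting might look like. The metric names and threshold values are illustrative assumptions, not any platform’s built-in API.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical governance-defined thresholds: what to monitor and when to alert.
THRESHOLDS = {
    "max_staleness": timedelta(hours=2),   # data must be no older than 2 hours
    "min_row_count": 10_000,               # expected minimum rows per load
    "max_null_rate": 0.05,                 # at most 5% nulls in key columns
}

def check_dataset_health(last_loaded_at, row_count, null_rate):
    """Return a list of alert messages for any threshold that is breached."""
    alerts = []
    staleness = datetime.now(timezone.utc) - last_loaded_at
    if staleness > THRESHOLDS["max_staleness"]:
        alerts.append(f"Data is stale: last load was {staleness} ago")
    if row_count < THRESHOLDS["min_row_count"]:
        alerts.append(f"Row count {row_count} below minimum {THRESHOLDS['min_row_count']}")
    if null_rate > THRESHOLDS["max_null_rate"]:
        alerts.append(f"Null rate {null_rate:.1%} exceeds {THRESHOLDS['max_null_rate']:.1%}")
    return alerts

# Example: a load that is too small and too sparse triggers two alerts.
for message in check_dataset_health(
    last_loaded_at=datetime.now(timezone.utc) - timedelta(minutes=30),
    row_count=4_200,
    null_rate=0.12,
):
    print("ALERT:", message)
```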

Metadata Management

Another critical component of data governance is metadata management, which focuses on maintaining consistent definitions of data across systems. This consistency is important to ensure data flows smoothly across integrated solutions and that everyone has a shared understanding of the data.

The framework should include details on data definition, data security, data usage, and data lineage. In doing so, it should make it possible to clearly identify and classify all types of data in a standardized way across the organization.
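
One lightweight way to picture this is a shared catalog entry per dataset. The fields below are illustrative assumptions rather than a standard schema, but they show how a consistent definition can travel with the data across systems:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    """A minimal, shared definition of a dataset used across integrated systems."""
    name: str
    description: str
    owner: str                      # data owner or steward responsible for it
    classification: str             # e.g. "public", "internal", "sensitive"
    source_system: str              # where the data originates (lineage)
    allowed_uses: list = field(default_factory=list)

catalog = {
    "customer_orders": DatasetMetadata(
        name="customer_orders",
        description="One row per confirmed order, deduplicated daily",
        owner="sales-data-stewards@example.com",
        classification="internal",
        source_system="orders_raw",
        allowed_uses=["reporting", "forecasting"],
    )
}
```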

Data Stewardship

Data stewardship is the practice that guarantees your organization’s data is accessible, usable, secure, and trustworthy. While the data governance strategy determines your organization’s goals, risk tolerance, security standards, and strategic data needs to set high-level policies, data stewardship focuses on making sure those policies get implemented.

To achieve this follow-through, data stewardship assigns clear roles and responsibilities for various initiatives outlined in the strategy. 

What are the Key Roles Involved in Data Governance?

Data governance programs can only succeed if they have clearly defined roles and responsibilities. As a result, it’s important to identify the right people within your organization to take on this ownership and establish their roles in the program.

Specifically, every data governance program requires people in three critical roles, each of which must be filled with qualified individuals who understand their specific responsibilities and how they contribute to the bigger picture. These roles include:

Chief Data Officer

The Chief Data Officer is the data governance leader. This person is responsible for overseeing the entire program, including enforcing and implementing all policies and procedures and leading the data committee and data stewards.

Data Committee

The data committee is a group of individuals that sets data governance policies and procedures, including rules for how data gets used and who can access it. They also resolve any disputes that arise regarding data usage or its role within the organization. The committee’s purpose is to promote data quality and ensure that data owners and data stewards have what they need at every point in the data lifecycle to do their jobs effectively.

Data Stewards

The data stewards are responsible for carrying out the data governance policies set by the data committee. They oversee data, making sure everything adheres to policies throughout the entire data lifecycle from creation to archival. The data stewards also train new staff on policies. 

In some cases, data stewards might also be the data owners. In other cases, those might be two separate groups. Either way, the data owners are the people who manage the systems that create and house data.

What are Data Governance Best Practices?

When it comes to getting data governance off the ground (or improving what your organization already has in place) there are several best practices to consider:

Get Buy-In from the Top

As with any initiative, buy-in for data governance needs to start at the top. This top-down buy-in is important to make sure that everyone in the organization adheres to data governance policies and that those who are in a position to influence that acceptance understand the importance of your work. 

To achieve this buy-in, share with executives how your data governance plan can help them realize their strategic objectives. The more you can highlight the advantages of the program and how it relates to their work, the easier it will be.

Communicate Often

Communication beyond top-level executives is essential to effective data governance. To ensure everyone is aware of what your team is doing around data governance and why it matters, make a list of everyone in the organization who has a stake in or would be affected by that work. 

Then establish regular communications to share updates about program changes, roadblocks, and successes. That way, everyone knows where to go for updates and can stay informed on a regular basis.

Combine Long-Term Goals with Short-Term Gains

When it comes to data governance, you won’t be able to tackle everything at once. Instead, it should be a continuous effort to support data-driven decision-making and open up new opportunities for people throughout the organization. 

As a result, your long-term plan needs to include smaller, short-term initiatives that you can weave into the day-to-day operations of your company for immediate wins. This approach ensures that you see progress quickly and can help uncover any potential roadblocks faster. It also opens the door to new ideas that can even improve your long-term plan.

Assign Clear Responsibility – and Train People Accordingly

You can’t simply assign someone the role of data steward and hope for the best. You need to make sure that anyone playing a role in your data governance program takes their part seriously, and that means you need to take their responsibilities just as seriously. 

This means you need to be clear about the responsibilities that data stewards and data committee members take on and offer training to support those people in their data governance roles. This training should cover everything from why data governance is so important to what’s expected from people in different roles.

Audit Process Adoption

A big part of data governance involves developing processes for how the company will handle data, especially when it comes to sensitive information. Auditing how these processes actually operate in your organization and how well people are adopting them can be extremely informative as you continue to make program improvements.

That’s because even the best processes won’t do your organization any good if no one adheres to them.

Regularly Measure Progress and Keep an Eye Toward Improvements

Finally, remember that data governance is not a one-and-done effort. It’s a program that must continuously evolve based on factors like adoption and changing business needs. 

As a result, it’s important to regularly check in on how policies are faring and the impact on data quality. The more you can measure that progress, the better you can manage the situation and identify what’s working well and what needs to be improved.

What is Dark Data and How it Causes Data Quality Issues

Databand
May 31, 2022

We’re all guilty of holding onto something that we’ll never use. Whether it’s old pictures on our phones, items around the house, or documents at work, there’s always that glimmer of thought that we just might need it one day.

It turns out businesses are no different. But in the business setting, it’s not called hoarding, it’s called dark data.

Simply put, dark data is any data that an organization acquires and stores during regular business activities that doesn’t actually get used in any way. No one analyzes it to gain insights, drive decisions, or make money – it just sits there.

Unfortunately, dark data can prove quite troublesome, causing a host of data quality issues. But it doesn’t have to be all bad. This article will explore what you need to know about dark data, including:

  • What is dark data
  • Why dark data is troublesome
  • How dark data causes data quality issues
  • The upside of dark data
  • Top tips to shine the light on dark data

What is dark data?

According to Gartner, dark data is “the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships, and direct monetizing). Storing and securing data typically incurs more expense (and sometimes greater risk) than value.”

And most companies have a lot of dark data – about 90% of most organizations’ data, according to Carnegie Mellon University.

How did this happen? A lot of organizations operate in silos, which can easily lead to situations in which one department could make use of data that another department captures but doesn’t even know that data exists (and therefore never uses it).

We also got here because not too long ago we had the idea that it’s valuable to store all the information we could possibly capture in a big data lake. As data became more and more valuable, we thought maybe one day that data would be important – so we should hold onto it. Plus, data storage is cheap, so it was okay if it sat there totally unused. 

But maybe it’s not as good an idea as we once thought.

Why is dark data troublesome?

If the data could be valuable one day and data storage is cheap, what’s the big issue with it? There are three problems, to start:

1) Liability

Often with dark data, companies don’t even know exactly what type of data they’re storing. And they could very well (and often do) have personally identifiable information sitting there without even realizing it. This could come from any number of places, such as transcripts from audio conversations with customers or data shared online. But regardless of the source, storing this data is a liability. 

A host of global privacy laws have been introduced over the past several years, and they apply to all data – even data that’s sitting unused in analytics repositories. As a result, it’s risky for companies to store this data (even if they’re not using it) because there’s a big liability if anyone accesses that information.

2) Accumulated costs

Data storage at the individual level might be cheap, but as companies continue to collect and store more and more data over time, those costs add up. Some studies show companies spend anywhere from $10,000 to $50,000 on storage for dark data alone.

Getting rid of data that serves no purpose can therefore lead to significant cost savings that can be reallocated to any number of more constructive (and less troublesome) purposes.

3) Opportunity costs

Finally, many companies are losing out on opportunities by not using this data. So while it’s good to get rid of data that’s actually not usable – due to risks and costs – it pays to first analyze what data is available.

In taking a closer look at their dark data, many companies may very well find that they can better manage and use that data to drive some interesting (and valuable!) insights about their customers or their own internal metrics. Hey, it’s worth a look.

How dark data causes data quality issues

Interestingly enough, sometimes dark data gets created because of data quality issues. Maybe it’s because incomplete or inaccurate data comes in, and therefore teams know they won’t use it for anything.

For example, perhaps it’s a transcript from an audio recording, but the AI that creates the transcript isn’t quite there yet and the result is rife with errors. Someone keeps the transcript though, thinking that they’ll resolve it at some point. This is an example of how data quality issues can create dark data.

In this way, dark data can often be used to understand the sources of bad data quality and their effects. Far too often, organizations aim to clean poor-quality data but miss what’s causing the issue. And without that understanding, it’s impossible to stop the data quality issue from recurring.

When this happens, the situation becomes very cyclical, because rather than simply purging dark data that sits around without ever getting used, organizations let it continue to sit – and that contributes to growing data quality issues.

Fortunately, there are three steps for data quality management that organizations can take to help alleviate this issue:

  1. Analyze and identify the “as is” situation, including the current issues, existing data standards, and the business impact in order to prioritize the issue.
  2. Prevent bad data from recurring by evaluating the root cause of the issues and applying resources to tackle that problem in a sustainable way.
  3. Communicate often along the way, sharing what’s happening, what the team is doing, the impact of that work, and how those efforts connect to business goals.

The upside of dark data

But for all the data quality issues that dark data can (and, let’s be honest, does) cause, it’s not all bad. As Splunk puts it, “dark data may be one of an organization’s biggest untapped resources.”

Specifically, as data remains an extremely valuable asset, organizations must learn how to use everything they have to their advantage. In other words, that nagging thought that the data just might be useful one day could actually be true. Of course, that’s only the case if organizations actually know what to do with that data… otherwise it will continue to sit around and cause data quality issues.

The key to getting value out of dark data? Shining the light on it by breaking down silos, introducing tighter data management, and, in some cases, not being afraid to let data go.

Top tips to shine the light on dark data

When it comes to handling dark data and potentially using it to your organization’s advantage, there are several best practices to follow:

  1. Break down silos: Remember earlier when we said that dark data often comes about because of silos across teams? One team creates data that could be useful to another, but that other team doesn’t know about it. Breaking down those silos instantly makes that data available to the team that needs it, and suddenly it goes from sitting around to providing immense value.
  2. Improve data management: Next, it’s important to really get a handle on what data exists. This starts by classifying all data within the organization to get a complete and accurate view. From there, teams can begin to organize data better with the goal of making it easier for individuals across teams to find and use what they need.
  3. Introduce a data governance policy: Finally, introducing a data governance policy can help improve the challenge long term. This policy should cover how all data coming in gets reviewed and offer clear guidelines for what should be retained (and if so, how it should be organized to maintain clear data management), archived, or destroyed. An important part of this policy is being strict about what data should be destroyed. Enforcing that policy and regularly reviewing practices can help eliminate dark data that will never really be used.
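
A policy like the one described in point 3 can be made enforceable with something as simple as a retention table that tooling checks against. The categories and retention windows below are purely illustrative assumptions:

```python
from datetime import date, timedelta

# Hypothetical retention policy: how long each data category may sit unused.
RETENTION_POLICY = {
    "customer_transcripts": timedelta(days=90),    # review, then destroy
    "clickstream_raw": timedelta(days=365),        # archive after a year
    "financial_records": timedelta(days=365 * 7),  # regulatory retention
}

def retention_action(category, last_used, today=None):
    """Decide whether a dataset should be retained or flagged for archive/destroy."""
    today = today or date.today()
    limit = RETENTION_POLICY.get(category)
    if limit is None:
        return "review"                      # unknown data: classify it first
    return "retain" if today - last_used <= limit else "archive_or_destroy"

print(retention_action("customer_transcripts", last_used=date(2022, 1, 15)))
```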

It’s time to solve the dark data challenge and restore data quality

Dark data is a very real problem. Far too many organizations hold onto data that never gets used, and while it might not seem like a big deal, it is. It can create liabilities, significant storage costs, and data quality issues. It can also lead to missed opportunities due to teams not realizing what data is potentially available to them.

Taking a proactive approach to managing this data can turn the situation around. By shining the light on dark data, organizations can not only reduce liabilities and costs, but also give teams the resources they need to better access data and understand what’s worth saving and what’s not. And doing so will also improve data quality. It’s a no-brainer.

What is Good Data Quality for Data Engineers?

Databand
March 2, 2021

In theory, data quality is everyone’s problem. When it’s poor, it degrades marketing, product, customer success, brand-perception—everything. In theory, everyone should work together to fix it. But that’s in theory. 

In reality, you need someone to take ownership of the problem, investigate it, and tell others what to do. That’s where data engineers come in.

In this guide, the Databand team has compiled a resource for grappling with data quality issues within and around your pipeline – not in theory, but in practice. And that starts with a discussion of what exactly constitutes data quality for data engineers. 

Data quality challenges for data engineers

The data engineer’s perennial challenge? That everyone involved in using the data has a different understanding of what “data” means. And it’s not really those users’ fault.

The further someone is from the source of that data and the data pipelines that carry it, the more they tend to engage in magical thinking about how it can be used, often simply for lack of awareness of what the architecture can actually support. According to one data engineer we talked to when researching this guide, “Business leaders are always asking, ‘Hey, can we look at sales across this product category?’ when on the backend, it’s virtually impossible with the current architecture.”

The importance of observability

Similarly, businesses rely on the data from pipelines they can’t fully observe. Without accurate benchmarks or a seasoned professional who can sense that output values are off, you can be data-driven right off a cliff.

What are the four characteristics of data quality?

While academic conceptions of data quality provide an interesting foundation, we’ve found that for data engineers, it’s different. In diagnosing pipeline data quality issues for dozens of high-volume organizations over the last few years, we’ve learned that engineers need a simpler and more credible map. Only with that map can you begin to conceptualize systems that will keep data in proper order.

We’ve condensed the typical 6-7 data quality dimensions (you will find hundreds of variants online) into just four:

  • Fitness
  • Lineage
  • Governance
  • Stability

We also prefer the term “data health” to “data quality,” because it suggests an ongoing system that must be managed. Without checkups, pipelines can grow sick and stop working.

Dimension 1: Fitness

Is this data fit for its intended use?

The operative word here is “intended.” No two companies’ uses are identical, so fitness is always in the eye of the beholder. To test fitness, take a random sample of records and test how they perform for your intended use.

Within fitness, look at:

  • Accuracy—does the data reflect reality? (Within reason. As they say, all models are wrong. Some are useful.)
  • Integrity—does the fitness remain high through the data’s lifecycle? (It’s a simple equation: Integrity = quality / time)
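
To make the random-sample test described above concrete, here is a minimal sketch; the two fitness rules are hypothetical stand-ins for whatever your intended use actually demands:

```python
import random

def sample_fitness(records, rules, sample_size=100, seed=42):
    """Check a random sample of records against use-specific fitness rules."""
    random.seed(seed)
    sample = random.sample(records, min(sample_size, len(records)))
    failures = [r for r in sample if not all(rule(r) for rule in rules)]
    return 1 - len(failures) / len(sample)   # fraction of sampled records that are fit

# Hypothetical rules for an "orders" use case.
rules = [
    lambda r: r.get("order_total", 0) > 0,          # totals must be positive
    lambda r: r.get("currency") in {"USD", "EUR"},  # only currencies we can report on
]

records = [{"order_total": 25.0, "currency": "USD"},
           {"order_total": -3.0, "currency": "USD"},
           {"order_total": 12.5, "currency": "GBP"}]
print(f"Fitness score: {sample_fitness(records, rules):.0%}")
```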

Dimension 2: Lineage

Where did this data come from? When? Where did it change? Is it where it needs to be?

Lineage is your timeline. It helps you understand whether your data health problem starts with your provider. If it’s fit when it enters your pipeline and unfit when it exits, that’s useful information. 

Within lineage, look at:

  • Source—is my data source provider behaving well? E.g. Did Facebook change an API?
  • Origin—where did the data already in my database come from? E.g. Perhaps you’re not sure who put it there.

Dimension 3: Governance

Can you control it? 

These are the levers you can pull to move, restrict, or otherwise control what happens to your data. It’s the procedural stuff, like loads and transformations, but also security and access. 

Within governance, look at:

  • Data controls—how do we identify which data should be governed and which should be open? What should be available to data scientists and users? What shouldn’t?
  • Data privacy—where is there currently personally identifiable info (PII)? Can we automatically redact PII like phone numbers? Can we ensure that a pipeline that accidentally contains PII fails or is killed? (See the sketch after this list.)
  • Regulation—can we track regulatory requirements, ensure we’re compliant, and prove we’re compliant if a regulator wants to know? (Under GDPR, CCPA, NY SHIELD, etc.)
  • Security—who has access to the data? Can I control it? With enough granularity?
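
As an illustration of the data privacy questions above, here is a minimal sketch that assumes a deliberately narrow, regex-based definition of PII (US-style phone numbers only). It redacts matches and fails a pipeline step if unredacted PII slips through:

```python
import re

# Simplistic US-style phone number pattern; real PII detection is far broader.
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def redact_pii(text):
    """Replace anything that looks like a phone number with a placeholder."""
    return PHONE_PATTERN.sub("[REDACTED]", text)

def assert_no_pii(records):
    """Fail the pipeline (raise) if any record still contains a phone number."""
    for record in records:
        if PHONE_PATTERN.search(record):
            raise ValueError("PII detected in pipeline output; failing the run")

transcripts = ["Customer called from 555-867-5309 about a refund"]
cleaned = [redact_pii(t) for t in transcripts]
assert_no_pii(cleaned)   # passes once redaction has been applied
```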

Dimension 4: Stability

Is the data complete and available in the right frequency?

Your data may be fit, meaning your downstream systems function, but is it as accurate as it could be, and is that consistently the case? If your data is fit, but the accuracy varies widely, or it’s only available in monthly batch updates and you need it hourly, it’s not stable. 

Stability is one of the biggest areas where data observability tools can help. Pipelines are often a black box unless you can monitor what happens inside and get alerts.

To check stability, check against a benchmark dataset. 
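
A minimal sketch of such a benchmark check might look like the following; the benchmark statistics and tolerances are invented for illustration:

```python
# Hypothetical benchmark captured from a known-good run of the pipeline.
BENCHMARK = {"row_count": 100_000, "null_rate_email": 0.02}

# Allow some drift before flagging instability (relative for rows, absolute for nulls).
TOLERANCE = {"row_count": 0.10, "null_rate_email": 0.05}

def check_stability(batch_stats):
    """Compare a new batch against the benchmark and return drift warnings."""
    warnings = []
    row_drift = abs(batch_stats["row_count"] - BENCHMARK["row_count"]) / BENCHMARK["row_count"]
    if row_drift > TOLERANCE["row_count"]:
        warnings.append(f"Row count drifted {row_drift:.0%} from benchmark")
    null_delta = batch_stats["null_rate_email"] - BENCHMARK["null_rate_email"]
    if null_delta > TOLERANCE["null_rate_email"]:
        warnings.append(f"Email null rate rose by {null_delta:.1%}")
    return warnings

print(check_stability({"row_count": 62_000, "null_rate_email": 0.09}))
```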

Within stability, look at:

  • Consistency—does the data going in match the data going out? If it appears in multiple places, does it mean the same thing? Are weird transformations happening at predictable points in the pipeline?
  • Dependability—the data is present when needed. E.g. If I build a dashboard, it behaves properly and I don’t get calls from leadership.
  • Timeliness—is it on time? E.g. If you pay NASDAQ for daily data, are they providing fresh data on a daily basis? Or is it an internal issue?
  • Bias—is there bias in the data? Is it representative of reality? Take, for example, seasonality in the data. If you train a model for predicting consumer buying behavior and you use a dataset from November to December, you’re going to have unrealistically high sales predictions.

Now, bias of this sort isn’t completely imperceptible—some observability platforms (Databand being one of them) have anomaly detection for this reason. When you have seasonality in your data, you have seasonality in your data requirements, and thus seasonality in your data pipeline behavior. You should be able to automatically account for that.
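
As a toy illustration of accounting for seasonality (the volumes and tolerance below are invented), one simple approach is to compare each period against the same period a year earlier rather than the immediately preceding one:

```python
# Hypothetical monthly order volumes, keyed by (year, month).
monthly_orders = {
    (2020, 11): 52_000, (2020, 12): 61_000,
    (2021, 10): 30_500, (2021, 11): 55_500, (2021, 12): 64_000,
}

def seasonal_anomaly(year, month, history, tolerance=0.25):
    """Flag a month only if it deviates sharply from the same month last year."""
    current = history[(year, month)]
    last_year = history.get((year - 1, month))
    if last_year is None:
        return False  # no seasonal baseline to compare against
    return abs(current - last_year) / last_year > tolerance

# November 2021 is up ~7% vs. November 2020: normal seasonality, not an anomaly.
print(seasonal_anomaly(2021, 11, monthly_orders))  # False
```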

Quality data is balanced data

Good data quality for data engineers means having a data pipeline set up to ensure all four data quality dimensions: fitness, lineage, governance, and stability. But you must address all four.

As a data engineer, you cannot tackle one dimension of data quality without tackling them all. That may seem rather inconvenient given that most engineers inherit data pipelines rather than building them from scratch. But such is the reality.

If you optimize for one dimension—say, stability—you may be loading data that hasn’t yet been properly transformed, and fitness can suffer. The data quality dimensions exist in equilibrium.

How to balance all four dimensions of data quality

[Graphic: the four dimensions of data quality – 1. Fitness, 2. Lineage, 3. Governance, 4. Stability]

To achieve a proper balance for data health, you need:

Data quality controls

What systems do you have for manipulating, protecting, and governing your data? With high-volume pipelines, it is not enough to trust and verify.

Data quality testing

What systems do you have for measuring fitness, lineage, governance, and stability? Things will break. You must know where, and why. 

Systems to identify data quality issues

If issues do occur—if a pipeline fails to run, or the result is aberrant—do you have anomaly detection to alert you? Or if PII makes it into a pipeline, does the pipeline auto-fail to protect you from violating regulation?

In short, you need a high level of data observability, paired with the ability to act continuously.

Common data pipeline data quality issues

As a final thought, when you’re diagnosing your data pipeline issues, it’s important to draw a distinction between a problem and its root cause. Your pipeline may have failed to complete. The proximal cause could have been an error in a Spark job. But the root cause? A corruption in the dataset. If you aren’t addressing issues in the dataset, you’ll be forever addressing issues.

Examples of common data pipeline quality issues: 

  • Non-unicode characters
  • Unexpected transforms
  • Mismatched data in a migration or replication process
  • Pipelines missing their SLA, or running late
  • Pipelines that are too resource-intensive or costly
  • Difficulty finding the root cause of issues
  • Errors in a Spark job caused by corruption in a dataset
  • A big change in your data volume or sizes

The more detail you get from your monitoring tool, the better. It’s common to discover proximal causes quickly, but then take days to discover the root cause through a taxing, manual investigation. Sometimes, your pipeline workflow management tool tells you everything is okay, but a quick glance at the output shows you nothing is okay, because the values are all blank. For instance, Airflow may tell you the pipeline succeeded, but no data actually passed through. Your code ran fine—Airflow gives you a green light, you’re good—but on the data level, it’s entirely unfit.
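
A simple guard against this failure mode is a data-level check that runs at the end of the load and fails the task when nothing actually arrived. The sketch below is a plain function you could call from the tail of an Airflow task or wrap in a check step; the table name and cursor are placeholders, not any specific operator’s API:

```python
def assert_rows_loaded(cursor, table, min_rows=1):
    """Fail loudly if the freshly loaded table is empty or suspiciously small.

    Raising here turns a silent 'green' run into a visible task failure,
    so an empty output can no longer masquerade as success.
    """
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    row_count = cursor.fetchone()[0]
    if row_count < min_rows:
        raise ValueError(f"{table} has {row_count} rows after load; expected >= {min_rows}")
    return row_count

# Called from the end of a load task, e.g.:
# assert_rows_loaded(conn.cursor(), "analytics.daily_orders", min_rows=1000)
```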

Constant checkups and the ability to peer deeply into your pipeline are what let you strike the right balance of fitness, lineage, governance, and stability to produce high-quality data. And high-quality data is how you support an organization in practice, not just in theory.

Databand.ai is a unified data observability platform built for data engineers. Databand.ai centralizes your pipeline metadata so you can get end-to-end observability into your data pipelines, identify the root cause of health issues quickly, and fix the problem fast. Sign up for a free trial or request a demo to learn more.