What is Dark Data and How it Causes Data Quality Issues

Databand
2022-05-31 17:11:25

We’re all guilty of holding onto something that we’ll never use. Whether it’s old pictures on our phones, items around the house, or documents at work, there’s always that glimmer of thought that we just might need it one day.

It turns out businesses are no different. But in the business setting, it’s not called hoarding, it’s called dark data.

Simply put, dark data is any data that an organization acquires and stores during regular business activities that doesn’t actually get used in any way. No one analyzes it to gain insights, drive decisions, or make money – it just sits there.

Unfortunately, dark data can prove quite troublesome, causing a host of data quality issues. But it doesn’t have to be all bad. This article will explore what you need to know about dark data, including:

  • What is dark data
  • Why dark data is troublesome
  • How dark data causes data quality issues
  • The upside of dark data
  • Top tips to shine the light on dark data

What is dark data?

According to Gartner, dark data is “the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships, and direct monetizing). Storing and securing data typically incurs more expense (and sometimes greater risk) than value.”

And most companies have a lot of dark data. Carnegie Mellon University finds that about 90% of most organizations’ data is dark data.

How did this happen? A lot of organizations operate in silos, and this can easily lead to situations in which one department could make use of the data that another department captures, but isn’t even aware that the data is getting captured (and therefore isn’t using it).

We also got here because not too long ago we had the idea that it’s valuable to store all the information we could possibly capture in a big data lake. As data became more and more valuable, we thought maybe one day that data would be important – so we should hold onto it. Plus, data storage is cheap, so it was okay if it sat there totally unused. 

But maybe it’s not as good an idea as we once thought.

Why is dark data troublesome?

If the data could be valuable one day and data storage is cheap, what’s the big issue with it? There are three problems to start:

1) Liability

Often with dark data, companies don’t even know exactly what type of data they’re storing. And they could very well (and often do) have personally identifiable information sitting there without even realizing it. This could come from any number of places, such as transcripts from audio conversations with customers or data shared online. But regardless of the source, storing this data is a liability. 

A host of global privacy laws have been introduced over the past several years, and they apply to all data – even data that’s sitting unused in analytics repositories. As a result, it’s risky for companies to store this data (even if they’re not using it) because there’s a big liability if anyone accesses that information.
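To make the liability point concrete, here is a minimal sketch of how a team might scan stored text (such as call transcripts) for personally identifiable information. The patterns, record IDs, and function names are all hypothetical and cover only emails and US-style phone numbers; a real scan would need far broader coverage.

```python
import re

# Hypothetical patterns for illustration only; real PII detection needs
# far broader coverage (names, addresses, national IDs, and so on).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(text):
    """Return the sorted names of the PII patterns that match the text."""
    return sorted(
        name for name, pattern in PII_PATTERNS.items() if pattern.search(text)
    )


def flag_records(records):
    """Map each record id to the PII types found in it, skipping clean records."""
    flagged = {}
    for record_id, text in records.items():
        hits = find_pii(text)
        if hits:
            flagged[record_id] = hits
    return flagged
```

Running `flag_records` over a dictionary of record IDs and their text returns only the records that need a closer look, which is a reasonable first pass before deciding what to keep.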

2) Accumulated costs

Data storage at the individual level might be cheap, but as companies continue to collect and store more and more data over time, those costs add up. Some studies show companies spend anywhere from $10,000 to $50,000 on storage for dark data alone.

Getting rid of data that’s not used for any purpose could then lead to significant cost savings, which can be re-allocated to any number of more constructive (and less troublesome) purposes.
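As a rough illustration of how cheap storage adds up, the sketch below totals the spend when stored volume grows by a fixed amount each month. The function name and every input are hypothetical; none of the figures come from the studies mentioned above.

```python
def cumulative_storage_cost(initial_tb, monthly_growth_tb, cost_per_tb_month, months):
    """Total spend over `months` when stored volume grows by a fixed amount each month."""
    total = 0.0
    stored_tb = initial_tb
    for _ in range(months):
        # Pay for everything currently stored, then add this month's new data.
        total += stored_tb * cost_per_tb_month
        stored_tb += monthly_growth_tb
    return total


# Hypothetical inputs: 10 TB today, 2 TB added per month, $20 per TB per month.
one_year_cost = cumulative_storage_cost(10, 2, 20.0, 12)
```

Because nothing is ever deleted, each month’s bill includes every previous month’s data, so the total grows faster than the monthly intake alone would suggest.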

3) Opportunity costs

Finally, many companies are losing out on opportunities by not using this data. So while it’s good to get rid of data that’s actually not usable – due to risks and costs – it pays to first analyze what data is available.

In taking a closer look at their dark data, many companies may very well find that they can better manage and use that data to drive some interesting (and valuable!) insights about their customers or their own internal metrics. Hey, it’s worth a look.

How dark data causes data quality issues

Interestingly enough, sometimes dark data gets created because of data quality issues. Maybe it’s because incomplete or inaccurate data comes in, and therefore teams know they won’t use it for anything.

For example, perhaps it’s a transcript from an audio recording, but the AI that creates the transcript isn’t quite there yet and the result is rife with errors. Someone keeps the transcript though, thinking that they’ll resolve it at some point. This is an example of how data quality issues can create dark data.

In this way, dark data can often be used to understand the sources of bad data quality and their effects. Far too often, organizations aim to clean poor quality data, but they miss what’s causing the issue. And without that understanding, it’s impossible to keep the data quality issue from recurring.

When this happens, the situation becomes very cyclical, because rather than simply purging dark data that sits around without ever getting used, organizations let it continue to sit – and that contributes to growing data quality issues.

Fortunately, there are three steps for data quality management that organizations can take to help alleviate this issue:

  1. Analyze and identify the “as is” situation, including the current issues, existing data standards, and the business impact in order to prioritize the issue.
  2. Prevent bad data from recurring by evaluating the root cause of the issues and applying resources to tackle that problem in a sustainable way.
  3. Communicate often along the way, sharing what’s happening, what the team is doing, the impact of that work, and how those efforts connect to business goals.

The upside of dark data

But for all the data quality issues that dark data can (and, let’s be honest, does) cause, it’s not all bad. As Splunk puts it, “dark data may be one of an organization’s biggest untapped resources.”

Specifically, as data remains an extremely valuable asset, organizations must learn how to use everything they have to their advantage. In other words, that nagging thought that the data just might be useful one day could actually be true. Of course, that’s only the case if organizations actually know what to do with that data… otherwise it will continue to sit around and cause data quality issues.

The key to getting value out of dark data? Shining the light on it by breaking down silos, introducing tighter data management, and, in some cases, not being afraid to let data go.

Top tips to shine the light on dark data

When it comes to handling dark data and potentially using it to your organization’s advantage, there are three best practices to follow:

  1. Break down silos: Remember earlier when we said that dark data often comes about because of silos across teams? One team creates data that could be useful to another, but that other team doesn’t know about it. Breaking down those silos instantly makes that data available to the team that needs it, and suddenly it goes from sitting around to providing immense value.
  2. Improve data management: Next, it’s important to really get a handle on what data exists. This starts by classifying all data within the organization to get a complete and accurate view. From there, teams can begin to organize data better with the goal of making it easier for individuals across teams to find and use what they need.
  3. Introduce a data governance policy: Finally, introducing a data governance policy can help address the challenge long term. This policy should cover how all incoming data gets reviewed and offer clear guidelines for what should be retained (and if so, how it should be organized to maintain clear data management), archived, or destroyed. An important part of this policy is being strict about what data should be destroyed. Enforcing that policy and regularly reviewing practices can help eliminate dark data that will never really be used.
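
The retain, archive, or destroy decision in a governance policy can be sketched as a simple rule based on how long a dataset has gone unused. The thresholds, the function name, and the rule that idle data containing PII is destroyed rather than archived are all assumptions for illustration, not a recommended policy.

```python
from datetime import date, timedelta

# Hypothetical thresholds; a real policy would set these per data class
# and account for legal retention requirements.
ARCHIVE_AFTER = timedelta(days=365)
DESTROY_AFTER = timedelta(days=3 * 365)


def retention_action(last_accessed, today, contains_pii=False):
    """Decide whether a dataset should be retained, archived, or destroyed."""
    idle = today - last_accessed
    if idle >= DESTROY_AFTER:
        return "destroy"
    if idle >= ARCHIVE_AFTER:
        # Assumed rule: idle data holding PII is destroyed rather than archived.
        return "destroy" if contains_pii else "archive"
    return "retain"
```

Writing the rule down this explicitly, even in a few lines, makes it easier to enforce the policy consistently and to review it on a regular schedule.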

It’s time to solve the dark data challenge and restore data quality

Dark data is a very real problem. Far too many organizations hold onto data that never gets used, and while it might not seem like a big deal, it is. It can create liabilities, significant storage costs, and data quality issues. It can also lead to missed opportunities due to teams not realizing what data is potentially available to them.

Taking a proactive approach to managing this data can turn the situation around. By shining the light on dark data, organizations can not only reduce liabilities and costs, but also give teams the resources they need to better access data and understand what’s worth saving and what’s not. And doing so will also improve data quality. It’s a no-brainer.