
The Impact of Bad Data and Why Observability is Now Imperative

2022-06-02 14:09:07

Think the impact of bad data is just a minor inconvenience? Think again. 

Bad data cost Unity, a publicly traded video game software development company, $110 million.

And that’s only the tip of the iceberg.

The Impact of Bad Data: A Case Study on Unity

Unity stock dropped 37% on May 11, 2022, after the company announced its first-quarter earnings, despite strong revenue growth, decent margins, good customer growth, and continued high performance in dollar-based net expansion. 

But there was one data point in Unity’s earnings that was not as positive. 

The company also shared that its Operate revenue growth was still up but had slowed due to a fault in its platform that reduced the accuracy of its Audience Pinpointer tool.

The fault in Unity’s platform? Bad data.

Unity ingested bad data from a large customer into its machine learning algorithm, which helps place ads and allows users to monetize their games. This not only resulted in decreased growth but also ruined the algorithm, forcing the company to fix it to remedy the problem going forward.

The company’s management estimated the impact on the business at approximately $110 million in 2022.

Unity Isn’t Alone: The Impact of Bad Data is Everywhere

Unity isn’t the only company that has felt the impact of bad data deeply.

Take Twitter.

On April 25, 2022, Twitter accepted a deal to be purchased by Tesla and SpaceX founder Elon Musk. A mere 18 days later, Musk shared that the deal was “on hold” while he verified the number of fake accounts and bots on the platform. 

What ensued demonstrates the deep impact of bad data on this extremely high-profile deal for one of the world’s most widely used speech platforms. Notably, Twitter has battled this data problem for years. Twitter first acknowledged fake accounts during its 2013 IPO; in 2016 a troll farm used more than 50,000 bots to try to sway the US presidential election; and in 2017, Twitter admitted to overstating its user base for several years.

Now, this data issue is coming to a head, with Musk investigating Twitter’s claim that fake accounts represent less than 5% of the company’s user base and angling to reduce the previously agreed upon purchase price as a result.

Twitter, like Unity, is another high-profile example of the impact of bad data, but examples like these are everywhere – and they cost companies millions of dollars. 

Gartner estimates that bad data costs companies nearly $13 million per year, although many don’t even realize the extent of the impact. Meanwhile, Harvard Business Review finds that knowledge workers spend about half of their time fixing data issues. Just imagine how much effort they could devote elsewhere if issues weren’t so prevalent.

Overall, bad data can lead to missed revenue opportunities, inefficient operations, and poor customer experiences, among other issues that add up to that multi-million dollar price tag.

Why Observability is Now Imperative for the C-Suite

The fact that bad data costs companies millions of dollars each year is bad enough. The fact that many companies don’t even realize this because they don’t measure the impact is potentially even worse. After all, how can you ever fix something of which you’re not fully aware?

Getting ahead of bad data issues requires data observability, which encompasses the ability to understand the health of data in your systems. Data observability is the only way that organizations can truly understand not only the impact of any bad data but also the causes of it – both of which are imperative to fixing the situation and stemming the impact.

It’s also important to embed data observability at every point possible, with the goal of finding issues earlier in the pipeline rather than later, because the further those issues progress, the more difficult (and more expensive) they become to fix.

Critically, this observability must be an imperative for C-suite leaders, as bad data can have a serious impact on company revenue (just ask Unity and Twitter). Making data observability a priority for the C-suite will help the entire organization – not just data teams – rally around this all-important initiative and make sure it becomes everyone’s responsibility.

This focus on end-to-end data observability can ultimately help:

  • Identify data issues earlier on in the data pipeline to stem their impact on other areas of the platform and/or business
  • Pinpoint data issues more quickly after they pop up to help arrive at solutions faster
  • Understand the extent of data issues that exist to get a complete picture of the business impact
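As a concrete sketch of the first point, an early-pipeline check can be as simple as validating each batch at ingestion before it moves downstream. The function name, fields, and thresholds below are illustrative assumptions, not any particular product’s API:

```python
# Illustrative sketch: validate a batch at ingestion, before it moves downstream.
# All names and thresholds here are hypothetical examples.

def check_batch(rows, required_fields, max_null_ratio=0.05):
    """Return a list of issues found in a batch of dict records."""
    issues = []
    if not rows:
        return ["batch is empty"]
    for field in required_fields:
        nulls = sum(1 for r in rows if r.get(field) is None)
        ratio = nulls / len(rows)
        if ratio > max_null_ratio:
            issues.append(f"{field}: {ratio:.0%} null values exceeds threshold")
    return issues

batch = [{"user_id": 1, "spend": 9.99}, {"user_id": 2, "spend": None}]
print(check_batch(batch, ["user_id", "spend"]))
```

Flagging the batch here, rather than after it lands in the warehouse, is what keeps the issue from spreading to other areas of the platform or business.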

In turn, this visibility can help companies recover more revenue faster by taking the necessary steps to mitigate bad data. Hopefully, the end result is a fix before the issues end up costing millions of dollars. And the only way to make that happen is if everyone, starting with the C-suite, prioritizes data observability.


What is Dark Data and How it Causes Data Quality Issues

2022-05-31 17:11:25

We’re all guilty of holding onto something that we’ll never use. Whether it’s old pictures on our phones, items around the house, or documents at work, there’s always that glimmer of thought that we just might need it one day.

It turns out businesses are no different. But in the business setting, it’s not called hoarding, it’s called dark data.

Simply put, dark data is any data that an organization acquires and stores during regular business activities that doesn’t actually get used in any way. No one analyzes it to gain insights, drive decisions, or make money – it just sits there.

Unfortunately, dark data can prove quite troublesome, causing a host of data quality issues. But it doesn’t have to be all bad. This article will explore what you need to know about dark data, including:

  • What is dark data
  • Why dark data is troublesome
  • How dark data causes data quality issues
  • The upside of dark data
  • Top tips to shine the light on dark data

What is dark data?

According to Gartner, dark data is “the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships, and direct monetizing). Storing and securing data typically incurs more expense (and sometimes greater risk) than value.”

And most companies have a lot of dark data. Carnegie Mellon University finds that about 90% of most organizations’ data is dark data.

How did this happen? A lot of organizations operate in silos, and this can easily lead to situations in which one department would make use of the data that another department captures, but they’re not even aware that data is getting captured (and therefore they’re not using it).

We also got here because not too long ago we had the idea that it’s valuable to store all the information we could possibly capture in a big data lake. As data became more and more valuable, we thought maybe one day that data would be important – so we should hold onto it. Plus, data storage is cheap, so it was okay if it sat there totally unused. 

But maybe it’s not as good an idea as we once thought.

Why is dark data troublesome?

If the data could be valuable one day and data storage is cheap, what’s the big issue with it? There are three problems to start:

1) Liability

Often with dark data, companies don’t even know exactly what type of data they’re storing. And they could very well (and often do) have personally identifiable information sitting there without even realizing it. This could come from any number of places, such as transcripts from audio conversations with customers or data shared online. But regardless of the source, storing this data is a liability. 

A host of global privacy laws have been introduced over the past several years, and they apply to all data – even data that’s sitting unused in analytics repositories. As a result, it’s risky for companies to store this data (even if they’re not using it) because there’s a big liability if anyone accesses that information.

2) Accumulated costs

Data storage at the individual level might be cheap, but as companies continue to collect and store more and more data over time, those costs add up. Some studies show companies spend anywhere from $10,000 to $50,000 in storage just for dark data alone.

Getting rid of data that’s not used for any purpose could then lead to significant cost savings – savings that can be re-allocated to any number of more constructive (and less troublesome) purposes.

3) Opportunity costs

Finally, many companies are losing out on opportunities by not using this data. So while it’s good to get rid of data that’s actually not usable – due to risks and costs – it pays to first analyze what data is available.

In taking a closer look at their dark data, many companies may very well find that they can better manage and use that data to drive some interesting (and valuable!) insights about their customers or their own internal metrics. Hey, it’s worth a look.

How dark data causes data quality issues

Interestingly enough, sometimes dark data gets created because of data quality issues. Maybe it’s because incomplete or inaccurate data comes in, and therefore teams know they won’t use it for anything.

For example, perhaps it’s a transcript from an audio recording, but the AI that creates the transcript isn’t quite there yet and the result is rife with errors. Someone keeps the transcript though, thinking that they’ll resolve it at some point. This is an example of how data quality issues can create dark data.

In this way, dark data can often be used to understand the sources and effects of bad data quality. Far too often, organizations aim to clean poor-quality data, but they miss what’s causing the issue. And without that understanding, it’s impossible to stop the data quality issue from recurring.

When this happens, the situation becomes very cyclical, because rather than simply purging dark data that sits around without ever getting used, organizations let it continue to sit – and that contributes to growing data quality issues.

Fortunately, there are three steps for data quality management that organizations can take to help alleviate this issue:

  1. Analyze and identify the “as is” situation, including the current issues, existing data standards, and the business impact in order to prioritize the issue.
  2. Prevent bad data from recurring by evaluating the root cause of the issues and applying resources to tackle that problem in a sustainable way.
  3. Communicate often along the way, sharing what’s happening, what the team is doing, the impact of that work, and how those efforts connect to business goals.
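Step 1 can even be started in a few lines of code: profile incoming records and count issue types per source, so the worst root causes surface first for prioritization. The field names and issue taxonomy below are invented for illustration:

```python
from collections import Counter

def profile_issues(records):
    """Step 1 sketch: quantify the 'as is' situation by counting
    issue types per source, so the worst root causes surface first."""
    counts = Counter()
    for rec in records:
        if rec.get("value") is None:
            counts[(rec.get("source"), "missing_value")] += 1
        elif not isinstance(rec["value"], (int, float)):
            counts[(rec.get("source"), "wrong_type")] += 1
    return counts.most_common()

records = [
    {"source": "crm", "value": None},
    {"source": "crm", "value": None},
    {"source": "billing", "value": "n/a"},
]
print(profile_issues(records))
```

Grouping issues by source is what connects step 1 (the “as is” picture) to step 2 (tackling the root cause rather than repeatedly cleaning symptoms).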

The upside of dark data

But for all the data quality issues that dark data can (and, let’s be honest, does) cause, it’s not all bad. As Splunk puts it, “dark data may be one of an organization’s biggest untapped resources.”

Specifically, as data remains an extremely valuable asset, organizations must learn how to use everything they have to their advantage. In other words, that nagging thought that the data just might be useful one day could actually be true. Of course, that’s only the case if organizations actually know what to do with that data… otherwise it will continue to sit around and cause data quality issues.

The key to getting value out of dark data? Shining the light on it by breaking down silos, introducing tighter data management, and, in some cases, not being afraid to let data go.

Top tips to shine the light on dark data

When it comes to handling dark data and potentially using it to your organization’s advantage, there are several best practices to follow:

  1. Break down silos: Remember earlier when we said that dark data often comes about because of silos across teams? One team creates data that could be useful to another, but that other team doesn’t know about it. Breaking down those silos instantly makes that data available to the team that needs it, and suddenly it goes from sitting around to providing immense value.
  2. Improve data management: Next, it’s important to really get a handle on what data exists. This starts by classifying all data within the organization to get a complete and accurate view. From there, teams can begin to organize data better with the goal of making it easier for individuals across teams to find and use what they need.
  3. Introduce a data governance policy: Finally, introducing a data governance policy can help improve the challenge long term. This policy should cover how all data coming in gets reviewed and offer clear guidelines for what should be retained (and if so, how it should be organized to maintain clear data management), archived, or destroyed. An important part of this policy is being strict about what data should be destroyed. Enforcing that policy and regularly reviewing practices can help eliminate dark data that will never really be used.
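The retention side of tip 3 can be enforced mechanically. A minimal sketch, assuming a simple last-accessed heuristic and an invented one-year window (real policies would also weigh legal holds and privacy obligations):

```python
from datetime import date, timedelta

def retention_review(datasets, today, unused_after_days=365):
    """Sketch of a governance sweep: classify each dataset as 'keep'
    or 'review for deletion' based on its last access date."""
    cutoff = today - timedelta(days=unused_after_days)
    decisions = {}
    for name, last_accessed in datasets.items():
        decisions[name] = "keep" if last_accessed >= cutoff else "review for deletion"
    return decisions

# Hypothetical catalog of datasets and when each was last accessed:
datasets = {
    "orders": date(2022, 5, 1),
    "legacy_call_transcripts": date(2019, 1, 15),
}
print(retention_review(datasets, today=date(2022, 6, 1)))
```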

It’s time to solve the dark data challenge and restore data quality

Dark data is a very real problem. Far too many organizations hold onto data that never gets used, and while it might not seem like a big deal, it is. It can create liabilities, significant storage costs, and data quality issues. It can also lead to missed opportunities due to teams not realizing what data is potentially available to them.

Taking a proactive approach to managing this data can turn the situation around. By shining the light on dark data, organizations can not only reduce liabilities and costs, but also give teams the resources they need to better access data and understand what’s worth saving and what’s not. And doing so will also improve data quality. It’s a no-brainer.

The Data Value Chain: Data Observability’s Missing Link

2022-05-18 13:18:01

Data observability is an exploding category. It seems like every week another data observability tool receives funding, an existing tool announces expanded functionality, or a new product in the category is dreamt up. After a bit of poking around, you’ll notice that many of them claim to do the same thing: end-to-end data observability. But what does that really mean, and what’s a data value chain?

For data analysts, end-to-end data observability feels like having monitoring capabilities for their warehouse tables — and if they’re lucky, they have some monitoring for the pipelines that move the data to and from their warehouse as well.

The story is a lot more complicated for many other organizations that are more heavily skewed towards data engineering. For them, that isn’t end-to-end data observability. That’s “The End” data observability. Meaning: this level of observability only gives visibility into the very end of the data’s lifecycle. This is where the data value chain becomes an important concept.

For many data products, data quality is determined from the very beginning: when data is first extracted and enters your system. Therefore, shifting data observability left of the warehouse is the best way to move your data operations from a reactive data quality management framework to a proactive one.


What is the Data Value Chain?

When people think of data, they often think of it as a static object: a point on a chart, a number in a dashboard, or a value in a table. But the truth is that data is constantly changing and transforming throughout its lifecycle. And that means what you define as “good data quality” is different for each stage of that lifecycle.

“Good” data quality in a warehouse might be defined by its uptime. Going to the preceding stage in the life cycle, that definition changes. Data quality might be defined by its freshness and format. Therefore, your data’s quality isn’t some static binary. It’s highly dependent on whether things went as expected in the preceding step of its lifecycle.
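A stage-specific check along these lines might, before the warehouse, define “good” purely as freshness of delivery. A minimal sketch, with an assumed six-hour freshness window:

```python
from datetime import datetime, timedelta

def is_fresh(last_delivery, now, max_age=timedelta(hours=6)):
    """Pre-warehouse quality sketch: at this stage 'good' means the
    delivery arrived recently enough, not that a table is queryable."""
    return (now - last_delivery) <= max_age

now = datetime(2022, 5, 18, 12, 0)
print(is_fresh(datetime(2022, 5, 18, 9, 0), now))   # delivery three hours ago
print(is_fresh(datetime(2022, 5, 17, 12, 0), now))  # delivery a day ago
```

A warehouse-stage check would look entirely different (uptime, row counts, query errors), which is exactly the point: quality is defined per stage.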

Shani Keynan, our Product Director, calls this concept the data value chain.

“From the time data is ingested, it’s moving and transforming. So, only looking at the data tables in your warehouse or your data’s source, or only looking at your data pipelines, it just doesn’t make a lot of sense. Looking only at one of those, you don’t have any context.

You need to look at the data’s entire journey. The thing is, when you’re a data-intensive company who’s using lots of external APIs and data sources, that’s a large part of the journey. The more external sources you have, the more vulnerable you are to changes you can’t predict or control. Covering the hard ground first, at the data’s extraction, makes it easier to catch and resolve problems faster since everything downstream depends on those deliveries.”

The question of whether data will drive value for your business is defined by a series of If-Then statements:

  1. If data has been ingested correctly from our data sources, then our data will be delivered to our lake as expected.
  2. If data is delivered & grouped in our lake as expected, then our data will be able to be aggregated & delivered to our data warehouse as expected.
  3. If data is aggregated & delivered to our data warehouse as expected, then the data in our warehouse can be transformed.
  4. If data in our warehouse can be transformed correctly, then our data will be able to be queried and will provide value for the business.
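The four If-Then statements above can be read as a short-circuiting chain of checks: the first broken link invalidates everything downstream. A toy sketch (all flags and stage names are illustrative):

```python
# Sketch: the four If-Then statements as a chain of checks. The chain
# stops at the first broken link, mirroring how an upstream failure
# invalidates every downstream stage. All names are illustrative.

def run_value_chain(batch):
    stages = [
        ("ingestion", lambda b: b.get("ingested_ok", False)),
        ("lake delivery", lambda b: b.get("in_lake", False)),
        ("warehouse load", lambda b: b.get("in_warehouse", False)),
        ("transformation", lambda b: b.get("transformed", False)),
    ]
    for name, check in stages:
        if not check(batch):
            return f"chain broken at: {name}"
    return "data is queryable and can drive value"

print(run_value_chain({"ingested_ok": True, "in_lake": True,
                       "in_warehouse": False}))
```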

Let us be clear: this is an oversimplification of the data’s life cycle. That said, it illustrates how having observability only for the tables in your warehouse & the downstream pipelines leaves you in a position of blind faith.

In the ideal world, you would be able to set up monitoring capabilities & data health checkpoints everywhere in your system. This is no small project for most data-intensive organizations; some would even argue it’s impractical.

Realistically, one of the best places to start your observability initiative is at the beginning of the data value chain; at the data extraction layer.

Data Value Chain + Shift-left Data Observability

If you are one of these data-driven organizations, how do you set your data team up for success?

While it’s important to have observability of the critical “checkpoints” within your system, the most important checkpoint you can have is at the data collection process. There are two reasons for that:

#1 – Ingesting data from external sources is one of the most vulnerable stages in your data model.

As a data engineer, you have some degree of control over your data and your architecture. But what you don’t control is your external data sources. When you have a data product whose functioning depends on external data arriving on time, that dependency can be extremely painful.

This is best highlighted in an example. Let’s say you are running a large real estate platform called Willow. Willow is a marketplace where users can search for homes and apartments to buy & rent across the United States.

Willow’s goal is to give users all the information they need to make a buying decision; things like listing price, walkability scores, square footage, traffic scores, crime & safety ratings, school system ratings, etc.

In order to calculate “Traffic Score” for just one state in the US, Willow might need to ingest data from 3 external data sources. There are 50 states, so that means you suddenly have 150 external data sources you need to manage. And that’s just for one of your metrics.

Here’s where the pain comes in: You don’t control these sources. You don’t get a say whether they decide to change their API to better fit their data model. You don’t get to decide whether they drop a column from your dataset. You can’t control if they miss one of their data deliveries and leave you hanging.

All of these factors put your carefully crafted data model at risk. All of them can break your pipelines downstream that follow strictly coded logic. And there’s really nothing you can do about it except catching it as early as you can.
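Catching such changes early often starts with a schema comparison at extraction. A minimal sketch, with hypothetical column names standing in for one of Willow’s traffic-score sources:

```python
def detect_schema_drift(expected_columns, delivered_columns):
    """Sketch: compare an external delivery against the columns your
    downstream logic depends on. Column names are hypothetical."""
    expected, delivered = set(expected_columns), set(delivered_columns)
    return {
        "dropped": sorted(expected - delivered),
        "added": sorted(delivered - expected),
    }

# A source silently renames a column your pipelines depend on:
drift = detect_schema_drift(
    ["road_id", "avg_speed", "congestion_index"],
    ["road_id", "avg_speed_kph", "congestion_index"],
)
print(drift)
```

Run at the extraction layer, a check like this turns an unannounced API change into an alert before any strictly coded downstream logic breaks.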

Having data observability in your data warehouse doesn’t do much to solve this problem. It might alert you that there is bad data in your warehouse, but by that point, it’s already too late.

This brings us to our next point…

#2 – It makes the most sense for your operational flow.

In many large data organizations, data in the warehouse is automatically utilized in business processes. If something breaks your data collection processes, bad data is populated into your product dashboards and analytics, and stakeholders have no way of knowing that the data they’re being served is no good.

This can lead to some tangible losses. Imagine if there was a problem calculating a Comparative Analysis of home sale prices in the area. Users may lose trust in your data and stop using your product.

In this situation, what does your operational flow for incident management look like?

You receive complaints from business stakeholders or customers, then you have to invest a lot of engineering hours to perform root cause analysis, fix the issue, and backfill the data. All the while, consumer trust has gone down and SLAs have already been missed. DataOps is in a reactive position.


When you have data observability for your ingestion layer, there’s still a problem in this situation, but the way DataOps can handle this situation is very different:

  • You know that there will be a problem.     
  • You know exactly which data source is causing the problem.
  • You can project how this will affect downstream processes. You can make sure everyone downstream knows that there will be a problem so you can prevent the bad data from being used in the first place.
  • Most importantly, you can get started resolving the problem early & begin working on a way to prevent that from happening again.

You cannot achieve that level of prevention when your data observability starts at your warehouse.

Bottom Line: Time To Shift Left

DataOps is learning many of the same hard lessons DevOps did. Just as application observability is most effective when shifted left, the same applies to data operations. It saves money; it saves time; it saves headaches. If you’re ingesting data from many external data sources, your organization cannot afford to focus all its efforts on the warehouse. You need real end-to-end data observability. And luckily, there’s a great data observability platform made to do just that.


How to ensure data quality, value, and reliability

2022-02-23 09:46:43

The quality of data downstream relies directly on data quality in the first mile. As early as ingestion, accurate and reliable data will ensure that the data used downstream for analytics, visualization, and data science will be of high value. 

For a business, this makes all the difference between benefiting from the data and having it play second fiddle when making decisions. In this blog post, we describe the importance of data quality, how to audit and monitor your data, and how to get your leadership, colleagues, and board on board.

Topics covered:

  • Proactive Data Observability
  • Auditing Data for Quality
  • Data Quality or Data Value?
  • How to Approach the C-level and the Board
  • How to Train Internally
  • The Curse of the “Other”
  • Best Practices for Getting Started: Ensuring Data Quality Across the Enterprise

Proactive Data Observability

Managing data is like running a marathon. Many factors determine the end result, and it is a long process. But if a runner trips and hurts her ankle at the first mile, she will not successfully complete the marathon. Similarly, if data isn’t monitored as early as ingestion, the rest of the pipeline will be negatively impacted.

How can we ensure data governance during this first mile of the data journey?

Data enters the pipeline from various sources: external APIs, data drops from outside providers, pulling from a database, etc. Monitoring data at the ingestion points ensures data engineers can gain proactive observability of the data coming in.

This enables them to wrangle and fix data to assure the process is healthy and reliable from the get-go.

By gaining proactive observability of data pipelines, data engineers can:

  • Trust the data
  • Easily identify breaking points
  • Quickly fix issues before they arrive at the warehouse or dashboard
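One lightweight way to get that proactive signal is to wrap every ingestion point so each run reports its row count or failure. A sketch, with an invented source name and plain stdlib logging:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")

def monitored_ingest(source_name, fetch):
    """Sketch: wrap an ingestion point so every run reports row count
    and failures, giving a proactive signal before data moves on."""
    try:
        rows = fetch()
        logging.info("source=%s rows=%d status=ok", source_name, len(rows))
        return rows
    except Exception as exc:
        logging.error("source=%s status=failed error=%s", source_name, exc)
        raise

# Hypothetical ingestion point; in practice `fetch` would call an API
# or read a data drop:
rows = monitored_ingest("partner_api", lambda: [{"id": 1}, {"id": 2}])
```

Even this much turns a silent breakage into a visible signal at the point of entry, which is where trust in the data starts.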

Auditing Data for Quality

Data engineers who want to review their pipeline or audit and monitor an external data source can use the following questions during their evaluation:

  1. What’s the coverage scope?
  2. How is the data being tracked?
  3. Is there a master data reference that includes requirements and metadata?
  4. Is the customer defined in the right way?
  5. Is there a common hierarchy?
  6. Do the taxonomies leverage the business requirements?
  7. Are geographies correctly set?
  8. Are there any duplicates?
  9. Was the data searched before creating new entities?
  10. Is the data structured to enable seamless integrations and interoperability?
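Several of these questions can be answered programmatically during an audit. For question 8, a duplicate check might look like the following sketch (the key fields are illustrative):

```python
from collections import Counter

def find_duplicates(records, key_fields):
    """Sketch for audit question 8: report keys that appear more than
    once, based on the fields that should uniquely identify a record."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in records)
    return {key: n for key, n in counts.items() if n > 1}

records = [
    {"customer_id": 7, "region": "EU"},
    {"customer_id": 7, "region": "EU"},
    {"customer_id": 8, "region": "US"},
]
print(find_duplicates(records, ["customer_id", "region"]))
```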

Now that we’ve covered how data engineers can approach data quality, let’s see how to get buy-in from additional stakeholders in the enterprise.

Data Quality or Data Value?

Data engineers often talk about the quality of data. However, by changing the conversation to the value of the data, additional stakeholders in the organization could be encouraged to take a more significant part in the data process. This is important for getting attention, resources, and ongoing assistance.

To do so, we recommend talking about how the data aligns with business objectives. Otherwise, external stakeholders might think the conversation revolves only around cleaning up data.

4 Criteria for Determining Data Value – for Engineers and the Business:

  • Relevancy – Does the data meet the business objective?
  • Coverage – Does the data cover the entire market, enabling the enterprise to put it into play?
  • Structure – Is the data structured so the enterprise can use it?
  • Accuracy – Is the data complete and correct?
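One simple way to make these four criteria concrete in a stakeholder conversation is a scorecard. The 1–5 scale and equal weighting below are illustrative choices, not a standard:

```python
def data_value_score(scores):
    """Sketch: summarize a dataset's value on the four criteria above.
    The 1-5 scale and equal weights are illustrative, not a standard."""
    criteria = ("relevancy", "coverage", "structure", "accuracy")
    missing = [c for c in criteria if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    return sum(scores[c] for c in criteria) / len(criteria)

# Hypothetical assessment of one dataset:
print(data_value_score(
    {"relevancy": 5, "coverage": 3, "structure": 4, "accuracy": 4}
))
```

A single number is easier to put in front of the C-level than a list of pipeline defects, which is the whole point of shifting the conversation from quality to value.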

How to Approach the C-level and the Board

By shifting the conversation to the value of the data rather than its quality, the C-level and the board can be encouraged to invest more resources into the data pipeline. Here’s how to approach them:

  1. Begin with the reasons why managing data is of strategic importance to your enterprise. Show how data can help execute strategic intentions.
  2. Explain how managing and analyzing data can help the company get to where it needs to go. Show how data can grow, improve, and protect the business. You can weave in the four criteria from before to emphasize your points.
  3. Connect the data to specific departments. Show how data can help improve operational efficiency, grow sales, and mitigate risk. No other department can claim to help grow, improve, and protect all departments to the same extent that data engineering can.
  4. Do not focus on the process and the technology – otherwise, you will have a very small audience.

How to Train Internally

In addition to the company’s leadership, it’s also important to get the wider company on board. This will help with data analysis and monitoring. Data engineers often need the company’s employees to participate in the ongoing effort of maintaining data. For example, salespeople are required to fill out multiple fields in a CRM when adding a new opportunity.

We recommend investing time in people management, i.e., training and ensuring everyone is on the same page regarding the importance of data quality. For example, explaining how identifying discrepancies accurately can help discover a business anomaly (rather than a data anomaly, which could happen if people don’t consistently and comprehensively update data).

The Curse of the “Other”

Data value auditing is crucial because it directly impacts the ability to make decisions on top of it. If you need an example to convince employees to participate in data management, remind them of “the curse of the ‘other’.”

When business units like marketing, product, and sales monitor dashboards, and a big slice is titled “other”, they do not have all the data they need and their decision-making is impaired. This is the result of a lack of data management and data governance.

Best Practices for Getting Started: Ensuring Data Quality Across the Enterprise

How can data engineers turn data quality from an abstract theory into practice? Let’s tie up everything we’ve covered into an actionable plan. 

Step 1 – Audit the Data Situation

First, assess which domains should be covered and how well they are being managed. This includes data types like: 

  • Relationship data: with customers, vendors, partners, prospects, citizens, patients, and clients
  • Brand data: products, services, offerings, banners, etc. 

Identify the mistakes at the different pipeline stages, starting from ingestion.

Step 2 – Showcase the Data Pipeline

Present the data situation to the various stakeholders. Show how the data is managed from the entry point to the end product. Then, explain how the current data value is impacting their decisions. Present the error points and suggest ways to fix them.

Step 3 – Prioritize Issues to Fix

Build a prioritized plan for driving change. Determine which issues to fix first. Include identifying sources and how they send data, internal data management, and training employees. Get buy-in to the plan, and proceed to execute it.


Ensuring data quality is the responsibility of data engineers and the entire organization. Monitoring data quality starts at the source. However, by getting buy-in from employees and management, data engineers can ensure they will get the resources and attention needed to monitor and fix data issues throughout the pipeline, and help the business grow.
To try out Databand, the observability platform for data quality and value, click here.

Ensuring data quality in healthcare: challenges and best practices

2022-02-11 14:50:11

The healthcare industry is very data-intensive. Multiple actors and organizations are transmitting large amounts of sensitive information. Data engineers in healthcare are tasked with ensuring data quality and reliability. This blog provides insights into how data engineers can proactively ensure data quality and prevent common errors by building the right data infrastructure and monitoring as early as ingestion.

This blog post is based on the podcast episode “Proactive Data Quality for Data-Intensive Organizations” with Johannes Leppae, Sr. Data Engineer at Komodo Health.

The Role of Data in Healthcare

The healthcare industry is made up of multiple institutions, service providers, and professionals. These include suppliers, doctors, hospitals, healthcare insurance companies, biopharma companies, laboratories, pharmacies, caregivers, and more. Each of these players creates, consumes, and relies on data for their operations.

High-quality and accurate data is essential for providing quality healthcare at low costs. For example, when running clinical trials, data is required to analyze patient populations, profile sites of care, alert when intervention is needed, and monitor the patient journey (among other needs).

Quality data will ensure a clinical trial is successful, resulting in better and faster patient treatment. However, erroneous or incomplete data could yield biased or noisy results, which could have severe consequences for patients.

Data Quality Challenges in Healthcare

Data engineers in healthcare need to reliably and seamlessly link together different types of sources and data. Then, they need to analyze the data to ensure it is complete and comprehensive so the downstream users have complete visibility.

However, the complexity of the healthcare system and the sensitivity of its data pose several data quality challenges for data engineers:

  • Fragmentation – Data is divided between many data assets, each containing a small piece of information.
  • Inconsistency – Data is created differently at each source. This includes variance between interfaces, filetypes, encryptions, and more.
  • Maintaining privacy – In many cases, like clinical trials, data needs to be de-identified to protect patients and ensure results are not biased.
  • Source orchestration – Ingesting data from multiple sources creates a lot of overhead when monitoring data.
  • Domain knowledge – Processing and managing healthcare data requires industry-specific knowledge since the data is often subject to medical business logic.

Ensuring Data Quality as Early as Ingestion

To overcome these challenges, data engineers need to find methods for monitoring errors. Data engineers can ensure that any issues are captured early by getting the data ready at the ingestion point. This prevents corrupt data from reaching downstream users, assures regulation compliance, and ensures data arrives on time. Early detection also saves data engineers from having to rerun pipelines when issues are found.

How big is the detection difference? Early detection enables identifying issues within hours. Later in the pipeline, the same issue could take days to detect.

One recommended way to ensure and monitor data quality is through structure and automation. The ingestion pipeline includes the following steps (among others):

  • Extraction of data files from external sources
  • Consolidating any variations
  • Pipeline orchestration
  • Raw data ingestion 
  • Unifying file formats 
  • Validation

To enable automation and scalability, it is recommended to create a unified structure across all pipelines and enforce systematic conventions for each stage. 

For example, collecting metadata like source identification, environment, data stream, and more. The conventions will be checked in the validation step before moving the data files downstream.
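As a sketch of that validation step (the naming convention and fields here are hypothetical, not Komodo Health's actual scheme), enforced filename conventions might be checked with something like:

```python
import re

# Hypothetical convention: <source>_<environment>_<stream>_<YYYYMMDD>.<ext>
# e.g. "claims_prod_pharmacy_20220211.csv"
FILENAME_PATTERN = re.compile(
    r"^(?P<source>[a-z]+)_(?P<env>dev|staging|prod)_"
    r"(?P<stream>[a-z]+)_(?P<date>\d{8})\.(?P<ext>csv|parquet)$"
)

def validate_filename(filename: str) -> dict:
    """Return the file's metadata if it matches the convention, else raise."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"File does not follow naming convention: {filename}")
    return match.groupdict()
```

Files that fail the check are held back before they move downstream, which is exactly where you want a convention violation to surface.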

How to Deal with Data Quality Issues

The challenges of data-intensive ingestion sometimes require creative solutions. In the podcast this blog post is based on, Johannes describes the following scenario his data engineering team deals with constantly.

A common delivery issue in healthcare is late data deliveries. Komodo Health's systems had logic that matched the file's date with the execution date. However, since files were often sent late, the dates didn't match and the pipeline wouldn't find the file, forcing the team to rerun it manually. To overcome this, the data engineering team changed the logic so that the pipeline picked up files based on the file's own timestamp rather than the execution date. Late deliveries were then captured automatically, without manual intervention.
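A minimal sketch of that fix might look like the following (the filename format and three-day lookback window are assumptions for illustration, not Komodo Health's actual logic):

```python
from datetime import date, timedelta

def files_for_run(available_files: list[str], execution_date: date,
                  lookback_days: int = 3) -> list[str]:
    """Pick up any file whose embedded date falls within a lookback window,
    instead of requiring an exact match with the pipeline's execution date."""
    window = {
        (execution_date - timedelta(days=offset)).strftime("%Y%m%d")
        for offset in range(lookback_days + 1)
    }
    # Assumes filenames like "delivery_20220211.csv"
    return [f for f in available_files
            if f.split("_")[-1].split(".")[0] in window]
```

With this logic, a file stamped two days earlier than the run date is still picked up, so a late delivery no longer requires a manual rerun.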

In some cases, however, fixing issues requires going back to the source and asking its team to fix them. To minimize these cases and the friction they might cause, it's recommended to create agreements that ensure everyone is on the same page when setting up the process. The agreement should include expectations, delivery standards, and SLAs, among other things.

You can also make suggestions that will help with deliveries. For example, when deliveries have multiple files, ask the source to add a manifest file that states the number of files, the number of records for each file, and the last file being sent. 
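A manifest check along those lines might be sketched like this (the manifest structure is hypothetical; a real delivery agreement would pin down the exact schema):

```python
def validate_delivery(manifest: dict, delivered: dict) -> list[str]:
    """Compare a delivery against its manifest.

    manifest: {"file_count": int, "files": {name: expected_record_count}}
    delivered: {name: actual_record_count}
    Returns a list of human-readable problems (empty means the batch is good).
    """
    problems = []
    if len(delivered) != manifest["file_count"]:
        problems.append(
            f"Expected {manifest['file_count']} files, got {len(delivered)}")
    for name, expected in manifest["files"].items():
        if name not in delivered:
            problems.append(f"Missing file: {name}")
        elif delivered[name] != expected:
            problems.append(
                f"{name}: expected {expected} records, got {delivered[name]}")
    return problems
```

An empty result means the batch is complete; anything else can trigger an alert before the data moves downstream.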

Catching issues and bad batches of data in time is critical, since they can significantly impact downstream users. Caution is especially important in healthcare, where analyses and life-and-death decisions are made based on the data.

Choosing the Right Tools for Healthcare Data Engineering

Data engineers in healthcare face multiple challenges and require tools to assist them. While some prefer homegrown tools that support flexibility, buying a tool can relieve some of the effort and free engineers up for dealing with data quality issues.

When choosing a tool, it’s recommended to:

  1. Determine non-negotiables – features and capabilities the tool has to support.
  2. Decide on nice-to-haves – abilities that could help and make your life easier.
  3. Understand the roadmap – to see which features are expected to be added and determine how much influence you have over it.

Whichever tool you choose, make sure to see a demo of it. To see a demo of Databand, which enables data quality monitoring as early as ingestion, click here.
To learn more about data-intensive organizations and hear the entire episode this blog post was based on, visit our podcast, here.

The ideal DataOps org structure

2021-08-27 15:13:37

The ideal data operations (DataOps) org structure

An organization’s external communications tend to reflect its internal ones. That’s what Melvin Conway taught us, and it applies to data engineering. If you don’t have a clearly defined data operations or “DataOps” team, your company’s data outputs will be just as messy as its inputs. 

For this reason, you probably need a data operations team, and you need one organized correctly.

conways law org structure

So first let’s back up—what is data operations?

Data operations is the process of assembling the infrastructure to generate and process data, as well as maintain it. It’s also the name of the team that does (or should do) this work—data operations, or DataOps. What does DataOps do? Well, if your company maintains data pipelines, launching one team under this moniker to manage those pipelines can bring an element of organization and control that’s otherwise lacking. 

DataOps isn’t just for companies that sell their data, either. Recent history has proven you need a data operations team no matter the provenance or use of that data. Internal customer or external customer, it’s all the same. You need one team to build (or let’s be real, inherit and then rebuild) the pipelines. They should be the same people (or, for many organizations, person) who implement observability and tracking tools and monitor the data quality across its four attributes. 

And of course, the people who built the pipeline should be the same people who get the dreaded PagerDuty alert when a dashboard is down—not because it’s punitive, but because it’s educational. When they have skin in the game, people build differently. It’s good incentive and allows for better problem solving and speedier resolution.

Last but not least, that data operations team needs a mission—one that transcends simply “moving the data” from point A to point B. And that is why the “operations” part of their title is so important.

Data operations vs data management—what’s the difference?

Data operations is building resilient processes to move data for its intended purpose. All data should move for a reason. Often, that reason is revenue. If your data operations team can’t trace a clear line from that end objective, like the sales teams having better forecasts and making more money, to their pipeline management activities, you have a problem. 

Without operations, problems will emerge as you scale:

  • Data duplication
  • Troubled collaboration
  • Waiting for data 
  • Band-aids that will scar
  • Discovery issues
  • Disconnected tools
  • Logging inconsistencies
  • Lack of process
  • Lack of ownership & SLAs

If there’s a disconnect, you’re simply practicing plain old data management. Data management is the rote maintenance aspect of data operations. Which, while crucial, is not strategic. When you’re in maintenance mode you’re hunting down the reason for a missing column or pipeline failure and patching it up, but you don’t have time to plan and improve.

Your work becomes true “operations” when you transform trouble tickets into repeatable fixes. Like, for example, you find a transformation error coming from a partner, and you email them to get it fixed before it hits your pipeline. Or you implement an “alerts” banner on your executives’ dashboard that tells them when something is wrong so they know to wait for the refresh. Data operations, just like developer operations, aims to put repeatable, testable, explainable, intuitive systems in place that ultimately reduce effort for all.

That’s data operations vs data management. And so the question then becomes, how should that data operations team be structured?

Organizing principles for a high-performing data operations team structure

So let’s return to where we began—talking about how your system outputs reflect your organizational structure. If your data operations team is an “operations” team in name only, and mostly only maintains, you’ll probably receive a forever ballooning backlog of requests. You’ll rarely have time to come up for air to make long-term maintenance changes, like switching out a system or adjusting a process. You’re stuck in Jira or ServiceNow response hell. 

If, on the other hand, you’ve founded (or relaunched) your data operations team with strong principles and structure, you produce data that reflects your high-quality internal structure. Good data operations team structures produce good data.

Principle 1: Organize in full-stack functional work groups

Gather a data engineer, a data scientist, and an analyst into a group or “pod” and have them address things together they might have addressed separately. Invariably, these three perspectives lead to better decisions, less fence-tossing, and more foresight. For instance, rather than the data scientist writing a notebook that doesn’t make sense and passing it to the engineer only to create a back-and-forth loop, they and the analyst can talk through what they need and the engineer can explain how it should be done.
Lots of data operations teams already work this way. “Teams should aim to be staffed as ‘full-stack,’ so the necessary data engineering talent is available to take a long view of the data’s whole life cycle,” say Krishna Puttaswamy and Suresh Srinivas at Uber. And at the travel site Agoda, the engineering team uses pods for the same reason.

Principle 2: Publish an org chart for your data operations team structure

Do this even if you’re just one person. Each role is a “hat” that somebody must wear. To have a high-functioning data operation team, it helps to know which hat is where, and who’s the data owner for what. You also need to reduce each individual’s span of control to a manageable level. Maybe drawing it out like this helps you make the case for hiring. 

What is data operations team management? A layer of coordination on top of your pod structures that plays the role of servant leader: they project-manage, coach, and unblock. Ideally, they are the most knowledgeable people on the team.

We’ve come up with our own ideal structure, pictured, though it’s a work in progress. What’s important to note is there’s one single person leading with a vision for the data (the VP). Below them are multiple leaders guiding various data disciplines towards that vision (the Directors), and below them, interdisciplinary teams who ensure data org and data features work together. (Credit to our Data Solution Architect, Michael Harper, for these ideas.)

data operations org structure chart

Principle 3: Publish a guiding document with a DataOps North Star metric

Picking a North Star metric helps everyone involved understand what they’re supposed to optimize for. Without such an agreement, you get disputes. Maybe your internal data “customers” complain that the data is slow. But it’s slow because you know their unstated desire is quality first, and that’s what you’re optimizing for.

Common DataOps North Stars: Data quality, automation (repeatable processes), and process decentralization (aka end-user self-sufficiency).

Once you have a North Star, you can also decide on sub-metrics or sub-principles that point to that North Star, which is almost always a lagging indicator. 

Principle 4: Build in some cross-functional toe-stepping

Organize the team so different groups within it must frequently interact and ask other groups for things. These interactions can prove priceless. “Where the data scientists and engineers learn about how each other work, these teams are moving faster and producing more,” says Amir Arad, Senior Engineering Manager at Agoda. 

Amir says he finds one of the hidden values to a little cross-functional redundancy is you get people asking questions nobody on that team had thought to ask. 

“The engineering knowledge gap is actually kinda cool. It can lead to them asking us to simplify,” says Amir. “They might say, ‘But why can’t we do that?’ And sometimes, we go back and realize we don’t need that code or don’t need that server. Sometimes non-experts bring new things to the table.”

Principle 5: Build for self-service

Just as with DevOps, the best data operations teams are invisible, and constantly working to make themselves redundant. Rather than play the hero who likes to swoop in to save everybody, but ultimately makes the system fragile, play the servant leader. Aim to, as Lao Tzu put it, lead people to the solution in a way that gets them thinking, “We did it ourselves.” 

Treat your data operations team like a product team. Study your customer. Keep a backlog of fixes. Aim to make the tool useful enough that the data is actually used. 

Principle 6: Build in full data observability from day one

There is no such thing as “too early” for data monitoring and observability. The analogy that’s often used to excuse putting off monitoring is, “We’re building the plane while in flight.” Think about that visual. Doesn’t that tell you everything you need to know about your long-term survival? A much better analogy is plain old architecture. The longer you wait to assemble a foundation, the more costly it is to put in, and the more problems the lack of one creates.

Read: Data observability: Everything you need to know

Principle 7: Secure executive buy-in for long-term thinking

The decisions you make now with your data infrastructure will, as General Maximus put it, “Echo in eternity.” Today’s growth hack is tomorrow’s gargantuan, data-transforming internal system chaos nightmare. You need to secure executive support to make inconvenient but correct decisions, like telling everyone they need to pause the requests because you need a quarter to fix things.

Principle 8: Use the “CASE” method (with attribution)

CASE stands for “copy and steal everything,” a tongue-in-cheek way of saying, don’t build everything from scratch. There are so many useful microservices and open-source offerings today. Stand on the shoulders of giants and focus on building the 40% of your pipeline that actually needs to be custom, and doing it well.

If you do nothing else today, do this

Go have a look at the tickets in your backlog. How often are you reacting to rather than preempting problems? How many of the problems you’ve addressed had a clearly identifiable root cause? How many were you able to fix permanently? The more you preempt, the more you resemble a true data operations team. And, the more helpful you’ll find a data observability tool. Full visibility can help you make the transition from simply maintaining to actively improving. 

Teams that actively improve their structure actively improve their data. Internal harmony leads to external harmony, in a connection that’d make Melvin Conway proud.

Apache Spark use cases for DataOps in 2021

2021-08-17 10:56:07

Apache Spark is a powerful data processing solution, and use cases for Apache Spark are near limitless. Over the last decade, it has become core to big data architecture. Expanding your headcount and your team’s knowledge in Spark is a necessity as data organizations adapt to market needs.

As the data industry matures, so do its tools. The meteoric rise in popularity of Databricks over its open-source origin clearly shows the overarching trend in the industry: the need for Apache Spark is growing, and the teams that use it are becoming more sophisticated.

apache spark data science analytics graph databand ai ml trend
The use case for Apache Spark is rooted in Big Data

For organizations that create and sell data products, fast data processing is a necessity. Their bottom line depends on it.

Science Focus estimates Google, Facebook, Microsoft, and Amazon store at least 1,200 petabytes of information. The amount of data they have collected is unthinkable, and for them, it’s mostly inaccessible. Even running on state-of-the-art tools, their data infrastructure cannot process and make meaningful use of all of the stored information.

That’s Big Data. Being able to keep pace with the processing all that collected data requires, and doing so quickly and accurately, means these companies can make their data products (i.e., platforms, algorithms, tools, widgets, etc.) more valuable to their users.

Though, you don’t necessarily need to have millions of users to need Spark. You just need to work with large datasets. Smaller data-driven organizations that have high standards for their data quality SLAs also use Spark to deliver more accurate data, faster. This data then powers their machine learning products, their analytical products, and other data-driven products.

Spark is a powerful solution for some organizations yet overkill for others. For those organizations, Apache Spark use cases are limited because the volume of data they process isn’t large enough, and the timeliness in which data must be delivered isn’t urgent enough to warrant the cost of computation.

When a DataOps team isn’t handling Big Data, it can’t justify building out a dedicated engineering team, and it can’t justify using specialized tools like Spark. The added complexity and connections to your infrastructure just leave more room for error. In that situation, every time data is extracted, passed to a cluster, computed, combined, and stored, you open your pipeline to another opportunity for failures and bugs that are hard to catch.

How does Apache Spark work?

Let’s go over the basics of Spark before we start talking about use cases. And to do that, we should start with how Apache Spark came to be.

In the early 2000s, the amount of data being created started outpacing the amount that could be processed. To clear this bottleneck, Hadoop was created, based on the MapReduce design pattern. At a very high level, this design pattern divides a dataset into small pieces and “maps” them to worker nodes (which, in Hadoop’s case, read and write data on disk via HDFS, the Hadoop Distributed File System) to process the data in batches, then “reduces” the partial results into an overall outcome.
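The map-then-reduce flow can be sketched in plain Python as a toy word count (single-process, with none of Hadoop's actual distribution; in practice the map and reduce phases run on separate worker nodes):

```python
from collections import Counter
from functools import reduce

def map_phase(chunk: str) -> Counter:
    # Each "worker" counts words in its own small piece of the dataset.
    return Counter(chunk.split())

def reduce_phase(a: Counter, b: Counter) -> Counter:
    # Partial results from the workers are merged into an overall outcome.
    return a + b

def word_count(chunks: list[str]) -> Counter:
    return reduce(reduce_phase, map(map_phase, chunks), Counter())
```

The key property is that each chunk can be processed independently, which is what lets the real thing scale across many machines.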

databand diagram chart example data source design pattern

This worked well for a time, but as the volumes of data and the demand for greater processing speeds grew, the need for a new solution grew. Enter: Apache Spark.

apache spark hadoop databand hdfs diagram chart result

Apache Spark followed the same principle of distributed processing but achieved it in a different way. Spark distributes data partitions in memory (RAM) rather than on disk via HDFS, which means a job doesn’t require reading and writing partitions to disk at every step. This can make Apache Spark up to 100x faster than Hadoop, and it brought data teams closer to real-time data processing.

There are more nuances to what makes Spark so useful, but let’s not lose focus. We’re here to learn about Apache Spark use cases for data products.

Apache Spark use cases for DataOps Initiatives

You can do a lot with Spark, but in this article, we’re going to talk about two use cases that are shaping the industry today:

  • Productizing machine learning models
  • Decentralized data organizations and data meshing

Let’s talk about each one of these use cases, and why they matter for data products in more detail.

Productizing ML models

Machine learning programs are booming as organizations begin an investment arms race, looking for a way to get an edge in their market.

According to Market Research Future, the global machine learning market is projected to grow from $7.3B in 2020 to $30.6B in 2024. And it’s easy to see why. If implemented correctly, the ROI of a high-performing ML product can range from around 2 to 5 times the cost.

That said, there’s a big gap between successful implementation and wasted investments. 9 out of 10 data science projects fail to make it to production because of the risk of bad performance and lack of accessibility to critical data. Even for companies at the forefront of the field, like Microsoft and Tesla, machine learning projects present a catastrophic risk if mismanaged.

Apache Spark was created to attempt to bridge that gap, and while it hasn’t eliminated all the barriers for entry, it has allowed for the proliferation of ML data products to continue.

Spark provides a general machine learning library that is designed for simplicity, scalability, and easy integration with other tools

MLlib, Apache Spark’s general machine learning library, has algorithms for Supervised and Unsupervised ML which can scale out on a cluster for classification, clustering, and collaborative filtering. Some of these algorithms are also applicable to streaming data and can help provide sentiment analysis, customer segmentation, and predictive intelligence.

One of the main advantages of using Apache Spark for machine learning is its end-to-end capabilities. When building out an ML pipeline, the data engineer needs to cleanse, process, and transform data into the required format for machine learning. Then, data scientists can use MLlib or an external ML library, like TensorFlow or PyTorch, to apply ML algorithms and distribute the workload. Finally, analysts can use Spark for collecting metrics for performance scoring.

This helps data engineers and scientists solve and iterate their machine learning models faster because they can run an almost entirely end-to-end process on just Spark.

Data Meshing to democratize your data organization

Organizations are becoming more data-driven in their philosophy, but their data architecture is lagging behind. Many still use centralized and highly siloed data architectures with owners who aren’t collaborating or communicating. Hence the overwhelming majority of nascent data products never make it to production.

To create scalable and profitable data products, it’s essential that every data practitioner across your organization is able to collaborate with each other and access the raw data they need.

A solution to this, as first defined by Zhamak Dehghani in 2019, is a data mesh. A data mesh is a data platform architecture that uses a domain-oriented, self-service design. In traditional monolithic data infrastructures, one centralized data lake handles the consumption, storage, transformation, and output of data. A data mesh supports distributed, domain-specific data consumers and views “data-as-a-product” with each domain owning their data pipelines. Each domain and its associated data assets are connected by a universal interoperability layer that applies syntax and data standards across the distributed domains.

databand platform domain data internal pipeline infra engineer

This shift in architecture philosophy harkens back to the transition software engineering went through from centralized applications to microservices. Data meshes enable greater autonomy and flexibility for data owners, greater data experimentation, and faster iterations while lessening the burden on data engineers to field the needs of every data consumer through a single pipeline.

What does this have to do with Apache Spark?

There are two main objections to implementing the data mesh model. One is the need for these domains to have the data engineering skills to ingest, clean, and aggregate data on their own. The other is the concern about duplicated effort, redundant or competing infrastructure, and competing standards for data quality.

Apache Spark is great for solving that first problem. Organizations that have already built out a data mesh platform have used Databricks — which runs on Spark — to ingest data from their data-infra-as-a-platform layer to their domain-specific pipelines. Additionally, Spark is great for helping these self-service teams build out their own pipelines so they can test and iterate on their experiments without being blocked by engineering.

Many in the data industry find the idea of a data mesh interesting but worry that the added autonomy of a data mesh introduces new risks to data health. Often, they decide this model isn’t right for their organization.

It’s not an unfounded fear. A data mesh needs a system for conducting scalable, self-serve observability to go along with it. According to Zhamak, some of those capabilities include:

  • Data product versioning
  • Data product schema
  • Unified data logging
  • Data product lineage
  • Data product monitoring/alerting/logging
  • Data product quality metrics (collection and sharing)

Our product, Databand, plays very nicely with the idea of a data mesh. It unifies observability, so each domain can use the tools they need, but still be able to answer questions like:

  • Is my data accurate?
  • Is my data fresh?
  • Is my data complete?
  • What is the downstream impact of changes to pipelines and pipeline performance?

Being able to answer those questions across your entire tech stack (especially a decentralized one) would allow data organizations to really reap the benefits of this new paradigm.
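As an illustrative sketch (plain Python, not Databand's actual implementation), freshness and completeness checks like those might look like:

```python
from datetime import datetime, timedelta

def check_freshness(last_updated: datetime, max_age: timedelta,
                    now: datetime) -> bool:
    """Is the dataset fresh, i.e. updated within the allowed window?"""
    return now - last_updated <= max_age

def check_completeness(rows: list[dict], required_fields: list[str]) -> float:
    """Fraction of rows with every required field populated."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(field) not in (None, "") for field in required_fields)
        for row in rows
    )
    return complete / len(rows)
```

Running checks like these at each domain boundary, and sharing the results centrally, is the kind of unified observability a data mesh depends on.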

Distribution left unchecked can spell problems for your data health

Apache Spark is all about distribution. Whether you’re distributing ownership of pipelines across your organization or you’re distributing a computational workload across your cluster, Apache Spark can help make that process more efficient.

That said, the need for observability we talked about in the last section applies just as much to traditional uses of Spark. That’s because distribution adds additional steps to the end-to-end lifecycle.

Spark divides up the computation task, sends partitions to the cluster, computes each piece on separate workers, combines those outcomes, and sends them to the next phase of the pipeline lifecycle. That’s a complex process. Each added step to your pipeline adds an opportunity for error and complicates troubleshooting and root cause analysis.
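Those steps can be sketched in plain Python (a single-process stand-in for what Spark does across a cluster):

```python
def run_distributed(data: list[int], num_partitions: int) -> int:
    """Toy version of Spark's flow: partition, compute per partition, combine."""
    # 1. Divide the task: split the dataset into partitions.
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # 2. Compute each partition (on a cluster, these run in parallel
    #    on separate workers).
    partial_results = [sum(part) for part in partitions]
    # 3. Combine the outcomes and hand off to the next pipeline stage.
    return sum(partial_results)
```

Each of the three numbered stages is a place where a real pipeline can fail, which is why each one is worth observing in its own right.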

So while the efficiency of Spark is worth it for some organizations, it’s also important to have a system of observability set up to manage data health and governance as data passes through your pipelines.

After all, what’s the point of running a workload faster if the outcome is wrong?