The Impact of Bad Data and Why Observability is Now Imperative
Think the impact of bad data is just a minor inconvenience? Think again.
Bad data cost Unity, a publicly-traded video game software development company, $110 million.
And that’s only the tip of the iceberg.
The Impact of Bad Data: A Case Study on Unity
Unity stock dropped 37% on May 11, 2022, after the company announced its first-quarter earnings, despite strong revenue growth, decent margins, good customer growth, and continued high performance in dollar-based net expansion.
But there was one data point in Unity’s earnings that was not as positive.
The company also shared that its Operate revenue growth was still up but had slowed due to a fault in its platform that reduced the accuracy of its Audience Pinpointer tool.
Unity ingested bad data from a large customer into its machine learning algorithm, which helps place ads and allows users to monetize their games. This not only resulted in decreased growth but also ruined the algorithm, forcing the company to fix it to remedy the problem going forward.
The company’s management estimated the impact on the business at approximately $110 million in 2022.
Unity Isn’t Alone: The Impact of Bad Data is Everywhere
Unity isn’t the only company that has felt the impact of bad data deeply.
On April 25, 2022, Twitter accepted a deal to be purchased by Tesla and SpaceX founder Elon Musk. A mere 18 days later, Musk shared that the deal was “on hold” while he sought to verify the number of fake accounts and bots on the platform.
What ensued demonstrates the deep impact of bad data on this extremely high-profile deal for one of the world’s most widely-used speech platforms. Notably, Twitter has battled this data problem for years. In 2017, Twitter admitted to overstating its user base for several years, and in 2016 a troll farm used more than 50,000 bots to try to sway the US presidential election. Twitter first acknowledged fake accounts during its 2013 IPO.
Twitter, like Unity, is another high-profile example of the impact of bad data, but examples like this are everywhere – and it costs companies millions of dollars.
Gartner estimates that bad data costs companies nearly $13 million per year, although many don’t even realize the extent of the impact. Meanwhile, Harvard Business Review finds that knowledge workers spend about half of their time fixing data issues. Just imagine how much effort they could devote elsewhere if issues weren’t so prevalent.
Overall, bad data can lead to missed revenue opportunities, inefficient operations, and poor customer experiences, among other issues that add up to that multi-million dollar price tag.
Why Observability is Now Imperative for the C-Suite
The fact that bad data costs companies millions of dollars each year is bad enough. The fact that many companies don’t even realize this because they don’t measure the impact is potentially even worse. After all, how can you ever fix something of which you’re not fully aware?
Getting ahead of bad data issues requires data observability, which encompasses the ability to understand the health of data in your systems. Data observability is the only way that organizations can truly understand not only the impact of any bad data but also the causes of it – both of which are imperative to fixing the situation and stemming the impact.
It’s also important to embed data observability at every point possible with the goal of finding issues sooner in the pipeline rather than later because the further those issues progress, the more difficult (and more expensive) they become to fix.
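As a minimal sketch of what an embedded check can look like, the Python below validates one field’s null rate at a single pipeline stage. The stage name, field, and threshold are illustrative assumptions, not any particular tool’s API:

```python
# Minimal sketch of embedding a quality check at a pipeline stage.
# Stage names and thresholds are illustrative assumptions.

def null_fraction(rows, field):
    """Fraction of rows where `field` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(field) is None)
    return missing / len(rows)

def check_stage(stage, rows, field, max_null_fraction=0.05):
    """Return (ok, message) so callers can alert as early as possible."""
    frac = null_fraction(rows, field)
    ok = frac <= max_null_fraction
    return ok, f"{stage}: {field} null fraction {frac:.1%}"

ingested = [{"user_id": 1}, {"user_id": None}, {"user_id": 3}]
ok, msg = check_stage("ingestion", ingested, "user_id")
# ok is False here: 1 of 3 rows is null (~33%), above the 5% threshold
```

Running the same check at each stage (ingestion, transformation, warehouse) is what makes it possible to catch an issue before it propagates downstream.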
Critically, this observability must be an imperative for C-suite leaders, as bad data can have a serious impact on company revenue (just ask Unity and Twitter). Making data observability a priority for the C-suite will help the entire organization – not just data teams – rally around this all-important initiative and make sure it becomes everyone’s responsibility. With that commitment in place, teams can:
Identify data issues earlier on in the data pipeline to stem their impact on other areas of the platform and/or business
Pinpoint data issues more quickly after they pop up to help arrive at solutions faster
Understand the extent of data issues that exist to get a complete picture of the business impact
In turn, this visibility can help companies recover more revenue faster by taking the necessary steps to mitigate bad data. Hopefully, the end result is a fix before the issues end up costing millions of dollars. And the only way to make that happen is if everyone, starting with the C-suite, prioritizes data observability.
What’s the Difference? Data Engineer vs Data Scientist vs Analytics Engineer
The modern data team is, well, complicated.
Even if you’re on the data team, keeping track of all the different roles and their nuances gets confusing – let alone if you’re a non-technical executive who’s supporting or working with the team.
One of the biggest areas of confusion? Understanding the differences between the data engineer, data scientist, and analytics engineer roles.
The three are closely intertwined. And as Josh Laurito, Director of Data at Squarespace and editor of NYC Data, tells us, there really is no single definition for each of these roles, nor firm lines between them. You can listen to our full discussion with Josh Laurito below.
But still, there are some standard differences everywhere you go. And that’s exactly what we’ll look at today.
What is a data engineer?
A data engineer develops and maintains data architecture and pipelines. Essentially, they build the programs that generate data and aim to do so in a way that ensures the output is meaningful for operations and analysis.
Some of their key responsibilities include:
Developing processes for data modeling and data generation
Standardizing data management practices
Important skills for data engineers include:
Expertise in SQL
Ability to work with structured and unstructured data
Deep knowledge in programming and algorithms
Experience with engineering and testing tools
Strong creative thinking and problem-solving abilities
What about an analytics engineer?
An analytics engineer brings together data sources in a way that makes it possible to drive consolidated insights. Importantly, they do the work of building systems that can model data in a clean, clear way repeatedly so that everyone can use those systems to answer questions on an ongoing basis. As one analytics engineer at dbt Labs puts it, a key part of analytics engineering is that “it allows you to solve hard problems once, then gain benefits from that solution infinitely.”
Some of their key responsibilities include:
Understanding business requirements and defining successful analytics outcomes
Cleaning, transforming, testing, and deploying data to be ready for analysis
Introducing definitions and documentation for key data and data processes
Bringing software engineering techniques like continuous integration to analytics code
Training others to use the end data for analysis
Consulting with data scientists and analysts on areas to improve scripts and queries
Important skills for analytics engineers include:
Deep understanding of software engineering best practices
Experience with data warehouse and data visualization tools
Strong capabilities around maintaining multi-functional relationships
Background in data analysis or data engineering
So then what’s a data scientist?
A data scientist studies large data sets using advanced statistical analysis and machine learning algorithms. In doing so, they identify patterns in data to drive critical business insights, and then typically use those patterns to develop machine learning solutions for more efficient and accurate insights at scale. Critically, they combine this statistics experience with software engineering experience.
Some of their key responsibilities include:
Transforming and cleaning large data sets into a usable format
Applying techniques like clustering, neural networks, and decision trees to gain insights from data
Analyzing data to identify patterns and spot trends that can impact the business
Important skills for data scientists include:
Deep expertise in machine learning, data conditioning, and advanced mathematics
Experience using big data tools
Understanding of API development and operations
Background in data optimization and data mining
Strong creative thinking and decision-making abilities
How does it all fit together?
Even seeing the descriptions of data engineer vs data scientist vs analytics engineer side-by-side can cause confusion, as there are certainly overlaps in skills and areas of focus across each of these roles. So how does it all fit together?
A data engineer builds programs that generate data, and while they aim for that data to be meaningful, it will still need to be combined with other sources. An analytics engineer brings together those data sources to build systems that allow users to access consolidated insights in an easy-to-access, repeatable way. Finally, a data scientist develops tools to analyze all of that data at scale and identify patterns and trends faster and better than any human could.
Critically, there needs to be a strong relationship between these roles. But too often, it ends up dysfunctional. Jeff Magnuson, Vice President, Data Platform at Stitch Fix, wrote about this topic several years ago in an article titled Engineers Shouldn’t Write ETL. The crux of his article was that teams shouldn’t have separate “thinkers” and “doers”. Rather, high-functioning data teams need end-to-end ownership of the work they produce, meaning that there shouldn’t be a “throw it over the fence” mentality between these roles.
The result is a high demand for data scientists who have an engineering background and understand things like how to build repeatable processes and the importance of uptime and SLAs. In turn, this approach has an impact on the role of data engineers, who can then work side-by-side with data scientists in an entirely different way. And of course, that cascades to analytics engineers as well.
Understanding the difference between data engineer vs data scientist vs analytics engineer once and for all – for now
The truth remains that many organizations define each of these roles differently. It’s difficult to draw a firm line between where one ends and where one begins because they all have similar tasks to some extent. As Josh Laurito concludes: “Everyone writes SQL. Everyone cares about the quality. Everyone evaluates different tables and writes data somewhere, and everyone complains about time zones. Everyone does a lot of the same stuff. So really the way we [at Squarespace] divide things is where people are in relation to our primary analytical data stores.”
At Squarespace, this means data engineers are responsible for all the work done to create and maintain those stores. Analytics engineers are embedded in the functional teams to support decision making, put together narratives around the data, and use that to drive action and decisions. Finally, data scientists sit in the middle, setting up the incentive structures and the metrics to make decisions and guide people.
Of course, it will be slightly different for every organization. And as blurry as the lines are now, each of these roles will only continue to evolve and further shift the dynamics across each of them. But hopefully, this overview helps solve the question of what’s the difference between data engineer vs data scientist vs analytics engineer – for now.
Monitoring for data quality issues as early as ingestion: here’s why
Maintaining data quality is challenging. Data is often unreliable and inconsistent, especially when it flows from multiple data sources. To deal with quality issues and prevent them from impacting your product and decision-making, you need to monitor your data flows. Monitoring helps identify schema changes, discover missing data, catch unusual levels of null records, fix failed pipelines, and more. In this blog post, we will explain why we recommend monitoring data starting at the source (ingestion), list three potential use cases for ingestion monitoring, and finish off with three best practices for data engineers to get started.
This blog post is based on the first episode of our podcast, “Why Data Quality Begins at the Source”, which you can listen to below or here.
Where Should You Monitor Data Quality?
Data quality monitoring is a fairly new practice, and different tools offer different monitoring capabilities across the data pipeline. While some tools monitor data quality where the data rests, i.e., at the data warehouse, at Databand we think it’s critical to monitor data quality as early as ingestion, not just at the warehouse level. Let’s look at a few reasons why.
4 Reasons to Monitor Data Quality Early in the Data Pipeline
1. Higher Probability of Identifying Issues
Erroneous or abnormal data affects all the other data and analytics downstream. Once corrupt data has been ingested and flows to the data lake/warehouse, it might already be mixed up with healthy data and used in analyses. This makes it much more difficult to identify the errors and their source, because the dirty data can be “washed out” in the higher volumes of data sitting at rest.
In fact, the ability to identify issues is based on an engineer or analyst knowing what the expected data results should be, recognizing a problematic anomaly, and diagnosing that this anomaly is the result of data and not a business change. When the corrupt data is such a small percentage of the entire data lake, this becomes even harder.
Why take the chance of overlooking errors and problems that could impact the product, your users, and decision making? By monitoring early in the pipeline, many issues can be avoided because you are monitoring a more targeted sample of data, and therefore able to create a more precise baseline for when data looks unusual.
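One way to build that more precise baseline is to compare each ingested batch against recent history. The sketch below uses a simple z-score on batch row counts; the window of history and the threshold are assumptions for illustration:

```python
# Hedged sketch: flag an ingestion batch whose row count deviates
# sharply from a baseline built on recent batches.
from statistics import mean, stdev

def is_anomalous(history, current, z_threshold=3.0):
    """True if `current` is more than z_threshold standard
    deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to form a baseline yet
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold

recent_counts = [10_000, 10_250, 9_900, 10_100, 10_050]
is_anomalous(recent_counts, 2_000)   # far below baseline: anomalous
is_anomalous(recent_counts, 10_150)  # within the normal range
```

Because the check runs per source at ingestion, the baseline reflects that one feed rather than the whole lake, which is exactly what makes small anomalies visible.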
2. Creating Confidence in the Warehouse
Analysts and additional stakeholders rely on the warehouse to make a wide variety of business and product decisions. Trusting warehouse data is essential for business agility and for making the right decisions. If data in the warehouse is “known” to have issues, stakeholders will not use it or trust it. This means the organization is not leveraging data to the full extent.
If the data warehouse is the heart of the customer-facing product, i.e., the product relies almost entirely on data, then corrupt data could jeopardize the entire product’s adoption in the market.
3. More Time to React
By identifying data issues faster, data engineers have more time to react. They can identify causality and lineage, and fix the data or source to prevent any harmful impact that corrupt data could have. Trying to identify and fix full-blown issues in the product or after decision-making is much harder to do.
4. Enabling Data Source Governance
By analyzing, monitoring and identifying the point of inception, data engineers can identify a malfunctioning source and act to fix it. This provides better governance over sources, in real-time and in the long-run.
When Should You Monitor Data Quality from Ingestion?
We recommend monitoring data quality across the pipeline, both at ingestion and at rest. However, you need to start somewhere… Here are the top three use cases for prioritizing monitoring at ingestion:
Frequent Source Changes – When your business relies on data sources or APIs where data structure frequently changes, it is recommended to continuously monitor them. For example, in the case of a transportation application that pulls data from the constantly changing APIs of location data, user tracking information, etc.
Multiple External Data Sources – When your business’s output depends on analyzing data from dozens or hundreds of sources. For example, a real-estate app that provides listings based on data from offices, municipalities, schools, etc.
Data-Driven Products – When your product is based on data and each data source has a direct impact on the product. For example, navigation applications that pull data about roads, weather, transportation, etc.
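For the frequent-source-change case in particular, a basic schema-drift check can be sketched in a few lines. The expected field names below are hypothetical:

```python
# Illustrative sketch: detect schema drift in an external source by
# comparing incoming field names against a recorded expectation.
# EXPECTED_FIELDS is a hypothetical contract for one source.

EXPECTED_FIELDS = {"id", "latitude", "longitude", "timestamp"}

def schema_drift(record, expected=EXPECTED_FIELDS):
    """Return (missing, unexpected) field-name sets for one record."""
    actual = set(record)
    return expected - actual, actual - expected

record = {"id": 7, "lat": 52.1, "longitude": 4.9, "timestamp": "2022-05-11"}
missing, unexpected = schema_drift(record)
# missing == {"latitude"}; unexpected == {"lat"}: the source renamed a field
```

Run against every incoming batch, a check like this turns a silent upstream rename into an immediate, attributable alert.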
Getting Started with Data Quality Monitoring
As mentioned before, data quality monitoring is a relatively new practice. Therefore, it makes sense to implement it gradually. Here are three recommended best practices:
1. Determine Quality Layers
Data changes across the pipeline, and so does its quality. Divide your data pipeline into various steps, e.g., the warehouse layer, the transformation layer, and the ingestion layer. Understand that data quality means different things at each of these stages and prioritize the layers that have the most impact on your business.
2. Monitor Different Quality Depths
When monitoring data, there are different quality aspects to review. Start by reviewing metadata, ensuring the data structure is correct and that all the data arrived. Once metadata has been verified, move on to address explicit business-related aspects of the data, which relate to domain knowledge.
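The two depths above can be sketched as two passes of checks. The field names and the negative-price rule below are hypothetical business examples:

```python
# Sketch of the two quality depths: structural (metadata) checks
# first, then domain-specific business checks. Fields and rules
# are assumptions for illustration.

def metadata_checks(rows, required_fields):
    """Depth 1: did the data arrive, with the structure we expect?"""
    errors = []
    if not rows:
        errors.append("no rows arrived")
    for i, row in enumerate(rows):
        missing = required_fields - set(row)
        if missing:
            errors.append(f"row {i} missing fields: {sorted(missing)}")
    return errors

def business_checks(rows):
    """Depth 2: domain rules that need business knowledge."""
    errors = []
    for i, row in enumerate(rows):
        if row.get("price", 0) < 0:
            errors.append(f"row {i}: negative price")
    return errors

rows = [{"sku": "A1", "price": 9.99}, {"sku": "B2", "price": -5.0}]
structural = metadata_checks(rows, {"sku", "price"})
domain = business_checks(rows) if not structural else []
# structural == []; domain == ["row 1: negative price"]
```

Gating the business checks on the metadata checks keeps error reports clean: a missing column produces one structural error rather than a cascade of confusing domain failures.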
3. See Demos of Different Data Monitoring Tools
Once you’ve mapped out your priorities and pain points, it’s time to find a tool that can automate this process for you. Don’t hesitate to see demos of different tools and ask the hard questions about data quality assurance. For example, to see a demo of Databand, click here. To learn more about data quality and hear the entire episode this blog post was based on, visit our podcast, here.
10 Advanced Data Pipeline Strategies for Data Engineers
If you’re like us, schematics for an ideal data pipeline are nice, but not always helpful. The gap between theory and practice is vast and it’s common for people to make suggestions online irrespective of realities like, say, budget. Or, without knowing that you rarely spin up data pipeline architecture entirely from scratch.
In this guide, we share advanced-level strategies for managing data pipelines in the real world, so you appear to be the data ninja your team already thinks you are.
What is a data engineering pipeline?
A data pipeline is a series of connected processes that moves data from one point to another, possibly transforming it along the way. It’s linear, with sequential and sometimes parallel executions. The analogy—“a pipeline”—is also helpful in understanding why pipelines that move data can be so difficult to build and maintain.
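As a toy illustration of that definition, the snippet below wires three sequential steps into one linear pipeline, each step consuming the previous step’s output (the step names and data are made up):

```python
# A toy linear pipeline: extract -> transform -> load, with each
# step consuming the previous step's output.

def extract():
    # Stand-in for pulling raw records from a source.
    return ["3", "1", "2"]

def transform(raw):
    # Parse and order the raw values.
    return sorted(int(x) for x in raw)

def load(values, sink):
    # Stand-in for writing to a warehouse table.
    sink.extend(values)
    return sink

sink = []
load(transform(extract()), sink)
# sink == [1, 2, 3]
```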
How should a data pipeline work? Predictably, with data changing only in expected ways. A pipeline should be designed from the ground up to maintain data quality, or data health, along the four dimensions that matter: fitness, lineage, governance, and stability.
The problem with opaque data pipelines
Sometimes you come in one morning and the pipeline is down and you have no idea why. Sometimes someone knows why and you fix it in minutes. More often, the diagnosis takes days.
The challenge is the way pipelines tend to be built—opaque—just like real-life oil-bearing pipelines. You can’t peer inside. If there’s a leak somewhere, or a screwy transformation, it takes a lot of time to figure out where that’s happening, or whether the pipeline is even responsible. What if it’s an issue with an upstream data provider? Or if it is indeed your issue, are you uncovering proximal causes or root ones?
The issue is often several degrees deep. If your scheduler runs and tells you it was successful, but all values are missing, you have an issue. When you dig in, perhaps a Spark job failed. But why did it fail? This is where the real work begins—understanding all the ways things can and do go wrong in the real world so you build data pipelines that function in reality.
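A cheap defense against the “scheduler says success, but all values are missing” failure mode is a post-run assertion on the job’s output. The job below is a stand-in, not a real Spark job:

```python
# Hedged sketch: a post-run assertion that catches the case where a
# run "succeeds" but produces no usable values. run_job is a stand-in.

def run_job():
    # Pretend the upstream job reported success but emitted only nulls.
    return [{"value": None}, {"value": None}]

def validate_output(rows):
    non_null = [r for r in rows if r["value"] is not None]
    if not non_null:
        raise RuntimeError("job reported success but all values are missing")
    return rows

try:
    validate_output(run_job())
    status = "ok"
except RuntimeError as exc:
    status = str(exc)
# status now records the real failure instead of a silent "success"
```

Failing loudly here is the point: the scheduler’s green checkmark is replaced by an error message that names the actual problem.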
Ten engineering strategies for designing, building, and managing a data pipeline
Below are ten strategies for how to build a data pipeline, drawn from dozens of years of our own team’s experiences. We have included quotes from data engineers, most of which have been kept anonymous to protect their operations.
1. Understand the precedent
Before you do anything, spend time understanding what came before. Know the data models of the systems that preceded yours, know the quirks of the systems you’re pulling from and importing to, and know the expectations of the business users. Call it a data audit and record your findings along with a list of questions that still need answering.
For example, at a large retailer, the most exciting thing isn’t the tool that works by itself but the one that works cooperatively with a legacy architecture and helps you migrate off it. It’s not uncommon for these teams to have 10,000 hours of work invested in some of their existing products. If someone tries something new and it fails in a big way, they may lose their job. Given the option, most would rather not touch it. For them, compatibility is everything, and so they must first understand the precedent.
2. Build incrementally
Build pieces of your pipeline as you need them in a modular fashion that you can adjust. The reason is, you won’t know what you need until you build something that doesn’t quite suit your purpose. It’s one of the many paradoxes of data engineering. The requirements aren’t clear until a business user asks for a time series that they only just now realized they need, but which is unsupportable.
3. Document your goals as you go
Your goals will continue to evolve as you build. Create a shared living document (Google Docs will do) and revisit and update it. Also ask others who will be involved in the pipeline, upstream or downstream, to document their goals as well. In our experience, everyone is going to tend to presume others are thinking what they’re thinking. It’s only by documenting that you realize someone wants a metric that, say, includes personally identifiable information (PII) and so is not allowed.
4. Build to minimize cost
Costs will always be higher than you expect. We have never met an engineer who said, “And to our great surprise, it cost half as much as we first thought.” When planning spend, all the classic personal finance rules apply: Overestimate costs by 20%, don’t spend what you don’t yet have, avoid recurring costs, and keep a budget.
If there are components that will need to grow exponentially, and you can pull them off of a paid platform and do it for (nearly) free, that may be the key to you accomplishing twice as much with this pipeline, and to building more.
Even as data lake providers launch features like cost alerts and budgetary kill-switches, the principle remains: build to minimize cost from the very beginning.
5. Identify the stakes and tolerance
High stakes and low tolerance systems require careful planning. For example, a rocket going into space with human lives onboard. But in the data world, most decisions are reversible. That means it can often be cheaper in terms of your time and effort to simply try it and revert rather than agonizing for weeks while deciding.
For an ecommerce company, the stakes might at first seem low. But after talking to business users, you might learn that the downstream effects of a data error could make millions of products appear available in a store when they’re not, creating a web of errors and missed expectations you can’t easily untangle.
Knowing the stakes and tolerance tells you how much “breaking” you can afford to do.
6. Organize in functional work groups
Create working groups that include an analyst, a data scientist, an engineer, and possibly someone from the business side. Have them focus on problems as a unit. It’s far more effective. If they simply worked sequentially, tossing requirements over the fence to one another, everyone would eventually grow frustrated, there’d be a lot of inefficient ‘work about work,’ and things would take forever. Functional groups tend to build better data pipelines that cost less.
This approach also gives data engineers a seat at the table when decisions are being made so they can vet ideas at the outset. If all they do is wait for notebooks from the data scientist, they’ll often discover the notebooks don’t work, and they’ll either have to send them back or rewrite them themselves. Or, they’ll find that other teams continuously ask for columns that are derivable from other data, but which must be transformed.
“A constant challenge is ensuring my data engineers have a good contract with data scientists and know how to take products from them and smoothly integrate them into the system. Even with pods, it’s not always smooth.”
-Data Engineering Team Lead
7. Implement monitoring and observability data pipeline tools
Some tools help you keep costs low, and observability tools fall into that category. They provide instrumentation to help you understand what’s happening within your pipeline. Without highly specific answers to questions around why data pipelines fail, you can spend an inordinate amount of time diagnosing the proximal and root causes of pipeline issues.
“Observability” is a bit of a buzzword these days, but it serves as an umbrella term to encompass:
Monitoring—a dashboard that provides an operational view of your pipeline or system
Alerting—alerts, both for expected events and anomalies
Tracking—ability to set and track specific events
Comparisons—monitoring over time, with alerts for anomalies
Analysis—anomaly detection that adapts to your pipeline and data health
Next best action—recommended actions to fix errors
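As a minimal sketch of two of these capabilities working together, the snippet below tracks a per-run metric and raises an alert when it drops. The metric, threshold, and alert channel (a plain list standing in for a pager or Slack hook) are assumptions:

```python
# Minimal sketch tying together tracking and alerting: record a
# per-run metric over time and alert when it falls below a floor.
# The alert "channel" is just a list for illustration.

metric_history = []
alerts = []

def track(run_id, rows_written):
    """Tracking: record one run's output metric."""
    metric_history.append((run_id, rows_written))

def alert_on_drop(min_rows=1):
    """Alerting: flag the latest run if it wrote too few rows."""
    run_id, rows = metric_history[-1]
    if rows < min_rows:
        alerts.append(f"run {run_id}: wrote {rows} rows (expected >= {min_rows})")

track("2022-05-11T00:00", 10_432)
alert_on_drop()
track("2022-05-12T00:00", 0)
alert_on_drop()
# alerts holds one entry, for the run that wrote 0 rows
```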
8. Use a decision tree to combat tool sprawl
Nobody wants yet another point-solution tool that you then have to maintain. Create a decision tree for your team to decide when it makes sense to add another tool versus adjust an existing one, or evaluate a platform that would consolidate several functions. It’s good for data quality too. The fewer moving pieces, the less there is to diagnose.
9. Build your pipeline to control for all four dimensions of data quality
We’ve published a model for the four dimensions of data quality that matter to engineers—fitness, lineage, governance, and stability. These dimensions must exist in equilibrium, and you cannot maintain quality without addressing all four.
10. Document things as a byproduct of work
Also known as “knowledge-centered service,” this means you should be in the habit of documenting what you do, and at the very least, keeping a log your team can access. The highest achievement for a data engineer is not being a hero that the entire company depends on, but constructing a system that’s so durable it outlasts you. Documentation should be intrinsic to your work.
Sometimes, you need to move fast and break things to meet a deadline. While that may make your data consumers happy in the short term, they won’t be happy when it all comes crumbling down under the weight of technical debt. More often than not, a little planning upfront and following these best practices can avoid a lot of headaches down the road. Still, best practices won’t help you avoid the fickle nature of data altogether; for that, you need a data observability platform to catch anomalies and data quality issues as they crop up.
Databand.ai is a unified data observability platform built for data engineers. Databand.ai centralizes your pipeline metadata so you can get end-to-end observability into your data pipelines, identify the root cause of health issues quickly, and fix the problem fast. To learn more about Databand and how our platform helps data engineers with their data pipelines, request a demo!