What is Data Lineage?

Databand
2022-07-28 10:20:00

The term “data lineage” has been thrown around a lot over the last few years.

What started as an idea of connecting datasets quickly became a confusing term that is now often misused.

It’s time to put order to the chaos and dig deep into what it really is. Because the answer matters quite a lot. And getting it right matters even more to data organizations. 

This article will unpack everything you need to know about data lineage, including:

  • What is it?
  • What’s the difference between data lineage and data provenance?
  • Why is it important?
  • What are common data lineage techniques?
  • What are data lineage best practices?
  • What is end-to-end lineage vs. data at rest lineage?
  • What are the benefits of end-to-end data lineage?
  • What should you look for in a data lineage tool?

What is data lineage?

Data lineage tracks data throughout its complete lifecycle. It follows data from its source to its end location and notes any changes (including what changed, why it changed, and how it changed) along the way. And it does all of this visually.

Usually, it provides value in two key areas:

  1. Development process: Knowing what affects what and what could be the impact of making changes. 
  2. Debugging process: Understanding the severity, impact, and root cause of issues.

In general, it makes it possible to identify errors in data, reduce the risk associated with system and process changes, and increase trust in data. All of these are essential at a time when data plays such an integral role in business outcomes and decision-making.

Data lineage in action: A simplified example

When data engineers talk about data lineage, they often picture a data observability platform that lets them understand the logical relationships between datasets that affect each other in a specific business flow.


In this very simplified example, we can see an ELT flow (a minimal code sketch follows the list):

  • Some pipeline tasks, probably run by Airflow, are scraping external data sources and collecting data from them.
  • Those tasks are saving the extracted data in the data lake (or warehouse or lakehouse).
  • Other tasks, probably SQL jobs orchestrated with DBT, are running transformations on the loaded data. They are querying raw data tables, enriching them, joining tables, and creating business data – all ready to be used.
  • Dashboarding tools such as Tableau, Looker, or Power BI are being used on top of the business data and providing visibility to multiple stakeholders.
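
What might the extract-and-load half of this flow look like in code? Below is a minimal sketch, assuming a recent Airflow 2.x release with the TaskFlow API; the endpoint URL, DAG name, and the local-file "load" step are hypothetical placeholders (a real pipeline would load into a lake or warehouse, with DBT handling the downstream transformations).

```python
# A minimal sketch of the extract-and-load half of the ELT flow described above,
# assuming Airflow 2.x with the TaskFlow API. The API URL, DAG name, and file
# path are hypothetical placeholders.
import json
from datetime import datetime
from pathlib import Path

import requests
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2022, 1, 1), catchup=False)
def listings_elt():
    @task
    def extract() -> list[dict]:
        # Scrape an external data source (hypothetical endpoint).
        response = requests.get("https://api.example.com/listings", timeout=30)
        response.raise_for_status()
        return response.json()

    @task
    def load(rows: list[dict]) -> None:
        # Stand-in for loading into a data lake or warehouse; a real pipeline
        # would use an S3/BigQuery/Snowflake hook here instead of a local file.
        Path("/tmp/raw_listings.json").write_text(json.dumps(rows))

    load(extract())


listings_elt()
```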

What’s the difference between data lineage and data provenance?

Data lineage and data provenance are often viewed as one and the same. While the two are closely related, there is a difference.

Whereas data lineage tracks data throughout the complete lifecycle, data provenance zooms in on the data origin. It provides insight into where data comes from and how it gets created by looking at important details like inputs, entities, systems, and processes for the data.

Data provenance can support error tracking within data lineage and can also help validate data quality.

Why is it important?

As businesses use more big data in more ways, having confidence in that data becomes increasingly important – just look at Elon Musk’s deal to buy Twitter for an example of trust in data gone wrong. Consumers of that data need to be able to trust in its completeness and accuracy and receive insights in a timely manner. This is where data lineage comes into play.

Data lineage instills this confidence by providing clear information about data origin and how data has moved and changed since then. In particular, it is important to key activities like:

  • Data governance: Understanding the details of who has viewed or touched data and how and when it was changed throughout its lifecycle is essential to good data governance. Data lineage provides that understanding to support everything from regulatory compliance to risk management around data breaches. This visibility also helps ensure data is handled in accordance with company policies.
  • Data science and data analytics: Data science and data analytics are critical functions for organizations that are using data within their business models, and powering strong data science and analytics programs requires a deep understanding of data. Once again, data lineage offers the necessary transparency into the data lifecycle to allow data scientists and analysts to work with the data and identify its evolutions over time. For instance, data lineage can help train (or re-train) data science models based on new data patterns.
  • IT operations: If teams need to introduce new software development processes, update business processes, or adjust data integrations, understanding any impact on data along the way – as well as where data might need to come from to support those processes – is essential. Data lineage not only delivers this visibility, but it can also reduce manual processes associated with teams tracking down this information or working through data silos.
  • Strategic decision making: Any organization that relies on data to power strategic business decisions must have complete trust that the data they’re using is accurate, complete, and timely. Data lineage can help instill that confidence by showing a clear picture of where data has come from and what happened to it as it moved from one point to another.
  • Diagnosing issues: Should issues arise with data in any way, teams need to be able to identify the cause of the problem quickly so that they can fix it. The visibility provided by data lineage can help make this possible by allowing teams to visualize the path data has taken, including who has touched it and how and when it changed.

What are common techniques?

There are several commonly used techniques for data lineage that collect and store information about data throughout its lifecycle to allow for a visual representation. These techniques include:

  • Pattern-based lineage: Evaluates metadata for patterns in tables, columns, and reports rather than relying on any code to perform data lineage. This technique focuses directly on the data (vs. algorithms), making it technology-agnostic; however, it is not always the most accurate technique.
  • Self-contained lineage: Tracks data movement and changes in a centralized system, like a data lake that contains data throughout its entire lifecycle. While this technique eliminates the need for any additional tools, it does have a major blind spot to anything that occurs outside of the environment at hand.
  • Lineage by data tagging: A transformation engine that tags every movement or change in data allows for lineage by data tagging. The system can then read those tags to visualize the data lineage. Similar to self-contained lineage, this technique only works for contained systems, as the tool used to create the tags will only be able to look within a single environment.
  • Lineage by parsing: An advanced form of data lineage that reads the logic used to process data. Specifically, it provides end-to-end tracing by reverse engineering data transformation logic. This technique can get complicated quickly, as it requires an understanding of all the programming logic used throughout the data lifecycle (e.g. SQL, ETL, Java, XML, etc.). A minimal sketch of the idea follows this list.
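
As a rough illustration of the parsing approach, the sketch below uses the open-source sqlglot parser (an assumed choice; any SQL parser would work) to recover which upstream tables feed a made-up transformation query.

```python
# A rough illustration of lineage by parsing: recover source tables from the
# SQL of a transformation. Assumes the open-source sqlglot parser is installed;
# the query itself is a made-up example.
import sqlglot
from sqlglot import exp

query = """
    CREATE TABLE business.daily_revenue AS
    SELECT o.order_date, SUM(o.amount) AS revenue
    FROM raw.orders AS o
    JOIN raw.currencies AS c ON o.currency_id = c.id
    GROUP BY o.order_date
"""

parsed = sqlglot.parse_one(query)

# Every table referenced anywhere in the statement, qualified with its schema.
tables = {
    f"{t.db}.{t.name}" if t.db else t.name
    for t in parsed.find_all(exp.Table)
}

# The CREATE TABLE target is the downstream node; everything else is upstream.
target = "business.daily_revenue"
upstream = sorted(tables - {target})
print(f"{target} <- {upstream}")  # e.g. business.daily_revenue <- ['raw.currencies', 'raw.orders']
```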

What are data lineage best practices?

When it comes to introducing and managing data lineage, there are several best practices to keep in mind:

  • Automate data lineage extraction: Manual data lineage centered around spreadsheets is no longer an option. Capturing the dynamic nature of data in today’s business environments requires an automated solution that can keep up with the pace of data and reduce the errors associated with manual processes.
  • Bring metadata sources into data lineage: Systems that handle data, like ETL software and database management tools, all create metadata – or data about the data they handle (meta, right?). Bringing these metadata sources into data lineage is critical to gaining visibility into how data was used or changed and where it’s been throughout its lifecycle (a small extraction sketch follows this list).
  • Communicate with metadata source owners: Staying in close communication with the teams that own metadata management tools is critical. This communication allows for verification of metadata (including its timeliness and accuracy) with the teams that know it best.
  • Progressively extract metadata and lineage: Progressive extraction – or extracting metadata and lineage in the same order as it moves through systems – makes it easier to do activities like mapping relationships, connections, and dependencies across the data and systems involved.
  • Progressively validate end-to-end data lineage: Validating data lineage is important to make sure everything is running as it should. Doing this validation progressively by starting with high-level system connections, moving to connected datasets, then elements, and finishing off with transformation documentation simplifies the process and allows it to flow more logically.
  • Introduce a data catalog: Data catalog software makes it possible to collect data lineage across sources and extract metadata, allowing for end-to-end data lineage.
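
To make the metadata idea a bit more concrete, here is a small sketch of automated column-metadata extraction. It uses SQLite's schema introspection as a stand-in for a warehouse's information schema; the database path and table name are hypothetical.

```python
# A small illustration of automated metadata extraction, one building block of
# automated lineage. SQLite introspection stands in for a warehouse's
# information schema; the database path and table name are hypothetical.
import sqlite3

conn = sqlite3.connect("warehouse.db")  # placeholder path


def extract_column_metadata(table: str) -> list[dict]:
    """Return name/type/nullability metadata for every column of a table."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        {"column": name, "type": col_type, "nullable": not notnull}
        for _cid, name, col_type, notnull, _default, _pk in rows
    ]


# This metadata snapshot can then be attached to the lineage graph and diffed
# over time to detect schema drift.
print(extract_column_metadata("raw_orders"))
```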

What is end-to-end lineage vs. data at rest lineage?

When talking about lineage, most conversations tackle the scenario of data “in-the-warehouse,” which presumes everything is occurring in a contained data warehouse or data lake. In these cases, lineage monitors the operations performed on one or more tables to extract the relationships within or among them.

At Databand, we refer to this as “data at rest lineage,” since it observes the data after it was already loaded into the warehouse.

This data at rest lineage can be troublesome for modern data organizations, which typically have a variety of stakeholders (think: data scientist, analyst, end customer), each of which has very specific outcomes they’re optimizing toward. As a result, they each have different technologies, processes, and priorities and are usually siloed from one another. Data at rest lineage that looks at data within a specific data warehouse or data lake typically doesn’t work across these silos or data integrations.

Instead, what organizations need is end-to-end data lineage, which looks at how data moves across data warehouses and data lakes to show the true, complete picture.

Consider the case of a data engineer who owns end-to-end processes spanning dozens of DAGs in different technologies. If that engineer encounters corrupted data, they want to know the root cause. They want to be able to proactively catch issues before they land on business dashboards and to track the health of the different sources on which they rely. Essentially, they want to be able to monitor the real flow of the data.

With this type of end-to-end lineage, they could see that a SQL query has introduced corrupted data to a column in a different table or that a DBT test failure has affected other analysts’ dashboards. In doing so, end-to-end lineage captures data in motion, resulting in a visual similar to the following:

[Diagram: end-to-end data lineage capturing data in motion across pipelines and datasets]

What are the benefits of end-to-end data lineage?

Modern organizations need true end-to-end lineage because it’s no longer enough just to monitor a small part of the pipeline. While data at rest lineage is easy to integrate, it provides very low observability across the entire system.

Additionally, data at rest lineage is limited across development languages and technologies. If everything is SQL-based, that’s one thing. But the reality is, modern data teams will use a variety of languages and technologies for different needs that don’t get covered with the more siloed approach.

As if that wasn’t enough, most of the issues with data happen before it ever reaches the data warehouse, but data at rest lineage won’t capture those issues. If teams did have that visibility though, they could catch issues sooner and proactively protect business data from corruption.

End-to-end data lineage solves these challenges and delivers several notable benefits, including:

  • Clear visibility on impact: If there’s a schema change in the external API from which a Python job fetches data, teams need true end-to-end visibility to know which business dashboards will be affected. Gaining that visibility requires understanding the path of data in motion across environments and systems – something only end-to-end data lineage that tracks data in motion can provide.
  • Understanding of root cause: By the time an issue hits a table used by analysts, the problem is already well underway, stemming from further back in the data lifecycle. With data at rest lineage, it’s only possible to see what’s happening in that particular table, though – which isn’t helpful for identifying the cause of the issue. End-to-end lineage, on the other hand, can look across the complete lifecycle to provide clarity into the root cause of issues, wherever they turn up.
  • Ability to connect between pipelines and datasets: In a very complex environment where thousands of pipelines (or more!) are writing and reading data from thousands of datasets, the ability to identify which pipeline is working on a weekly, daily, or hourly basis and with which tables (or even specific columns within tables) is a true game-changer.

What should you look for in a data lineage tool?

As data lineage becomes increasingly important, what should you look for in a data lineage tool?

Above all else, you need a tool that can power end-to-end data lineage (vs. data at rest lineage). You also need a solution that can automate the process, as manual data lineage simply won’t cut it anymore.

With those prerequisites in mind, other capabilities to consider when evaluating a data lineage tool include:

  • Alerts: Automated alerts should allow you to not just identify that an incident has occurred, but gain context on that incident before jumping into the details. This context might include high-level details like the data pipeline experiencing an issue and the severity of the issue.
  • View of affected datasets: The ability to see all of the datasets impacted by a particular issue in a single, bird’s-eye view is helpful for understanding the effect on operations and the severity of the issue.
  • Visual of data lineage: Visualizing data lineage by seeing a graph of relationships between the data pipeline experiencing the issue and its dependencies allows you to gain a deeper understanding of what’s happening and what’s affected as a result. The ability to click into tasks and see the dependencies and impact to each one for a given task provides even more clarity when it comes to issue resolution.
  • Debugging within tasks: Finally, the ability to see specific errors within specific tasks allows for quick debugging of issues for faster resolution.

Getting it right

Data lineage isn’t a new concept, but it is one that’s often misunderstood. However, as data becomes more critical to more areas of business, getting it right is increasingly important.

It requires an understanding of exactly what data lineage is and why it’s so important. Additionally, it requires a thoughtful approach to addressing data lineage that matches the needs of a modern data organization – which means true end-to-end data lineage. And finally, it requires the right tool to support this end-to-end lineage in an automated way.


The Impact of Bad Data and Why Observability is Now Imperative

Databand
2022-06-02 14:09:07

Think the impact of bad data is just a minor inconvenience? Think again. 

Bad data cost Unity, a publicly-traded video game software development company, $110 million.

And that’s only the tip of the iceberg.

The Impact of Bad Data: A Case Study on Unity

Unity stock dropped 37% on May 11, 2022, after the company announced its first-quarter earnings, despite strong revenue growth, decent margins, good customer growth, and continued high performance in dollar-based net expansion. 

But there was one data point in Unity’s earnings that was not as positive.

The company also shared that its Operate revenue growth was still up but had slowed due to a fault in its platform that reduced the accuracy of its Audience Pinpointer tool.

The fault in Unity’s platform? Bad data.

Unity ingested bad data from a large customer into its machine learning algorithm, which helps place ads and allows users to monetize their games. This not only resulted in decreased growth but also ruined the algorithm, forcing the company to fix it to remedy the problem going forward.

The company’s management estimated the impact on the business at approximately $110 million in 2022.

Unity Isn’t Alone: The Impact of Bad Data is Everywhere

Unity isn’t the only company that has felt the impact of bad data deeply.

Take Twitter.

On April 25, 2022, Twitter accepted a deal to be purchased by Tesla and SpaceX founder Elon Musk. A mere 18 days later, Musk shared that the deal was “on hold” while he sought to confirm the number of fake accounts and bots on the platform.

What ensued demonstrates the deep impact of bad data on this extremely high-profile deal for one of the world’s most widely-used speech platforms. Notably, Twitter has battled this data problem for years. In 2017, Twitter admitted to overstating its user base for several years, and in 2016 a troll farm used more than 50,000 bots to try to sway the US presidential election. Twitter first acknowledged fake accounts during its 2013 IPO.

Now, this data issue is coming to a head, with Musk investigating Twitter’s claim that fake accounts represent less than 5% of the company’s user base and angling to reduce the previously agreed upon purchase price as a result.

Twitter, like Unity, is another high-profile example of the impact of bad data, but examples like this are everywhere – and they cost companies millions of dollars.

Gartner estimates that bad data costs companies nearly $13 million per year, although many don’t even realize the extent of the impact. Meanwhile, Harvard Business Review finds that knowledge workers spend about half of their time fixing data issues. Just imagine how much effort they could devote elsewhere if issues weren’t so prevalent.

Overall, bad data can lead to missed revenue opportunities, inefficient operations, and poor customer experiences, among other issues that add up to that multi-million dollar price tag.

Why Observability is Now Imperative for the C-Suite

The fact that bad data costs companies millions of dollars each year is bad enough. The fact that many companies don’t even realize this because they don’t measure the impact is potentially even worse. After all, how can you ever fix something of which you’re not fully aware?

Getting ahead of bad data issues requires data observability, which encompasses the ability to understand the health of data in your systems. Data observability is the only way that organizations can truly understand not only the impact of any bad data but also the causes of it – both of which are imperative to fixing the situation and stemming the impact.

It’s also important to embed data observability at every point possible with the goal of finding issues sooner in the pipeline rather than later because the further those issues progress, the more difficult (and more expensive) they become to fix.

Critically, this observability must be an imperative for C-suite leaders, as bad data can have a serious impact on company revenue (just ask Unity and Twitter). Making data observability a priority for the C-suite will help the entire organization – not just data teams – rally around this all-important initiative and make sure it becomes everyone’s responsibility.

This focus on end-to-end data observability can ultimately help:

  • Identify data issues earlier on in the data pipeline to stem their impact on other areas of the platform and/or business
  • Pinpoint data issues more quickly after they pop up to help arrive at solutions faster
  • Understand the extent of data issues that exist to get a complete picture of the business impact

In turn, this visibility can help companies recover more revenue faster by taking the necessary steps to mitigate bad data. Hopefully, the end result is a fix before the issues end up costing millions of dollars. And the only way to make that happen is if everyone, starting with the C-suite, prioritizes data observability.


What is Dark Data and How it Causes Data Quality Issues

Databand
2022-05-31 17:11:25

We’re all guilty of holding onto something that we’ll never use. Whether it’s old pictures on our phones, items around the house, or documents at work, there’s always that glimmer of thought that we just might need it one day.

It turns out businesses are no different. But in the business setting, it’s not called hoarding, it’s called dark data.

Simply put, dark data is any data that an organization acquires and stores during regular business activities that doesn’t actually get used in any way. No one analyzes it to gain insights, drive decisions, or make money – it just sits there.

Unfortunately, dark data can prove quite troublesome, causing a host of data quality issues. But it doesn’t have to be all bad. This article will explore what you need to know about dark data, including:

  • What is dark data
  • Why dark data is troublesome
  • How dark data causes data quality issues
  • The upside of dark data
  • Top tips to shine the light on dark data

What is dark data?

According to Gartner, dark data is “the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships, and direct monetizing). Storing and securing data typically incurs more expense (and sometimes greater risk) than value.”

And most companies have a lot of dark data. Carnegie Mellon University finds that about 90% of most organizations’ data is dark data, to be exact.

How did this happen? A lot of organizations operate in silos, and this can easily lead to situations in which one department could make use of the data that another department captures, but isn’t even aware that the data is being captured (and therefore never uses it).

We also got here because not too long ago we had the idea that it’s valuable to store all the information we could possibly capture in a big data lake. As data became more and more valuable, we thought maybe one day that data would be important – so we should hold onto it. Plus, data storage is cheap, so it was okay if it sat there totally unused. 

But maybe it’s not as good an idea as we once thought.

Why is dark data troublesome?

If the data could be valuable one day and data storage is cheap, what’s the big issue with it? There are three problems, for a start:

1) Liability

Often with dark data, companies don’t even know exactly what type of data they’re storing. And they could very well (and often do) have personally identifiable information sitting there without even realizing it. This could come from any number of places, such as transcripts from audio conversations with customers or data shared online. But regardless of the source, storing this data is a liability. 

A host of global privacy laws have been introduced over the past several years, and they apply to all data – even data that’s sitting unused in analytics repositories. As a result, it’s risky for companies to store this data (even if they’re not using it) because there’s a big liability if anyone accesses that information.

2) Accumulated costs

Data storage at the individual level might be cheap, but as companies continue to collect and store more and more data over time, those costs add up. Some studies show companies spend anywhere from $10,000 to $50,000 in storage just for dark data alone.

Getting rid of data that’s not used for any purpose could then lead to significant cost savings – savings that can be re-allocated to any number of more constructive (and less troublesome) purposes.

3) Opportunity costs

Finally, many companies are losing out on opportunities by not using this data. So while it’s good to get rid of data that’s actually not usable – due to risks and costs – it pays to first analyze what data is available.

In taking a closer look at their dark data, many companies may very well find that they can better manage and use that data to drive some interesting (and valuable!) insights about their customers or their own internal metrics. Hey, it’s worth a look.

How dark data causes data quality issues

Interestingly enough, sometimes dark data gets created because of data quality issues. Maybe it’s because incomplete or inaccurate data comes in, and therefore teams know they won’t use it for anything.

For example, perhaps it’s a transcript from an audio recording, but the AI that creates the transcript isn’t quite there yet and the result is rife with errors. Someone keeps the transcript though, thinking that they’ll resolve it at some point. This is an example of how data quality issues can create dark data.

In this way, dark data can often be used to understand the sources of poor data quality and their effects. Far too often, organizations aim to clean poor-quality data, but they miss what’s causing the issue. And without that understanding, it’s impossible to keep the data quality issue from recurring.

When this happens, the situation becomes very cyclical, because rather than simply purging dark data that sits around without ever getting used, organizations let it continue to sit – and that contributes to growing data quality issues.

Fortunately, there are three steps for data quality management that organizations can take to help alleviate this issue:

  1. Analyze and identify the “as is” situation, including the current issues, existing data standards, and the business impact in order to prioritize the issue.
  2. Prevent bad data from recurring by evaluating the root cause of the issues and applying resources to tackle that problem in a sustainable way.
  3. Communicate often along the way, sharing what’s happening, what the team is doing, the impact of that work, and how those efforts connect to business goals.

The upside of dark data

But for all the data quality issues that dark data can (and, let’s be honest, does) cause, it’s not all bad. As Splunk puts it, “dark data may be one of an organization’s biggest untapped resources.”

Specifically, as data remains an extremely valuable asset, organizations must learn how to use everything they have to their advantage. In other words, that nagging thought that the data just might be useful one day could actually be true. Of course, that’s only the case if organizations actually know what to do with that data… otherwise it will continue to sit around and cause data quality issues.

The key to getting value out of dark data? Shining the light on it by breaking down silos, introducing tighter data management, and, in some cases, not being afraid to let data go.

Top tips to shine the light on dark data

When it comes to handling dark data and potentially using it to your organization’s advantage, there are several best practices to follow:

  1. Break down silos: Remember earlier when we said that dark data often comes about because of silos across teams? One team creates data that could be useful to another, but that other team doesn’t know about it. Breaking down those silos instantly makes that data available to the team that needs it, and suddenly it goes from sitting around to providing immense value.
  2. Improve data management: Next, it’s important to really get a handle on what data exists. This starts by classifying all data within the organization to get a complete and accurate view. From there, teams can begin to organize data better with the goal of making it easier for individuals across teams to find and use what they need.
  3. Introduce a data governance policy: Finally, introducing a data governance policy can help improve the challenge long term. This policy should cover how all data coming in gets reviewed and offer clear guidelines for what should be retained (and if so, how it should be organized to maintain clear data management), archived, or destroyed. An important part of this policy is being strict about what data should be destroyed. Enforcing that policy and regularly reviewing practices can help eliminate dark data that will never really be used.

It’s time to solve the dark data challenge and restore data quality

Dark data is a very real problem. Far too many organizations hold onto data that never gets used, and while it might not seem like a big deal, it is. It can create liabilities, significant storage costs, and data quality issues. It can also lead to missed opportunities due to teams not realizing what data is potentially available to them.

Taking a proactive approach to managing this data can turn the situation around. By shining the light on dark data, organizations can not only reduce liabilities and costs, but also give teams the resources they need to better access data and understand what’s worth saving and what’s not. And doing so will also improve data quality. It’s a no-brainer.

What’s the Difference? Data Engineer vs Data Scientist vs Analytics Engineer?

Databand
2022-05-26 13:25:17

The modern data team is, well, complicated.

Even if you’re on the data team, keeping track of all the different roles and their nuances gets confusing – let alone if you’re a non-technical executive who’s supporting or working with the team.

One of the biggest areas of confusion? Understanding the differences between the data engineer, data scientist, and analytics engineer roles.

The three are closely intertwined. And as Josh Laurito, Director of Data at Squarespace and editor of NYC Data, tells us, there really is no single definition for each of these roles or for the lines between them. You can listen to our full discussion with Josh Laurito below.

But still, there are some standard differences everywhere you go. And that’s exactly what we’ll look at today.

What is a data engineer?

A data engineer develops and maintains data architecture and pipelines. Essentially, they build the programs that generate data and aim to do so in a way that ensures the output is meaningful for operations and analysis.

Some of their key responsibilities include:

  • Managing pipeline orchestration
  • Building and maintaining a data platform
  • Leading any custom data integration efforts
  • Optimizing data warehouse performance
  • Developing processes for data modeling and data generation
  • Standardizing data management practices

Important skills for data engineers include:

  • Expertise in SQL
  • Ability to work with structured and unstructured data
  • Deep knowledge in programming and algorithms
  • Experience with engineering and testing tools
  • Strong creative thinking and problem-solving abilities

What about an analytics engineer?

An analytics engineer brings together data sources in a way that makes it possible to drive consolidated insights. Importantly, they do the work of building systems that can model data in a clean, clear way repeatedly so that everyone can use those systems to answer questions on an ongoing basis. As one analytics engineer at dbt Labs puts it, a key part of analytics engineering is that “it allows you to solve hard problems once, then gain benefits from that solution infinitely.”

Some of their key responsibilities include:

  • Understanding business requirements and defining successful analytics outcomes
  • Cleaning, transforming, testing, and deploying data to be ready for analysis
  • Introducing definitions and documentation for key data and data processes
  • Bringing software engineering techniques like continuous integration to analytics code
  • Training others to use the end data for analysis
  • Consulting with data scientists and analysts on areas to improve scripts and queries

Important skills for analytics engineers include:

  • Expertise in SQL
  • Deep understanding of software engineering best practices
  • Experience with data warehouse and data visualization tools
  • Strong capabilities around maintaining multi-functional relationships
  • Background in data analysis or data engineering

So then what’s a data scientist?

A data scientist studies large data sets using advanced statistical analysis and machine learning algorithms. In doing so, they identify patterns in data to drive critical business insights, and then typically use those patterns to develop machine learning solutions for more efficient and accurate insights at scale. Critically, they combine this statistics experience with software engineering experience.

Some of their key responsibilities include:

  • Transforming and cleaning large data sets into a usable format
  • Applying techniques like clustering, neural networks, and decision trees to gain insights from data 
  • Analyzing data to identify patterns and spot trends that can impact the business
  • Developing machine learning algorithms to evaluate data
  • Creating data models to forecast outcomes

Important skills for a data scientist include:

  • Expertise in SAS, R, and Python
  • Deep expertise in machine learning, data conditioning, and advanced mathematics
  • Experience using big data tools
  • Understanding of API development and operations
  • Background in data optimization and data mining
  • Strong creative thinking and decision-making abilities

How does it all fit together?

Even seeing the descriptions of data engineer vs data scientist vs analytics engineer side-by-side can cause confusion, as there are certainly overlaps in skills and areas of focus across each of these roles. So how does it all fit together?

A data engineer builds programs that generate data, and while they aim for that data to be meaningful, it will still need to be combined with other sources. An analytics engineer brings together those data sources to build systems that allow users to access consolidated insights in an easy-to-access, repeatable way. Finally, a data scientist develops tools to analyze all of that data at scale and identify patterns and trends faster and better than any human could.

Critically, there needs to be a strong relationship between these roles. But too often, it ends up as dysfunctional. Jeff Magnuson, Vice President, Data Platform at Stitch Fix, wrote about this topic several years ago in an article titled Engineers Shouldn’t Write ETL. The crux of his article was that teams shouldn’t have separate “thinkers” and “doers”. Rather, high-functioning data teams need end-to-end ownership of the work they produce, meaning that there shouldn’t be a “throw it over the fence” mentality between these roles. 

The result is a high demand for data scientists who have an engineering background and understand things like how to build repeatable processes and the importance of uptime and SLAs. In turn, this approach has an impact on the role of data engineers, who can then work side-by-side with data scientists in an entirely different way. And of course, that cascades to analytics engineers as well.

Understanding the difference between data engineer vs data scientist vs analytics engineer once and for all – for now

The truth remains that many organizations define each of these roles differently. It’s difficult to draw a firm line between where one ends and where one begins because they all have similar tasks to some extent. As Josh Laurito concludes: “Everyone writes SQL. Everyone cares about the quality. Everyone evaluates different tables and writes data somewhere, and everyone complains about time zones. Everyone does a lot of the same stuff. So really the way we [at Squarespace] divide things is where people are in relation to our primary analytical data stores.”

At Squarespace, this means data engineers are responsible for all the work done to create and maintain those stores; analytics engineers are embedded into the functional teams to support decision making, put together narratives around the data, and use that to drive action and decisions; and data scientists sit in the middle, setting up the incentive structures and the metrics to make decisions and guide people.

Of course, it will be slightly different for every organization. And as blurry as the lines are now, each of these roles will only continue to evolve and further shift the dynamics across each of them. But hopefully, this overview helps solve the question of what’s the difference between data engineer vs data scientist vs analytics engineer – for now.

The Data Value Chain: Data Observability’s Missing Link

Databand
2022-05-18 13:18:01

Data observability is an exploding category. It seems like there is always news of another data observability tool receiving funding, an existing tool announcing expanded functionality, or a new product in the category being dreamt up. After a bit of poking around, you’ll notice that many of them claim to do the same thing: end-to-end data observability. But what does that really mean, and what’s a data value chain?

For data analysts, end-to-end data observability feels like having monitoring capabilities for their warehouse tables  — and if they’re lucky, they have some monitoring for the pipelines that move the data to and from their warehouse as well.

The story is a lot more complicated for many other organizations that are more heavily skewed towards data engineering. For them, that isn’t end-to-end data observability. That’s “The End” data observability. Meaning: this level of observability only gives visibility into the very end of the data’s lifecycle. This is where the data value chain becomes an important concept.

For many data products, data quality is determined from the very beginning: when data is first extracted and enters your system. Therefore, shifting data observability left of the warehouse is the best way to move your data operations from a reactive data quality management framework to a proactive one.


What is the Data Value Chain?

When people think of data, they often think of it as a static object; a point on a chart, a number in a dashboard, or a value in a table. But the truth is data is constantly changing and transforming throughout its lifecycle. And that means what you define as “good data quality” is different for each stage of that lifecycle.

“Good” data quality in a warehouse might be defined by its uptime. Going to the preceding stage in the life cycle, that definition changes. Data quality might be defined by its freshness and format. Therefore, your data’s quality isn’t some static binary. It’s highly dependent on whether things went as expected in the preceding step of its lifecycle.

Shani Keynan, our Product Director, calls this concept the data value chain.

“From the time data is ingested, it’s moving and transforming. So, only looking at the data tables in your warehouse or your data’s source, or only looking at your data pipelines, it just doesn’t make a lot of sense. Looking only at one of those, you don’t have any context.

You need to look at the data’s entire journey. The thing is, when you’re a data-intensive company who’s using lots of external APIs and data sources, that’s a large part of the journey. The more external sources you have, the more vulnerable you are to changes you can’t predict or control. Covering the hard ground first, at the data’s extraction, makes it easier to catch and resolve problems faster since everything downstream depends on those deliveries.”

The question of whether data will drive value for your business is defined by a series of If-Then statements:

  1. If data has been ingested correctly from our data sources, then our data will be delivered to our lake as expected.
  2. If data is delivered & grouped in our lake as expected, then our data will be able to be aggregated & delivered to our data warehouse as expected.
  3. If data is aggregated & delivered to our data warehouse as expected, then the data in our warehouse can be transformed.
  4. If data in our warehouse can be transformed correctly, then our data will be able to be queried and will provide value for the business.

Let us be clear: this is an oversimplification of the data’s life cycle. That said, it illustrates how having observability only for the tables in your warehouse & the downstream pipelines leaves you in a position of blind faith.

In the ideal world, you would be able to set up monitoring capabilities & data health checkpoints everywhere in your system. This is no small project for most data-intensive organizations; some would even argue it’s impractical.

Realistically, one of the best places to start your observability initiative is at the beginning of the data value chain; at the data extraction layer.
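
To make that concrete, here is a minimal sketch of a health checkpoint placed at the extraction layer, before anything is loaded downstream; the expected schema and minimum row count are made-up values.

```python
# A minimal sketch of a data health checkpoint at the extraction layer,
# before data is loaded anywhere downstream. The expected schema and the
# minimum row count are made-up values for illustration.
from typing import Any

EXPECTED_COLUMNS = {"listing_id": int, "price": float, "state": str}
MIN_ROWS = 1_000


def check_extraction(rows: list[dict[str, Any]]) -> list[str]:
    """Return a list of problems found in a freshly extracted batch."""
    problems = []
    if len(rows) < MIN_ROWS:
        problems.append(f"volume anomaly: only {len(rows)} rows extracted")
    for column, expected_type in EXPECTED_COLUMNS.items():
        missing = sum(1 for r in rows if r.get(column) is None)
        if missing:
            problems.append(f"{column}: {missing} null values")
        wrong_type = sum(
            1 for r in rows
            if r.get(column) is not None and not isinstance(r[column], expected_type)
        )
        if wrong_type:
            problems.append(f"{column}: {wrong_type} values with unexpected type")
    return problems


# In a pipeline task, failing fast here keeps bad data out of the warehouse:
# problems = check_extraction(extracted_rows)
# if problems:
#     raise ValueError(f"extraction checkpoint failed: {problems}")
```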

Data Value Chain + Shift-left Data Observability

If you are one of these data-driven organizations, how do you set your data team up for success?

While it’s important to have observability of the critical “checkpoints” within your system, the most important checkpoint you can have is at the data collection process. There are two reasons for that:

#1 – Ingesting data from external sources is one of the most vulnerable stages in your data model.

As a data engineer, you have some degree of control over your data & your architecture. But what you don’t control is your external data sources. When you have a data product that depends on external data arriving on time in order to function, that lack of control is an extremely painful experience.

This is best highlighted in an example. Let’s say you are running a large real estate platform called Willow. Willow is a marketplace where users can search for homes and apartments to buy & rent across the United States.

Willow’s goal is to give users all the information they need to make a buying decision; things like listing price, walkability scores, square footage, traffic scores, crime & safety ratings, school system ratings, etc.

In order to calculate “Traffic Score” for just one state in the US, Willow might need to ingest data from 3 external data sources. There are 50 states, so that means you suddenly have 150 external data sources you need to manage. And that’s just for one of your metrics.

Here’s where the pain comes in: You don’t control these sources. You don’t get a say whether they decide to change their API to better fit their data model. You don’t get to decide whether they drop a column from your dataset. You can’t control if they miss one of their data deliveries and leave you hanging.

All of these factors put your carefully crafted data model at risk. All of them can break your pipelines downstream that follow strictly coded logic. And there’s really nothing you can do about it except catching it as early as you can.

Having data observability in your data warehouse doesn’t do much to solve this problem. It might alert you that there is bad data in your warehouse, but by that point, it’s already too late.

This brings us to our next point…

#2 – It makes the most sense for your operational flow.

In many large data organizations, data in the warehouse is automatically utilized in business processes. If something breaks your data collection processes, bad data gets populated into your product dashboards and analytics, and stakeholders have no way of knowing that the data they are being served is no good.

This can lead to some tangible losses. Imagine if there was a problem calculating a Comparative Analysis of home sale prices in the area. Users may lose trust in your data and stop using your product.

In this situation, what does your operational flow for incident management look like?

You receive some complaints from business stakeholders or customers, you have to invest a lot of engineering hours to perform root cause analysis, fix the issue, and backfill the data. All the while consumer trust has gone down, and SLAs have already been missed. DataOps is in a reactive position.


When you have data observability for your ingestion layer, there’s still a problem in this situation, but the way DataOps can handle this situation is very different:

  • You know that there will be a problem.     
  • You know exactly which data source is causing the problem.
  • You can project how this will affect downstream processes. You can make sure everyone downstream knows that there will be a problem so you can prevent the bad data from being used in the first place.
  • Most importantly, you can get started resolving the problem early & begin working on a way to prevent that from happening again.

You cannot achieve that level of prevention when your data observability starts at your warehouse.

Bottom Line: Time To Shift Left

DataOps is learning many of the same, hard lessons as DevOps has. Just as application observability is the most effective when shifted left, the same applies to data operations. It saves money; it saves time; it saves headaches. If you’re ingesting data from many external data sources, your organization cannot afford to focus all its efforts on the warehouse. You need real end-to-end data observability. And luckily, there’s a great data observability platform made to do just that.


Data Replication: The Basics, Risks, and Best Practices

Databand
2022-04-27 13:20:30

Data-driven organizations are poised for success. They can make more efficient and accurate decisions and their employees are not impeded by organizational silos or lack of information. Data replication enables leveraging data to its full extent. But how can organizations maximize the potential of data replication and make sure it helps them meet their goals? Read on for all the answers.

What is Data Replication?

Data replication is the process of copying or replicating data from the main organizational server or cloud instance to other cloud or on-premises instances at different locations. Thanks to data replication, organizational users can access the data they need for their work quickly and easily, wherever they are in the world. In addition, data replication ensures organizations have backups of their data, which is essential in case of an outage or disaster. In other words, data replication creates data availability at low latency.

Data replication can take place either synchronously or asynchronously. Synchronous replication means data is copied to the main server and all replica servers at the same time. Asynchronous replication means data is first copied to the main server and only then copied to replica servers, often at scheduled intervals.

Why Data Replication is Necessary

Data replication ensures that organizational data is always available to all stakeholders. By replicating data across instances, organizations can ensure:

Scalability

Data scalability is the ability to handle changing demands by continuously adapting resources. Replication of data across multiple servers builds scalability and ensures the availability of consistent data to all users at all times.

Disaster Protection

Electrical outages, cybersecurity attacks and natural disasters can cause systems and instances to crash and no longer be available. By replicating data across multiple instances, data is backed up and always accessible to any stakeholder. This ensures system robustness, organizational reliability and security.

Speed / Latency

Data that has to travel across the globe creates latency. This creates a poor user experience, which can be felt especially in real-time based applications like gaming or recommendation systems, or resource-heavy systems like design tools. By distributing the data globally it travels a shorter distance to the end user, which results in increased speed and performance.

Test System Performance

By distributing and synchronizing data across multiple test systems, data becomes more accessible. This availability improves their performance.

An Example of Data Replication

Organizations that have multiple branch offices across a number of continents can benefit from data replication. If organizational data only resides on servers in Europe, users from Asia, North America and South America will experience latency when attempting to read the data. But by replicating data across instances in San Francisco, São Paulo, New York, London, Berlin, Prague, Tel Aviv, Hyderabad, Singapore and Melbourne, for example, access times improve significantly for all users.

Data Replication Variations

Types of Data Replication

Replication systems vary. Therefore, it is important to distinguish which type is a good fit for your organizational infrastructure needs and business goals. There are three main types of data replication systems:

Transactional Replication

Transaction replication consists of databases being copied in their entirety from the primary server (the publisher) and sent to secondary servers (subscribers). Any data changes are consistently and continuously updated. Transactional consistency is ensured, which means that data is replicated in real-time and sent from the primary server to secondary servers in the order of their occurrence. As a result, transactional replication makes it easy to track changes and any lost data. This type of replication is commonly used in server-to-server environments.

Snapshot Replication

In the snapshot replication type, a snapshot of the database is distributed from the primary server to the secondary servers. Instead of continuous updates, data is sent as it exists at the time of the snapshot. It is recommended to use this type of replication when there are not many data changes or at the initial synchronization between the publisher and subscriber.

Merge Replication

A merge replication consists of two databases being combined into a single database. As a result, any changes to data can be updated from the publisher to the subscribers. This is a complex type of replication since both parties (the primary server and the secondary servers) can make changes to the data. It is recommended to use this type of replication in a server-to-client environment.

Comparison Table: Transactional Replication vs. Snapshot Replication vs. Merge Replication

Type          | How data is copied                                                        | Recommended use
Transactional | Continuous, real-time updates sent in order of occurrence                 | Server-to-server environments
Snapshot      | Point-in-time copy of the database                                        | Few data changes, or initial synchronization
Merge         | Two databases combined; both publisher and subscribers can make changes  | Server-to-client environments

Schemes of Replication

Replication schemes are the operations and tasks required to perform replication. There are three main replication schemes organizations can choose from:

Full Replication

Full replication occurs when the entire database is copied in its entirety to every site in the distributed system. This scheme improves data availability and accessibility through database redundancy. In addition, performance is improved because global distribution of data reduces latency and accelerates query execution. On the other hand, it is difficult to achieve concurrency and update processes are slow.


Partial Replication

In a partial replication scheme, some sections of the database are replicated across some or all of the sites. The description of these fragments can be found in the replication schema. Partial replication enables prioritizing which data is important and should be replicated as well as distributing resources according to the needs of the field.


No Replication

In this scheme, data is stored on one site only. This enables easily recovering data and achieving concurrency. On the other hand, it negatively impacts availability and performance.


Techniques of Replication

Replicating data can take place through different techniques. These include:

Full-table Replication

In a full-table replication, all data is copied from the source to the destination. This includes new data, as well as existing data. It is recommended to use this technique if records are regularly deleted or if other techniques are technically impossible. On the other hand, this technique requires more processing and network resources and the cost is higher.

Key-based Replication

In key-based replication, only new data that has been added or changed since the previous update is copied. This technique is more efficient since fewer rows are copied. On the other hand, it cannot capture hard deletes: a row deleted at the source since the last update never appears in the replicated data.
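
For illustration, here is a simplified sketch of key-based replication driven by an updated_at column, using SQLite as a stand-in for the real source and destination databases; the table and column names are made up, and note that hard-deleted rows would never appear in this query.

```python
# A simplified sketch of key-based (incremental) replication driven by an
# updated_at replication key. SQLite stands in for the real source and
# destination databases; the orders table (with a primary key on id) is assumed
# to exist on both sides.
import sqlite3


def replicate_incrementally(source: sqlite3.Connection, dest: sqlite3.Connection) -> int:
    # 1. Find the high-water mark already present at the destination.
    (last_synced,) = dest.execute(
        "SELECT COALESCE(MAX(updated_at), '1970-01-01') FROM orders"
    ).fetchone()

    # 2. Pull only rows changed since then from the source.
    new_rows = source.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_synced,),
    ).fetchall()

    # 3. Upsert them at the destination. Hard deletes at the source are never
    #    seen here, which is the technique's main blind spot.
    dest.executemany(
        "INSERT OR REPLACE INTO orders (id, amount, updated_at) VALUES (?, ?, ?)",
        new_rows,
    )
    dest.commit()
    return len(new_rows)
```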

Log-based Replication

Log-based replication replicates changes to the database by reading the database’s log file. It applies only to database sources and has to be supported by the database. This technique is recommended when the source database structure is static; otherwise it can become a very resource-intensive process.

Cloud Migration + Data Replication

When organizations digitally transform their infrastructure and migrate to the cloud, data can be replicated to cloud instances. By replicating data to the cloud, organizations can enjoy its benefits: scalability, global accessibility, data availability and easier maintenance. This means organizational users benefit from data that is more accessible, usable and reliable, which eliminates internal silos and increases business agility.

Data Risks in the Replication Process

When replicating data to the cloud, it is important to monitor the process. The growing complexity of data systems as well as the increased physical distance between servers within a system could pose some risks.

These risks include:

Inconsistency

Data schema and data profiling anomalies, like null counts, type changes and skew.

Data Loss

Not all data being successfully migrated from the sources to the target instances.

Delays

Data not being successfully migrated on time.

Data Replication Management + Observability

By implementing a management system to oversee and monitor the replication process, organizations can significantly reduce the risks involved in data replication. A data observability platform will ensure:

  • Data is successfully replicated to other instances, including cloud instances
  • Replication and migration pipelines are performing as expected
  • Any broken pipelines or irregular data volumes trigger alerts so they can be fixed
  • Data is delivered on time 
  • Delivered data is reliable, so organizational stakeholders can use it for analytics

Monitoring

By monitoring the data pipelines that take part in the replication process, organizations and their DataOps engineer can ensure the data propagated through the pipeline is accurate, complete and reliable. This ensures data replicated to all instances can be reliably used by stakeholders. An effective monitoring system will be:

  • Granular – specifically indicating where the issue is
  • Persistent – following lineage to understand where errors began
  • Automated – reducing manual errors and enabling the use of thresholds
  • Ubiquitous – covering the pipeline end-to-end
  • Timely – enabling catching errors on time before they have an impact

Learn more about data monitoring here.

Tracking

Tracking pipelines enables systematic troubleshooting, so that any errors are identified and fixed on time. This ensures users constantly benefit from updated, reliable and healthy data in their analyses. There are various types of metadata that can be tracked, like task duration, task status, when data was updated, and more. By tracking and alerting (see below) in case of anomalies, DataOps engineers ensure data health.

Alerting

Alerting on data and data pipeline anomalies is an essential step that closes the observability loop. Alerting DataOps engineers gives them the opportunity to fix any data health issues that might affect data replication across various instances.

Within existing data systems, data engineers can trigger alerts for:

  • Missed data deliveries
  • Unexpected schema changes
  • SLA misses
  • Anomalies in column-level statistics like nulls and distributions
  • Irregular data volumes and sizes
  • Pipeline failures, inefficiencies, and errors

By proactively setting up alerts and monitoring them through dashboards and other tools of your choice (Slack, Pagerduty, etc.), organizations can truly maximize the potential of data replication for their business.
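
As a rough sketch of what one such alert might look like in code, the snippet below checks a delivered table’s row volume against an expected range and posts to a Slack incoming webhook; the thresholds, table name, and webhook URL are placeholders.

```python
# A simple illustration of a proactive replication alert: check a delivery's
# row volume against an expected range and notify a Slack channel if it is
# irregular. The thresholds, table name, and webhook URL are placeholders.
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder
EXPECTED_ROWS = (90_000, 110_000)  # normal daily volume for this table


def alert_if_irregular(table: str, row_count: int) -> None:
    low, high = EXPECTED_ROWS
    if low <= row_count <= high:
        return  # volume looks normal, no alert
    message = (
        f":rotating_light: Replication volume anomaly on `{table}`: "
        f"{row_count} rows delivered, expected {low}-{high}."
    )
    requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=10)


alert_if_irregular("orders_replica", row_count=12_345)
```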

Conclusion

Data replication holds great promise for organizations. By replicating data to multiple instances, they can ensure data availability and improved performance, as well as internal “insurance” in case of a disaster. This page covers the basics for any business or data engineer getting started with data replication: the variations, schemes and techniques, as well as more advanced content for monitoring the process to gain observability and reduce the potential risk.

Wherever you are on your data replication journey, we recommend auditing your pipelines to ensure data health. If you need help finding and fixing data health issues fast, click here.

The Top Data Quality Metrics You Need to Know (With Examples)

Databand
2022-04-20 14:17:41

Data quality metrics can be a touchy subject, especially within the focus of data observability.

A quick google search will show that data quality metrics involve all sorts of categories. 

For example, completeness, consistency, conformity, accuracy, integrity, timeliness, continuity, availability, reliability, reproducibility, searchability, comparability, and probably ten other categories I forgot to mention all relate to data quality. 

So what are the right metrics to track? Well, we’re glad you asked. 🙂 

We’ve compiled a list of the top data quality metrics that you can use to measure the quality of the data in your environment. Plus, we’ve added a few screenshots that highlight each data quality metric you can view in Databand’s observability platform.

Take a look and let us know what other metrics you think we need to add!

Collection Data Quality Metrics

The Top 9 Data Quality Metrics

Metric 1: # of Nulls in Different Columns 

Who’s it for? 

  • Data engineers
  • Data analysts

How to track it? 

Calculate null counts, non-null counts, and null percentages per column so users can set alerts on those metrics.
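
For example, with pandas (a minimal sketch; the DataFrame and the 25% threshold are hypothetical):

import pandas as pd

# Hypothetical extract from a dataset with missing values.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, None, 12, None],
    "amount": [9.5, 12.0, None, 7.25],
})

null_metrics = pd.DataFrame({
    "null_count": df.isna().sum(),
    "non_null_count": df.notna().sum(),
    "null_pct": (df.isna().mean() * 100).round(1),
})
print(null_metrics)

# Example alert rule: flag any column whose null percentage crosses a threshold.
ALERT_THRESHOLD_PCT = 25
for column, pct in null_metrics["null_pct"].items():
    if pct > ALERT_THRESHOLD_PCT:
        print(f"ALERT: {column} is {pct}% null")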

Why it’s important?

Since a null is the absence of value, you want to be aware of any nulls that pass through your data workflows. 

For example, downstream processes might be damaged if the data used is now “null” instead of actual data.

Dropped columns

The values of a column might be “dropped” by mistake when the data processes are not performing as expected. 

This might cause the entire column to disappear, which would make the issue easier to see. But sometimes, all of its values will be null.

Data drift

The data of a column might slowly drift into “nullness.” 

This is more difficult to detect than the above since the change is more gradual. Monitoring anomalies in the percentage of nulls across different columns should make it easier to see.

What’s it look like?

Data Quality Metrics Null Count

Metric 2: Frequency of Schema Changes

Who’s it for?

  • Data engineers
  • Data scientists
  • Data analysts

How to track it? 

Track all changes in the schema for all the datasets related to a certain job.

Why it’s important?

Schema changes are key signals of bad quality data. 

In a healthy situation, schema changes are communicated in advance and are not frequent since many processes rely on the number of columns and their type in each table to be stable. 

Frequent changes might indicate an unreliable data source and problematic DataOps practices, resulting in downstream data issues.

Examples of changes in the schema can be: 

  • Column type changes
  • New columns 
  • Removed columns

Go beyond having a good understanding of what changed in the schema and evaluate the effect this change will have on downstream pipelines and datasets.
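
A minimal sketch of such a check in Python, comparing a stored schema snapshot with the current one (the column names and types are hypothetical):

def diff_schema(previous: dict, current: dict) -> list[str]:
    """Compare two {column: type} snapshots and describe what changed."""
    changes = []
    for col in current.keys() - previous.keys():
        changes.append(f"new column: {col} ({current[col]})")
    for col in previous.keys() - current.keys():
        changes.append(f"removed column: {col}")
    for col in previous.keys() & current.keys():
        if previous[col] != current[col]:
            changes.append(f"type change: {col} {previous[col]} -> {current[col]}")
    return changes

# Hypothetical snapshots taken on two consecutive runs of the same job.
yesterday = {"order_id": "int", "amount": "float", "status": "str"}
today     = {"order_id": "int", "amount": "str", "country": "str"}

for change in diff_schema(yesterday, today):
    print("SCHEMA CHANGE:", change)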

What’s it look like?

Data Quality Metrics Schema change
Data Quality Metrics Alert

Metric 3: Data Lineage, Affected Processes Downstream

Who’s it for? 

  • Data engineers
  • Data analysts

How to track it? 

Track the data lineage to find the assets that appear downstream from a dataset with an issue. This includes datasets and pipelines that consume the upstream dataset’s data.

Why it’s important?

The more damaged data assets (datasets or pipelines) there are downstream, the bigger the issue’s impact. This metric helps data engineers understand the severity of the issue and how quickly it needs to be fixed.

It is also an important metric for data analysts because downstream datasets often feed their company’s BI reports.
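
A rough sketch of how downstream impact can be estimated from lineage metadata, using a breadth-first walk over a hypothetical dependency graph:

from collections import deque

# Hypothetical lineage graph: each asset maps to the assets that consume it.
DOWNSTREAM = {
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["marts.daily_revenue", "marts.customer_ltv"],
    "marts.daily_revenue": ["bi.revenue_dashboard"],
    "marts.customer_ltv": [],
    "bi.revenue_dashboard": [],
}

def affected_downstream(asset: str) -> set[str]:
    """Breadth-first walk of the lineage graph to find every impacted asset."""
    impacted, queue = set(), deque(DOWNSTREAM.get(asset, []))
    while queue:
        node = queue.popleft()
        if node not in impacted:
            impacted.add(node)
            queue.extend(DOWNSTREAM.get(node, []))
    return impacted

impacted = affected_downstream("raw.orders")
print(f"{len(impacted)} downstream assets affected:", sorted(impacted))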

What’s it look like?

Data Quality Metrics Lineage

Metric 4: # of Pipeline Failures 

Who’s it for? 

  • Data engineers
  • Data executives

How to track it? 

Track the number of failed pipelines over time. 

Use tools that help you understand why the pipeline failed: root cause analysis through the error widget and logs, and the ability to dive into all the tasks that the DAG contains.
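
As a simple illustration (not tied to any specific orchestrator), failures can be counted over time from run records like these hypothetical ones:

from collections import Counter
from datetime import date

# Hypothetical run history pulled from an orchestrator's metadata (e.g. DAG runs).
runs = [
    {"pipeline": "orders_etl", "run_date": date(2022, 4, 18), "state": "success"},
    {"pipeline": "orders_etl", "run_date": date(2022, 4, 19), "state": "failed"},
    {"pipeline": "orders_etl", "run_date": date(2022, 4, 19), "state": "failed"},
    {"pipeline": "orders_etl", "run_date": date(2022, 4, 20), "state": "success"},
]

failures_per_day = Counter(
    run["run_date"] for run in runs if run["state"] == "failed"
)
for day, count in sorted(failures_per_day.items()):
    print(f"{day}: {count} failed run(s)")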

Why it’s important?

The more pipelines fail, the more data health issues you’ll have.

Each pipeline failure can cause missing data operations, schema changes, and data freshness problems.

If you’re experiencing many failures, this indicates severe problems at the root that need to be addressed.

What’s it look like?

Data Quality Metrics Error widget, pipeline, tasks

Metric 5: Pipeline Duration

Who’s it for? 

  • Data engineers

How to track it? 

The team can track this with the Airflow syncer, which reports on the total duration of a DAG run, or by using our tracking context as part of the Databand SDK.
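
Here is a generic sketch of a duration check in plain Python; it is not the Airflow syncer or the Databand SDK, and the pipeline names, expected durations, and tolerance are hypothetical:

import time

# Hypothetical expected durations per pipeline (seconds), based on recent history/SLAs.
EXPECTED_DURATION_SEC = {"orders_etl": 600, "customer_ltv": 1800}
TOLERANCE = 1.5  # alert when a run takes 50% longer than expected

def check_duration(pipeline: str, duration_sec: float) -> None:
    """Print an alert when a run exceeds its expected duration by the tolerance."""
    expected = EXPECTED_DURATION_SEC[pipeline]
    if duration_sec > expected * TOLERANCE:
        print(f"ALERT: {pipeline} took {duration_sec:.0f}s "
              f"(expected around {expected}s)")

start = time.monotonic()
# ... pipeline work would run here ...
elapsed = time.monotonic() - start
check_duration("orders_etl", elapsed)   # normal run, no alert
check_duration("orders_etl", 1900)      # simulated slow run triggers an alert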

Why it’s important?

Pipelines that work in complex data processes are usually expected to have similar duration across different runs. 

In these complex environments, downstream pipelines depend on upstream pipelines processing the data within certain SLAs.

The effect of extreme changes in the pipeline’s duration can be anywhere between the processing of stale data and a failure of downstream processes.

What’s it look like?

Data Quality Metrics Pipeline duration

Metric 6: Missing Data Operations

Who’s it for? 

  • Data engineers
  • Data scientists
  • Data analysts
  • Data executives

How to track it? 

Track all the operations related to a particular dataset.

A data operation is a combination of a task in a specific pipeline that reads or writes to a table. 
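
A minimal way to reason about this is to compare the set of expected operations with the operations actually observed in a run; everything below is a hypothetical illustration:

# Each data operation is (pipeline_task, dataset, "read" | "write").
EXPECTED_OPERATIONS = {
    ("ingest_orders", "raw.orders", "write"),
    ("clean_orders", "raw.orders", "read"),
    ("clean_orders", "staging.orders_clean", "write"),
}

# Operations actually observed during today's runs: the write to
# staging.orders_clean never happened.
observed_operations = {
    ("ingest_orders", "raw.orders", "write"),
    ("clean_orders", "raw.orders", "read"),
}

missing = EXPECTED_OPERATIONS - observed_operations
for task, dataset, op in sorted(missing):
    print(f"MISSING OPERATION: task '{task}' did not {op} dataset '{dataset}'")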

Why it’s important?

When a certain data operation is missing, it triggers a chain of issues in your data stack: failed pipelines, schema changes, and delays.

Also, the downstream consumers of this data will be affected by the data that didn’t arrive.  

A few examples include: 

  • The data analyst who is using this data for analysis 
  • The ML models used by the data scientist
  • The data engineers in charge of the data.

What’s it look like?

Data Quality Metrics Missing dataset
Data Quality Metrics dbnd alert

Metric 7: Record Count in a Run

Who’s it for? 

  • Data engineers
  • Data analysts

How to track it? 

Track the number of rows written to a dataset.

Why it’s important?

A sudden change in the expected number of table rows signals that too much, or too little, data is being written.

Using anomaly detection in the number of rows in a dataset provides a good way of checking that nothing suspicious has happened.
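
For example, a simple z-score check against recent history (a sketch; the counts and the 3-sigma threshold are illustrative):

from statistics import mean, stdev

def row_count_is_anomalous(history: list[int], current: int, z: float = 3.0) -> bool:
    """Return True if the current row count deviates strongly from past runs."""
    if len(history) < 3:
        return False                      # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(current - mu) > z * sigma

# Hypothetical row counts written by the last runs of the same task.
history = [50_100, 49_800, 50_500, 50_050, 49_900]

print(row_count_is_anomalous(history, 50_200))   # False: within the normal range
print(row_count_is_anomalous(history, 4_000))    # True: far too few rows written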

What’s it look like?

Data Quality Metrics Record count in a run

Metric 8: # of Tasks Read From Dataset

Who’s it for? 

  • Data engineers

How to track it? 

Count the number of tasks that read from a certain dataset. The more tasks read from it, the more central and important that dataset is.

Why it’s important?

Understanding the importance of the dataset is crucial for impact analysis and realizing how fast you should deal with the issue you have.

What’s it look like?

Data Quality Metrics - Tasks Read from Dataset

Metric 9: Data Freshness (SLA alert)

Who’s it for? 

  • Data engineers
  • Data scientists
  • Data analysts

How to track it? 

Track the pipelines that are scheduled to write to a certain dataset.

Why it’s important?

Stale, outdated data feeds downstream reports with incorrect information, which is then consumed across the business.

A good way to track data freshness is to monitor your SLAs and get notified of delays in the pipelines that should write to the dataset.
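
A minimal freshness check might compare each dataset’s last successful write time against its SLA (a sketch with hypothetical dataset names and SLAs):

from datetime import datetime, timedelta, timezone

# Hypothetical freshness SLAs: how old the latest write to a dataset may be.
FRESHNESS_SLA = {
    "marts.daily_revenue": timedelta(hours=24),
    "staging.orders_clean": timedelta(hours=2),
}

# Last successful write times, as reported by pipeline tracking (hypothetical).
last_written = {
    "marts.daily_revenue": datetime.now(timezone.utc) - timedelta(hours=30),
    "staging.orders_clean": datetime.now(timezone.utc) - timedelta(minutes=45),
}

now = datetime.now(timezone.utc)
for dataset, sla in FRESHNESS_SLA.items():
    age = now - last_written[dataset]
    if age > sla:
        print(f"SLA MISS: {dataset} was last updated {age} ago (SLA: {sla})")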

What’s it look like?

Data Quality Metrics SLA alert

Wrapping it up

And that’s a quick look at some of the top data quality metrics you need to know to deliver more trustworthy data to the business. 

Check out how you can build all these metrics in Databand today.

What is a Data Catalog? Overview and Top Tools to Know

Databand
2022-04-14 12:01:00

Intro to Data Catalogs

A data catalog is an inventory of all of an organization’s data assets. A data catalog includes assets like machine learning models, structured data, unstructured data, data reports, and more. By leveraging data management tools, data analysts, data scientists, and other data users can search through the catalog, find the organizational data they need, and access it.

Governance of data assets in a data catalog is enabled through metadata. The metadata is used for mapping, describing, tagging, and organizing the data assets. As a result, it can be leveraged to enable data consumers to efficiently search through assets and get information on how to use the data. Metadata can also be used for augmenting data management, by enabling onboarding automation, anomaly alerts, auto-scaling, and more.

In addition to indexing the assets, a data catalog usually includes data access and data searching capabilities, as well as tools for enriching the metadata, both manually and automatically. It also provides capabilities for ensuring compliance with privacy regulations and security standards.

In modern organizations, data catalogs have become essential for leveraging the large amounts of data generated. Efficient data analysis and consumption can help organizations make better decisions, so they can optimize operations, build better models, increase sales, and more.

Data Catalog Benefits (Why Do You Need a Data Catalog?)

A data catalog provides multiple benefits to data professionals, business analysts, and organizations. These include:

User Autonomy

Data professionals and other data consumers can find data, evaluate it and understand how to use it – all on their own. With a data catalog, they no longer have to rely on IT or other professional personnel. Instead, they can immediately search for the data they need and use it. This speed and independence enable injecting data into more business operations. It also improves employee morale.

Improved Data Context and Quality

The metadata and comments on the data from other data citizens can help data consumers better understand how to use it. This additional information creates context, improves data quality, and encourages data usage, innovation, and new business ideas.

Organizational Efficiency

Accessible data reduces operational friction and bottlenecks, like back and forth emails, which optimizes the use of organizational resources. Available data also accelerates internal processes. When data consumers get the data and understand how to use it faster, data analysis and implementation take place faster as well, benefiting the business.

Compliance and Security 

Data catalogs ensure data assets comply with privacy standards and security regulations, reducing the risk of data breaches, cyberattacks, and legal fiascos.

New Business Opportunities

When data citizens are given new information they can incorporate into their work and decision-making, they find new ways to answer work challenges and achieve their business goals. This can open up new business opportunities, across all departments.

Better Decision Making

Lack of data visibility makes organizations rely on tribal knowledge, fall back on data they are already familiar with, or recreate assets that already exist. This creates organizational data silos, which impede productivity. Enabling data access to everyone improves the ability to find and use data consistently and continuously across the organization.

What Does a Data Catalog Contain?

Different data catalogs offer somewhat different features. However, to enable data governance and advanced analysis, they should all provide the following to data consumers:

Metadata

Technical Metadata

The data that describes the structure of the objects, like tables, schemas, columns, rows, file names, etc.

Business Metadata

Data about the business value of the data, like its purpose, compliance info, rating, classification, etc.

Process Metadata

Data about the asset creation process and lineage, like who changed it and when, permissions, latest update time, etc.

Search Capabilities

Searching, browsing, and filtering options to enable data consumers to easily find the relevant data assets.

Metadata Enrichment

The ability to automatically enrich metadata through mappings and connections, as well as letting data citizens manually contribute to the metadata.

Compliance Capabilities

Embedded capabilities that ensure data can be trusted and no sensitive data is exposed. This is important for complying with regulations, standards, and policies. 

Asset Connectivity

The ability to connect to and automatically map all types of data sources your organization uses, wherever they reside.

In addition, technologically advanced, enterprise-grade data catalogs implement AI and machine learning capabilities.

Data Catalog Use Cases

Data catalogs can and should be consumed by all people in the organization. Some popular use cases include:

  • Optimizing the data pipeline
  • Data lake modernization
  • Self-service analytics
  • Cloud spend management
  • Advanced analytics
  • Reducing fraud risk
  • Compliance audits
  • And more

Who Uses a Data Catalog?

A data catalog can be used by data-savvy citizens, like data analysts, data scientists and data engineers. But all business employees – product, marketing, sales, customer success, etc – can work with data and benefit from a data catalog. Data catalogs are managed by data stewards.

Top 10 Data Catalog Tools

Here are the top 10 data catalog tools according to G2, as of Q1 2022:

1. AWS

  • Product Name: AWS Glue
  • Product Description: AWS Glue is a serverless data integration service for discovering, preparing, and combining data for analytics, machine learning and application development. Data engineers and ETL developers can visually create, run, and monitor ETL workflows. Data analysts and data scientists can enrich, clean, and normalize data without writing code. Application developers can use familiar Structured Query Language (SQL) to combine and replicate data across different data stores.

2. Aginity

  • Product Name: Aginity
  • Product Description: Aginity provides a SQL coding solution for data analysts, data engineers, and data scientists so they can find, manage, govern, share and re-use SQL rather than recode it.

3. Alation

  • Product Name: Alation Data Catalog
  • Product Description: ​​Alation’s data catalog indexes a wide variety of data sources, including relational databases, cloud data lakes, and file systems using machine learning. Alation enables company-wide access to data and also surfaces recommendations, flags, and policies as data consumers query in a built-in SQL editor or search using natural language. Alation connects to a wide range of popular data sources and BI tools through APIs and an Open Connector SDK to streamline analytics.

4. Collibra

  • Product Name: Collibra Data Catalog
  • Product Description: Collibra ensures teams can quickly find, understand and access data across sources, business applications, BI, and data science tools in one central location. Features include out-of-the-box integrations for common data sources, business applications, BI and data science tools; machine learning-powered automation capabilities; automated relationship mapping; and data governance and privacy capabilities.

5. IBM

  • Product Name: IBM Watson Knowledge Catalog
  • Product Description:  A data catalog tool based on self-service discovery of data, models and more. The cloud-based enterprise metadata repository activates information for AI, machine learning (ML), and deep learning. IBM’s data catalog enables stakeholders to access, curate, categorize and share data, knowledge assets and their relationships, wherever they reside.

6. Appen

  • Product Name: Appen
  • Product Description: Appen provides a licensable data annotation platform for training data use cases in computer vision and natural language processing. In order to create training data, Appen collects and labels images, text, speech, audio, video, and other data. Its Smart Labeling and Pre-Labeling features that use Machine Learning ease human annotations.

7. Denodo

  • Product Name: Denodo
  • Product Description: Denodo provides data virtualization that enables access to the cloud, big data, and unstructured data sources in their original repositories. Denodo enables the building of customized data models for customers and supports multiple viewing formats.

8. Oracle

  • Product Name: Oracle Enterprise Metadata Management 
  • Product Description: Oracle Enterprise Metadata Management harvests metadata from Oracle and third-party data integrations, business intelligence, ETL, big data, database, and data warehousing technologies. It enables business reporting, versioning, and comparison of metadata models, metadata search and browsing, and data lineage and impact analysis reports.

9. Unifi

  • Product Name: Unifi Data Catalog
  • Product Description: A standalone Data Catalog with intuitive natural language search powered by AI, collaboration capabilities for crowd-sourced data quality, views of trusted data, and all fully governed by IT. The Unifi Data Catalog offers data source cataloging, search and discovery capabilities throughout all data locations and structures, auto-generated recommendations to view and explore data sets and similar data sets, integration to catalog Tableau metadata, and the ability to deconstruct TWBX files and see the full lineage of a data source to see how data sets were transformed.

10. BMC

  • Product Name: Catalog Manager for IMS
  • Product Description: A system database that stores metadata about databases and applications. Catalog Manager for IMS enables viewing IMS catalog content, reporting on the control block information in the IMS catalog, and creating jobs to do DBDGENs, PSBGENs, and ACBGENs to populate the catalog.

Data Lakes and Data Catalogs

A data catalog can organize and govern data that resides in repositories, data lakes, data warehouses, or other locations. A data catalog can help organize the unstructured data in the data lake, preventing it from turning into a “data swamp”. As a result, data scientists and data analysts can easily pull data from the lake, evaluate it and use it.

A Data Catalog and Databand

Databand is a proactive observability platform for monitoring and controlling data quality, as early as ingestion. By integrating Databand with your data catalog, you can gain extended lineage, and visualize and observe the data from its source and as it flows through the pipelines all the way to the assets the data catalog maps and governs. As a result, data scientists, engineers and other data professionals can see and understand the complete flow of data, end-to-end.

In addition, by integrating Databand with your data catalog, you can get proactive alerts any time your data quality is affected to increase governance and robustness. This is enabled through Databand’s data quality identification capabilities, combined with how data catalogs map assets to owners. Databand will communicate any data quality issues to the relevant data owners.

What is a Data Mesh Architecture?

Databand
2022-04-12 11:10:00

Intro to Data Mesh

A data mesh is a form of platform architecture. 

The goal of the data mesh in organizing a business’ platforms is to maximize the value of analytical data. This is done by minimizing the time needed to access quality data. A well-designed data mesh delivers cutting-edge efficiency, allowing researchers to quickly access data from any data accessible source within the data mesh system. The data mesh model may replace data lakes as the most popular way to store and retrieve data.

Three components support data mesh architecture: domain-supported data pipeline, data sources, and data infrastructure. There are layers of observability, data governance, and universal interoperability. 

Data mesh systems are useful for businesses with multiple data domains. 

Many companies have data stored in different databases and formats, causing research and analytics problems. Some companies have attempted to resolve these problems by creating a single data warehouse or central data lake and downloading all data to it. This approach creates its own problems, such as working with an inaccurate copy of the original data or with outdated information.

Data mesh can be quite useful for organizations that are expanding quickly and need scalability for their data storage.   

Data mesh architecture allows data access from a number of locations rather than one central data warehouse or data lake. 

(It should be noted that there are situations where it is completely appropriate to build a central data lake as an additional part of the data mesh system.)

The Data Mesh Philosophy

The primary goal of data mesh is to create a system that maximizes the value of analytical data. The data mesh philosophy embraces a constantly changing data landscape, including increasing sources of data, the ability to transform data from one format to another, and improving the response time to change. 

Four principles support the data mesh model:

  1. Federated computational governance.
  2. Domain-oriented and decentralized data ownership, as well as architecture. 
  3. A self-serve platform as part of the data infrastructure.
  4. Data-as-a-product rather than a by-product. 

Governance

Data mesh uses a system called federated computational governance. A federated model includes a cross-domain agreement describing which parts of the governance are managed by the data domains and which are handled by the provider. It is an autonomous system that is normally built and maintained by independent data teams for each domain. (Independent data teams can be made up of in-house staff or outside contractors). To get the maximum value, interoperability between data domains is a necessity. 

The “federation” is a group of people made up of domain owners and the data mesh provider. Using a framework of globalized rules, they decide how best to govern the data mesh system.

Ideally, the governance federation will establish a data governance program that is common for all the domain owners. Domain owners can still develop their own data governance program, but an agreement providing a base level of data quality for the group as a whole will provide more trustworthy distributed data.

Decentralized Data Ownership

The concept of decentralized data ownership describes an architectural model in which data is not owned by a specific domain (department or business partner) but is freely shared with other business domains. 

In the data mesh model, data is not owned or controlled by the people storing it – rather, it is stored and managed by the department or business partner, understanding that the data is meant to be shared. 

The goal of the department or partner storing the data should be to offer it in a way that is easy to access and easy to work with.

The Self-Service Platform

The data mesh self-service platform, part of the architectural design, supports functionality from storage and processing to the data catalog. The self-service platform is an essential feature. The host or provider should supply a development platform that domain engineers can use for integrating the platform into their domain. 

The model supports the use of autonomous domains. A “network” is a group of computers capable of communicating with each other and is needed to create a domain. A domain describes workstations, devices, computers, and database servers sharing data by way of network resources. 

The self-service platform must be domain-agnostic (capable of working with multiple data domains) for the system to work. This allows each domain to be customized as needed. Additionally, the domain’s data engineering teams have the freedom to develop and design solutions for their specific issues. This design provides both flexibility and efficiency.

According to Zhamak Dehghani, the creator of the data mesh model, useful features for the data catalog include:

  • Data governance and standardization
  • Encryption for the data, both at rest and in motion
  • Data discovery, catalog registration, and publishing
  • Data schema
  • Data production lineage
  • Data versioning
  • Data quality metrics
  • Data monitoring, alerting, and logging

Monolithic Data Architectures vs Data Mesh Architecture

A good example of monolithic data architectures is a relational database management system (RDBMS) using a SQL database. The word monolithic means “all in one piece” rather than “too large and unable to be changed.” The phrase ‘monolithic data architectures’ describes a database management system using a variety of integrated software programs that work together to process data. With this design, data is not typically available for sharing with other organizations.

On the other hand, data mesh promotes data democratization and data sharing by allowing data-driven consumers to access data across all associated organizations. This results in more businesses making a profit from the same data. 

A data mesh is decentralized and supports data owners sharing their data, being responsible for their own domains, and handling their own data products and pipelines. Sharing in the data mesh includes making data available in a user-friendly, easily consumable form.

The data mesh supports near-real-time data sharing because the data transmitted between domains uses a “change data capture” (CDC) mechanism.

Data-as-a-Product

The data-as-a-product principle is an important foundation of the data mesh model and is philosophically opposed to data silos. The data mesh philosophy supports sharing data, and the purpose of a data silo is to isolate data. Data silos can be avoided through the use of cross-domain governance (per the federation) and semantic linking of data.

Data-as-a-product (as opposed to data-as-a-service) is used for decision making, developing personalized products, and fraud detection. Data-as-a-service tends to focus more on insights and strategy. Features such as trustworthiness, discoverability, and understandability are necessary for data to be treated as a product.

Preventing Data Silos

Data mesh systems eliminate the use of data silos. Data silos are data collections within an organization that have become isolated. The data they contain is typically available to one department but cannot be accessed by other parts of the business. This undermines good decision-making.

Silos are dangerous because they limit management’s understanding of the business, effectively blocking useful information.

Improved Data Analytics

In the last decade, the use of data analytics has increased steadily. Consequently, businesses are continuously attempting to improve the quality of their data. The data mesh model offers improved data collection and a remarkably efficient way of storing and managing data. It offers clean, accurate data for data analytics.

Data Pipelines

Data pipelines are an important part of the data mesh architectural model. As organizations take on increasingly complex analytic projects, data pipelines can assist in supplying quality data.

The data mesh model supports the total customization of data pipelines.

A data pipeline is made up of a data source, a series of processing steps, and a destination. If the desired data is not located within the data platform, then it is collected at the beginning of the pipeline. After the collection, a number of steps are taken, with each step delivering an output that becomes the input for the next step. 

A data pipeline processes data between the initial ingestion source and the final destination. Steps that are common in a data pipeline include: 

  • Data transformation 
  • Filtering
  • Augmentation
  • Enrichment
  • Aggregating
  • Grouping
  • Running algorithms against the data

These pipeline steps can be performed in parallel or in a time-sliced fashion.
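
As a minimal, hypothetical sketch of these steps in Python with pandas (the column names and the conversion rate are made up), where each step’s output becomes the next step’s input:

import pandas as pd

# Hypothetical raw events arriving at the start of the pipeline.
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3, None],
    "country": ["US", "DE", "DE", "FR", "US"],
    "amount":  [10.0, 5.0, 7.5, 3.0, 99.0],
})

def filter_step(df):      # drop records that cannot be attributed to a user
    return df.dropna(subset=["user_id"])

def enrich_step(df):      # add a derived column (here, a simple currency conversion)
    return df.assign(amount_eur=df["amount"] * 0.92)

def aggregate_step(df):   # group and aggregate for the downstream consumer
    return df.groupby("country", as_index=False)["amount_eur"].sum()

# Each step's output becomes the next step's input.
result = aggregate_step(enrich_step(filter_step(raw)))
print(result)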

Data Catalogs

A data catalog is the organized inventory of data for an organization. Metadata is used to help businesses organize and manage their data. The data catalog also uses metadata to help with data discovery and data governance. Data catalogs scan metadata automatically, allowing the catalog’s data consumers to seek and find their data. This includes information about the data’s availability, quality, and freshness.

Part of a data catalog’s function is to serve different end-users (data analysts, data scientists, business analysts, etcetera) who probably have different goals. A good data catalog will be user-friendly and flexible enough to adapt to its end-user’s needs. 

As with the data pipeline, the data catalog supports data governance, offering a more thorough process. Data catalogs use a bottom-up approach to create an agile data governance program. People can use data catalogs to document legal obligations and track the life cycle of data.

Data Observability

Another benefit is data observability. It is a part of the data mesh architecture and part of its strategy. Data observability provides a pulse check on the data’s health and is also considered a best practice for businesses. Data observability uses various tools designed to manage and track an organization’s data reliability and quality.

Databand offers a proactive data observability platform that integrates into the data mesh architecture. The platform allows users to identify anomalies and see trends in the pipeline metadata. It can profile column statistics and explain the causes of unreliable data and its impact.

What is a Modern Data Platform? Understanding the Key Components

Databand
2022-04-06 13:48:03

A modern data platform should provide a complete solution for the processing, analyzing, and presentation of data. It is built as a cloud-first, cloud-native platform, and, normally, can be set up within a few hours. A modern data platform is supported not only by technology, but also by the Agile, DevOps, and DataOps philosophies.

Currently, data lakes and data warehouses are popular storage systems, but each comes with some limitations.

Data lakehouses and data mesh storage systems are two new systems attempting to overcome those limitations, and are showing signs of gaining popularity.

The modern data platform typically includes six foundational layers guided by principles of elasticity and availability.

Data Platform

The Philosophies

DevOps and DataOps have two entirely different purposes, but both are similar to the Agile philosophy, which is designed to accelerate project work cycles.

DevOps is focused on product development, while DataOps focuses on creating and maintaining a distributed data architecture system with the goal of creating business value from data.

Agile is a philosophy for software development that promotes speed and efficiency, but without eliminating the “human” factor. It places an emphasis on face-to-face conversations as a way to maximize communications and emphasizes automation as a way to minimize errors.

Data Ingestion

The process of placing data into a storage system for future use is called data ingestion. In simple terms, data ingestion means moving data taken from other sources to a central location. From there the data can be used for record-keeping purposes, or for further processing and analysis. Both analytics systems and downstream reporting rely on accessible, consistent, and accurate data.

Data Ingestion

Organizations make business decisions using the data from their analytics infrastructure. The value of their data is dependent on how well it is ingested and integrated. If there are problems during the ingestion process, such as missing data, every step of the analytics process will suffer.

Batch processing vs stream processing

Ingesting data can be done in different ways, and the way a particular data ingestion layer is designed can be based on different processing models. Data can come from a variety of distinct sources, ranging from SaaS platforms to the internet of things to mobile devices. A good ingestion model acts as a foundation for an efficient data strategy, and organizations normally choose the model best-suited for the circumstances.

Batch processing is the most common form of data ingestion, but it is not designed for real-time use cases. Instead, it collects and groups source data into batches, which are sent to the destination.

Batch processing may be initiated using a simple schedule, or it may be activated when certain conditions exist.  It is often used when the use of real-time data is not needed, as it is usually easier and less expensive than streaming ingestion.

Real-time processing (also referred to as streaming or stream processing) does not group data. Instead, data is obtained, transformed, and loaded as soon as it is recognized. Real-time processing is more expensive because it requires constant monitoring of data sources and accepts new information automatically. 

Data Pipelines

Modern data ingestion models, until recently, used an ETL (extract, transform, load) procedure to take data from its source, reformat it, and then transport it to its destination. This made sense when businesses had to use expensive in-house analytics systems, where doing the prep work, including transformations, before delivery lowered costs.

That situation has changed, and more updated cloud data warehouses (Snowflake, Google BigQuery, Microsoft Azure, and others) can now cost-effectively scale their computing and storage resources. These improvements allow the preload transformation steps to be dropped, with raw data being delivered to the data warehouse.

At this point, the data can be translated into an SQL format, and then run within the data warehouse during research. This new processing arrangement has changed ETL to ELT (extract, load, transform). 

Instead of extracting the data and then transforming it, with ELT data is transformed “after” it is in the cloud’s data warehouse.
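
A minimal sketch of the ELT pattern, using Python’s built-in sqlite3 as a stand-in for a cloud data warehouse (the table names are hypothetical):

import sqlite3

# SQLite stands in for a cloud data warehouse in this sketch.
warehouse = sqlite3.connect(":memory:")

# Extract + Load: raw records land in the warehouse untransformed.
warehouse.execute("CREATE TABLE raw_orders (order_id INTEGER, country TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "US", 10.0), (2, "DE", 5.0), (3, "DE", 7.5), (4, "FR", 3.0)],
)

# Transform: runs inside the warehouse after loading, typically via SQL (or a tool like dbt).
warehouse.execute("""
    CREATE TABLE revenue_by_country AS
    SELECT country, SUM(amount) AS total_amount
    FROM raw_orders
    GROUP BY country
""")

print(warehouse.execute("SELECT * FROM revenue_by_country ORDER BY country").fetchall())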

Data Transformation

Data transformation deals with changing the values, structure, and format of data. This is often necessary for data analytics projects. Data can be transformed during one of two stages when using a data pipeline, before arriving at its storage destination, or after. Organizations still using on-premises data warehouses will normally use an ETL process.

Today, many organizations are using cloud-based data warehouses. These can scale computing and storage resources as needed. The ability of the cloud to scale allows businesses to bypass the preload transformations and send raw data into the data warehouse. The data is transformed after arriving, using an ELT process, typically when answering a query. 

There are various advantages to transforming data:

  • Usability – Too many organizations sit on a bunch of unusable, unanalyzed data. Standardizing data and putting it under the right structure allows your data team to generate business value out of it.
  • Data quality – Raw data often contains missing values, poorly formatted variables, null rows, etcetera; transformation can be used to detect and “improve” these quality issues. 
  • Better organization – Transformed data is easier to process for both people and computers.

Data Storage and Processing

Currently, the two most popular storage formats are data warehouses and data lakes. And then there are two storage formats that are gaining in popularity — the data lakehouse and data mesh. Modern data storage systems are focused on using data efficiently. 

Data Storage and Processing

The Data Warehouse

Cloud-based data warehouses have been the preferred data storage system for a number of years because they can optimize computing power and processing speeds. They were developed much earlier than data lakes and can be traced back to the 1990s when databases were used for storage. The early versions of data warehouses were in-house and had very limited storage capacity. In 2013, many data warehouses shifted to the cloud and gained scalable storage. 

The Data Lake

Data lakes were originally built on Hadoop, were scalable, and were designed for on-premises use. In January 2008, Hadoop became a top-level open-source project at the Apache Software Foundation, with Yahoo as a major contributor. Unfortunately, the Hadoop ecosystem is extremely complex and difficult to work with. Data lakes began shifting to the cloud around 2015, making them much less expensive, and much more user-friendly.

Using a combination of data lakes and data warehouses to minimize their limitations has become a common practice. 

The Data Lakehouse 

Data lakes have problems with “parsing data.” They were originally designed to collect data in its natural format, without enforcing schema (formats), so that researchers could gain more insights from a broad range of data. Unfortunately, data lakes can become data swamps, filled with old, inaccurate, and useless information, making them much less effective.

Data warehouses are designed for managing structured data with clear and defined use cases. 

For the data warehouse to function properly, the data must be collected, reformatted, cleaned, and uploaded to the warehouse. Some data, which cannot be reformatted, may be lost. 

The data lakehouse has been designed to merge the strengths of data warehouses and lakes. 

Data lakehouses are a new form of data management architecture. They merge the flexibility, cost-efficiency, and scaling abilities of data lakes with the ACID transactions and data management features of data warehouses. 

Data lakehouses support business intelligence and machine learning. One of the data lakehouse’s strengths is its use of metadata layers. It also uses a new query engine, designed for high-performance SQL searches.

Data Mesh

Data mesh can be quite useful for organizations that are expanding quickly and need scalability for their data storage. 

Data mesh, unlike data warehouses, lakes, and lakehouses, is “decentralized.” Decentralized data ownership is an architectural model where a specific domain (business partners or other departments) does not own their data, but shares data freely with other domains. 

Data is not owned in the data mesh model. It is not owned by the people storing it — but they are responsible for it. The data is stored and organized by the business partner or department, with the knowledge the data is to be shared. This means all data within the data mesh system should maintain a uniform format.

Data mesh systems can be useful for businesses supporting multiple data domains. Within the data mesh design, there is a data governance layer and a layer of observability. There is also a universal interoperability layer. 

Data Observability

Data observability has recently become a hot topic. Data observability describes the ability to watch and observe the state of data and its health. It covers a number of activities and technologies that, when combined, allow the user to identify and resolve data difficulties in near real-time.

Data observability platforms can be used with data warehouses, data lakes, data lakehouses, and data mesh. 

It should be noted that Databand has developed what is called a proactive data observability platform capable of catching bad data before it causes damage. 

Observability allows teams to answer specific questions about what is taking place behind the scenes in extremely distributed systems. Observability can show where data is moving slowly and what is broken.

Managers and/or teams can be sent alerts about potential problems and proactively solve them. (While the predictability feature can be helpful, it will not catch all problems, nor should it be expected to. Think of problem predictions as helpful, but not a guarantee.) 

To make data observability useful, it needs to include these features:

  • SLA Tracking – This feature measures pipeline metadata and data quality against pre-defined standards.
  • Monitoring – A dashboard is provided, showing the operations of your system or pipeline.
  • Logging – Historical records (tracking, comparisons, analysis) of events are kept for comparison with newly discovered anomalies.
  • Alerting – Warnings are sent out for both anomalies and expected events.
  • Analysis – An automated detection process that adapts to your system.
  • Tracking –  Offers the ability to track specific events.
  • Comparisons – Provides a historical background, and anomaly alerts.

For many organizations, observability is siloed, meaning only certain departments can access the data. (This “should not” happen in a data mesh system, which philosophically requires the data to be shared, and is generally discouraged in most storage and processing systems.) Teams collect metadata on the pipelines they own. 

Business Intelligence & Analytics

In 1865, the phrase ‘Business Intelligence’ was used in the Cyclopædia of Commercial and Business Anecdotes. This described how Sir Henry Furnese (who was a banker) profited from the information he gathered, and how he used it before his competition.

Currently, a great deal of business information is gathered from business analytics, as well as data analytics. Analytics is used to generate business intelligence by transforming data into understandable insights which can help to make tactical and strategic business decisions. Business intelligence tools can be used to access and analyze data, providing researchers with detailed intelligence.

Data Discovery

Data discovery involves collecting and evaluating data from different sources. It is often used to gain an understanding of the trends and patterns found in the data. Data discovery is sometimes associated with business intelligence because it can bring together siloed data for analysis. 

Data discovery includes connecting a variety of data sources. It can clean and prepare data, and perform analytics. Inaccessible data is essentially useless data, and data discovery makes it useful. 

Data discovery is about exploring data with visual tools which can help business leaders detect new patterns and anomalies.

What’s Coming Next?

If you search for “Modern Data Platform Trends” in Google, you’ll see many articles discussing trends on what’s next for the data platform. Topics like metadata management, building a metrics layer, and reverse ETL are getting a lot of focus.

However, the trend of data observability seems universally pervasive in all these articles. Data-driven companies can’t afford to constantly question whether or not the data they consume is reliable and trustworthy.