What is Data Lineage?

Databand
2022-07-28 10:20:00

The term “data lineage” has been thrown around a lot over the last few years.

What started as an idea for connecting datasets quickly became a confusing term that now gets misused often.

It’s time to put order to the chaos and dig deep into what it really is. Because the answer matters quite a lot. And getting it right matters even more to data organizations. 

This article will unpack everything you need to know about data lineage, including:

  • What is it?
  • What’s the difference between data lineage and data provenance?
  • Why is it important?
  • What are common data lineage techniques?
  • What are data lineage best practices?
  • What is end-to-end lineage vs. data at rest lineage?
  • What are the benefits of end-to-end data lineage?
  • What should you look for in a data lineage tool?

What is data lineage?

Data lineage tracks data throughout its complete lifecycle. It follows data from its source to its end location and notes any changes (including what changed, why it changed, and how it changed) along the way. And it does all of this visually.

Usually, it provides value in two key areas:

  1. Development process: Knowing what affects what and what could be the impact of making changes. 
  2. Debugging process: Understanding the severity, impact, and root cause of issues.

In general, it makes it possible to identify errors in data, reduce the risk associated with system and process changes, and increase trust in data. All of these are essential at a time when data plays such an integral role in business outcomes and decision-making.

Data lineage in action: A simplified example

When data engineers talk about data lineage, they often imagine a data observability platform that lets them understand the logical relationships between datasets that affect each other in a specific business flow.

[Diagram: a simplified data lineage example]

In this very simplified example, we can see an ELT flow (a code sketch of it follows the list):

  • Some pipeline tasks, probably run by Airflow, scrape external data sources and collect data from them.
  • Those tasks save the extracted data in the data lake (or warehouse or lakehouse).
  • Other tasks, probably SQL jobs orchestrated with DBT, run transformations on the loaded data. They query raw data tables, enrich them, join tables, and create business data – all ready to be used.
  • Dashboarding tools such as Tableau, Looker, or Power BI sit on top of the business data and provide visibility to multiple stakeholders.
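
To make the flow above concrete, here is a minimal sketch of how such an ELT might be expressed as an Airflow DAG. The source URL, file path, and dbt project directory are hypothetical placeholders, and a real pipeline would land data in a lake rather than a local file.

```python
import json
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_listings() -> None:
    """Scrape a (hypothetical) external source and land the raw payload."""
    payload = requests.get("https://api.example.com/listings", timeout=30).json()
    # A real task would write to the lake (S3/GCS/ADLS); a local file keeps the sketch simple.
    with open("/tmp/raw_listings.json", "w") as f:
        json.dump(payload, f)


with DAG(
    dag_id="example_elt",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_listings", python_callable=extract_listings)

    # SQL transformations on the loaded raw tables, delegated to dbt.
    transform = BashOperator(task_id="dbt_run", bash_command="dbt run --project-dir /opt/dbt")

    # Dashboards (Tableau, Looker, Power BI) read the business tables dbt produces.
    extract >> transform
```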

What’s the difference between data lineage and data provenance?

Data lineage and data provenance are often viewed as one and the same. While the two are closely related, there is a difference.

Whereas data lineage tracks data throughout the complete lifecycle, data provenance zooms in on the data origin. It provides insight into where data comes from and how it gets created by looking at important details like inputs, entities, systems, and processes for the data.

Data provenance can support error tracking within data lineage and can also help validate data quality.

Why is it important?

As businesses use more big data in more ways, having confidence in that data becomes increasingly important – just look at Elon Musk’s deal to buy Twitter for an example of trust in data gone wrong. Consumers of that data need to be able to trust in its completeness and accuracy and receive insights in a timely manner. This is where data lineage comes into play.

Data lineage instills this confidence by providing clear information about data origin and how data has moved and changed since then. In particular, it is important to key activities like:

  • Data governance: Understanding the details of who has viewed or touched data and how and when it was changed throughout its lifecycle is essential to good data governance. Data lineage provides that understanding to support everything from regulatory compliance to risk management around data breaches. This visibility also helps ensure data is handled in accordance with company policies.
  • Data science and data analytics: Data science and data analytics are critical functions for organizations that are using data within their business models, and powering strong data science and analytics programs requires a deep understanding of data. Once again, data lineage offers the necessary transparency into the data lifecycle to allow data scientists and analysts to work with the data and identify its evolutions over time. For instance, data lineage can help train (or re-train) data science models based on new data patterns.
  • IT operations: If teams need to introduce new software development processes, update business processes, or adjust data integrations, understanding any impact to data along the way –  as well as where data might need to come from to support those processes – is essential. Data lineage not only delivers this visibility, but it can also reduce manual processes associated with teams tracking down this information or working through data silos.
  • Strategic decision making: Any organization that relies on data to power strategic business decisions must have complete trust that the data they’re using is accurate, complete, and timely. Data lineage can help instill that confidence by showing a clear picture of where data has come from and what happened to it as it moved from one point to another.
  • Diagnosing issues: Should issues arise with data in any way, teams need to be able to identify the cause of the problem quickly so that they can fix it. The visibility provided by data lineage can help make this possible by allowing teams to visualize the path data has taken, including who has touched it and how and when it changed.

What are common techniques?

There are several commonly used techniques for data lineage that collect and store information about data throughout its lifecycle to allow for a visual representation. These techniques include:

  • Pattern-based lineage: Evaluates metadata for patterns in tables, columns, and reports rather than relying on any code to perform data lineage. This technique focuses directly on the data (vs. algorithms), making it technology-agnostic; however, it is not always the most accurate technique.
  • Self-contained lineage: Tracks data movement and changes in a centralized system, like a data lake that contains data throughout its entire lifecycle. While this technique eliminates the need for any additional tools, it does have a major blind spot to anything that occurs outside of the environment at hand.
  • Lineage by data tagging: A transformation engine that tags every movement or change in data allows for lineage by data tagging. The system can then read those tags to visualize the data lineage. Similar to self-contained lineage, this technique only works for contained systems, as the tool used to create the tags will only be able to look within a single environment.
  • Lineage by parsing: An advanced form of data lineage that reads the logic used to process data. Specifically, it provides end-to-end tracing by reverse engineering data transformation logic. This technique can get complicated quickly, as it requires an understanding of all the programming logic used throughout the data lifecycle (e.g., SQL, ETL, Java, XML); a toy parsing sketch follows this list.
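
As a rough illustration of lineage by parsing, the toy sketch below pulls source and target tables out of a single SQL statement with a naive regular expression; real parsers work over the full SQL grammar and query logs, so treat this only as a sketch of the idea.

```python
import re

def parse_sql_lineage(sql: str) -> dict:
    """Naively extract the target table and source tables from one SQL statement."""
    targets = re.findall(r"insert\s+into\s+([\w.]+)|create\s+table\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    target = next((t for pair in targets for t in pair if t), None)
    return {"target": target, "sources": sorted(set(sources))}

sql = """
INSERT INTO analytics.daily_revenue
SELECT o.day, SUM(o.amount)
FROM raw.orders o JOIN raw.refunds r ON o.id = r.order_id
GROUP BY o.day
"""
print(parse_sql_lineage(sql))
# {'target': 'analytics.daily_revenue', 'sources': ['raw.orders', 'raw.refunds']}
```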

What are data lineage best practices?

When it comes to introducing and managing data lineage, there are several best practices to keep in mind:

  • Automate data lineage extraction: Manual data lineage centered around spreadsheets is no longer an option. Capturing the dynamic nature of data in today’s business environments requires an automated solution that can keep up with the pace of data and reduce the errors associated with manual processes.
  • Bring metadata source into data lineage: Systems that handle data, like ETL software and database management tools, all create metadata – or data about the data they handle (meta, right?). Bringing this metadata source into data lineage is critical to gaining visibility into how data was used or changed and where it’s been throughout its lifecycle.
  • Communicate with metadata source owners: Staying in close communication with the teams that own metadata management tools is critical. This communication allows for verification of metadata (including its timeliness and accuracy) with the teams that know it best.
  • Progressively extract metadata and lineage: Progressive extraction – or extracting metadata and lineage in the same order as it moves through systems – makes it easier to do activities like mapping relationships, connections, and dependencies across the data and systems involved.
  • Progressively validate end-to-end data lineage: Validating data lineage is important to make sure everything is running as it should. Doing this validation progressively by starting with high-level system connections, moving to connected datasets, then elements, and finishing off with transformation documentation simplifies the process and allows it to flow more logically.
  • Introduce a data catalog: Data catalog software makes it possible to collect data lineage across sources and extract metadata, allowing for end-to-end data lineage.

What is end-to-end lineage vs. data at rest lineage?

When talking about lineage, most conversations tackle the scenario of data “in-the-warehouse,” which presumes everything is occurring in a contained data warehouse or data lake. In these cases, lineage monitors the data operations performed on specific or multiple tables to extract the relationships within or among them.

At Databand, we refer to this as “data at rest lineage,” since it observes the data after it was already loaded into the warehouse.

This data at rest lineage can be troublesome for modern data organizations, which typically have a variety of stakeholders (think: data scientist, analyst, end customer), each of which has very specific outcomes they’re optimizing toward. As a result, they each have different technologies, processes, and priorities and are usually siloed from one another. Data at rest lineage that looks at data within a specific data warehouse or data lake typically doesn’t work across these silos or data integrations.

Instead, what organizations need is end-to-end data lineage, which looks at how data moves across data warehouses and data lakes to show the true, complete picture.

Consider the case of a data engineer who owns end-to-end processes within dozens of DAGs in different technologies. If that engineer encounters corrupted data, they want to know the root cause. They want to be able to proactively catch issues before they land on business dashboards and to track the health of the different sources on which they rely. Essentially, they want to be able to monitor the real flow of the data.

With this type of end-to-end lineage, they could see that a SQL query has introduced corrupted data to a column in a different table or that a DBT test failure has affected other analysts’ dashboards. In doing so, end-to-end lineage captures data in motion, resulting in a visual similar to the following:

[Diagram: end-to-end data lineage capturing data in motion]

What are the benefits of end-to-end data lineage?

Modern organizations need true end-to-end lineage because it’s no longer enough just to monitor a small part of the pipeline. While data at rest lineage is easy to integrate, it provides very low observability across the entire system.

Additionally, data at rest lineage is limited across development languages and technologies. If everything is SQL-based, that’s one thing. But the reality is, modern data teams will use a variety of languages and technologies for different needs that don’t get covered with the more siloed approach.

As if that wasn’t enough, most of the issues with data happen before it ever reaches the data warehouse, but data at rest lineage won’t capture those issues. If teams did have that visibility though, they could catch issues sooner and proactively protect business data from corruption.

End-to-end data lineage solves these challenges and delivers several notable benefits, including:

  • Clear visibility on impact: If there’s a schema change in the external API from which Python fetches data, teams need true end-to-end visibility to know which business dashboard will be affected. Gaining that visibility requires understanding the path of data in motion across environments and systems – something only end-to-end data lineage that tracks data in motion can provide.
  • Understanding of root cause: By the time an issue hits a table used by analysts, the problem is already well underway, stemming from further back in the data lifecycle. With data at rest lineage, it’s only possible to see what’s happening in that particular table, though – which isn’t helpful for identifying the cause of the issue. End-to-end lineage, on the other hand, can look across the complete lifecycle to provide clarity into the root cause of issues, wherever they turn up.
  • Ability to connect between pipelines and datasets: In a very complex environment where thousands of pipelines (or more!) are writing and reading data from thousands of datasets, the ability to identify which pipeline is working on a weekly, daily, or hourly basis and with which tables (or even specific columns within tables) is a true game-changer; a minimal impact-analysis sketch follows this list.
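
To make the pipeline-to-dataset connection concrete, here is a minimal impact-analysis sketch over a lineage graph, assuming the graph has already been extracted into an adjacency mapping; the asset names are hypothetical.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets that consume it downstream.
LINEAGE = {
    "raw.orders": ["pipeline.enrich_orders"],
    "pipeline.enrich_orders": ["analytics.daily_revenue"],
    "analytics.daily_revenue": ["dashboard.revenue", "ml.churn_model"],
}

def downstream_impact(asset: str) -> set:
    """Return every pipeline, table, and dashboard reachable downstream of `asset`."""
    impacted, queue = set(), deque([asset])
    while queue:
        for consumer in LINEAGE.get(queue.popleft(), []):
            if consumer not in impacted:
                impacted.add(consumer)
                queue.append(consumer)
    return impacted

print(downstream_impact("raw.orders"))
# all four downstream assets (set ordering may vary)
```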

What should you look for in a data lineage tool?

As data lineage becomes increasingly important, what should you look for in a data lineage tool?

Above all else, you need a tool that can power end-to-end data lineage (vs. data at rest lineage). You also need a solution that can automate the process, as manual data lineage simply won’t cut it anymore.

With those prerequisites in mind, other capabilities to consider when evaluating a data lineage tool include:

  • Alerts: Automated alerts should allow you to not just identify that an incident has occurred, but gain context on that incident before jumping into the details. This context might include high-level details like the data pipeline experiencing an issue and the severity of the issue.
  • View of affected datasets: The ability to see all of the datasets impacted by a particular issue in a single, bird’s-eye view is helpful for understanding the effect on operations and the severity of the issue.
  • Visual of data lineage: Visualizing data lineage by seeing a graph of relationships between the data pipeline experiencing the issue and its dependencies allows you to gain a deeper understanding of what’s happening and what’s affected as a result. The ability to click into tasks and see the dependencies and impact to each one for a given task provides even more clarity when it comes to issue resolution.
  • Debugging within tasks: Finally, the ability to see specific errors within specific tasks allows for quick debugging of issues for faster resolution.

Getting it right

Data lineage isn’t a new concept, but it is one that’s often misunderstood. However, as data becomes more critical to more areas of business, getting it right is increasingly important.

It requires an understanding of exactly what data lineage is and why it’s so important. Additionally, it requires a thoughtful approach to addressing data lineage that matches the needs of a modern data organization – which means true end-to-end data lineage. And finally, it requires the right tool to support this end-to-end lineage in an automated way.


What is Data Governance and Where Observability Fits In

Databand
2022-07-25 11:18:47

Data is the most valuable asset for most businesses today. Or at least it has the potential to be. But to realize the full value, organizations must manage their data correctly. This management covers everything from how it’s collected to how it’s maintained and analyzed. And a big component of that is data governance.

Data governance refers to the policies, processes, roles, and technology that businesses use to ensure data availability, usability, integrity, and security. This article will explore everything you need to know about data governance, including:

  • What is it?
  • What’s the difference between data governance vs. data management?
  • Why is it important?
  • What are the components of effective data governance?
  • What are the key roles involved in it?
  • What are its best practices?

What is Data Governance?

Data governance is a core component of any big data management strategy that organizations introduce to drive insights. Effective data governance ensures quality and consistency in the data used to power critical business decisions.

At a high level, it can refer to data roles and responsibilities, data accessibility, data policies and processes, data creation procedures, data flows, and more. Digging deeper, it defines the architecture for decision-making and access rights around data, answering questions like:

  • How do we define data?
  • Where does data come from?
  • How do we confirm the quality of data?
  • How do we use data?
  • Where do we store data?
  • How do we protect data?
  • How do we organize data?
  • How do we connect data across systems?
  • How do we maintain a current inventory of our data?
  • How accurate does that data inventory need to be?

Software for data governance can either be purpose-built or baked into applications that make up the modern data stack.

What’s the Difference Between Data Governance vs. Data Management?

Data governance and data management are often used interchangeably; however, the two terms refer to different practices.

Data governance sets the strategy by introducing policies and procedures throughout the data lifecycle. Meanwhile, data management is the practice of enforcing those policies and procedures so that the data is ready for use.

In short, data governance is the cornerstone of all data management initiatives.

Why is Data Governance Important?

In today’s data-driven world, organizations need effective data governance to be able to trust in the quality and consistency of their data. 

A strong approach to data governance benefits the entire organization by giving individuals a clear way to access data, shared terminology to discuss data, and a standard way to understand data and make it meaningful.

Some of the key benefits of data governance include:

  • Introducing a clear data quality framework to bring together data and create a shared understanding for better insights and decisions
  • Improving consistency of data across systems and processes, for efficient data integration
  • Clearly defining policies and procedures around data-related activities to ensure standardization across the entire organization
  • Outlining roles and responsibilities in terms of data management and data access for clarity among stakeholders
  • Improving compliance by allowing for faster response and resolution to data incidents

On the flip side, poor data governance can hamper regulatory compliance initiatives, which can create problems for companies when it comes to satisfying new data privacy and protection laws.

What are the Components of Effective Data Governance?

In order for data governance to be effective, it must encompass several key components that support the follow-on data management activities. These components include:

Data Standards

Data governance should set explicit data standards for consistency across the entire organization. These standards should assess and verify data quality and should be transparent to everyone in the company. As a result, they should help teams better comprehend and use data.

Data standards should also allow any third-party auditors to easily see how the organization handles sensitive data, how that data gets used, and why it gets used in that way. This transparency is essential for compliance, especially in the case of a data breach.

Data Integration

Data integration brings together data from diverse sources to make data more readily available and power deeper insights. Good data governance requires a complete understanding of how data gets integrated across systems and processes. Specifically, the data governance program should define the tools, policies, and procedures used to pass data across systems and combine information.

As a best practice, these data integration guidelines should be clear and easy to follow to ensure every new system adheres to them. Additionally, the team responsible for data governance should assist in reviewing these guidelines during any new technology implementations.

Data Security

Protecting the security of data is essential, as any unauthorized access to data or even loss of data can pose serious risks – from dangers to the subjects of data to financial loss to reputational damage. A data governance framework outlines a variety of elements related to data security, including where data is stored, how it’s accessed, and what level of availability it has.

Specifically, it should detail defenses like authentication tools and encryption algorithms that need to be implemented to protect the data network. Then, any teams working on data governance should partner closely with IT security to ensure adequate protection measures are in place based on those guidelines.

Data Lifecycle Management

Understanding the organization’s data lifecycle means knowing where data resides at any given time as it moves through systems until it eventually gets discarded. Good data governance allows you to quickly discover and isolate data at any point in the lifecycle.

This concept, also known as data lineage, allows analysts to trace data back to its source to confirm trustworthiness.

Data Observability

Data observability allows you to understand the health and state of data in your system to identify and resolve issues in near real-time. It includes a variety of activities that go beyond just describing the problem, providing context to also resolve the problem and work to prevent it from recurring.

Data governance helps set the framework for data observability, setting guidelines for what to monitor and when and what thresholds should set off alerts when something isn’t right. A good data observability platform can handle these activities, making it important to choose a platform that can meet the requirements for identifying, troubleshooting, and resolving problems outlined in your strategy. 

Metadata Management

Another critical component of data governance is metadata management, which focuses on maintaining consistent definitions of data across systems. This consistency is important to ensure data flows smoothly across integrated solutions and that everyone has a shared understanding of the data.

The framework should include details on data definition, data security, data usage, and data lineage. In doing so, it should make it possible to clearly identify and classify all types of data in a standardized way across the organization.

Data Stewardship

Data stewardship is the practice that guarantees your organization’s data is accessible, usable, secure, and trustworthy. While the data governance strategy determines your organization’s goals, risk tolerance, security standards, and strategic data needs to set high-level policies, data stewardship focuses on making sure those policies get implemented.

To achieve this follow-through, data stewardship assigns clear roles and responsibilities for various initiatives outlined in the strategy. 

What are the Key Roles Involved in Data Governance?

Data governance programs can only succeed if they have clearly defined roles and responsibilities. As a result, it’s important to identify the right people within your organization to take on this ownership and establish their roles in the program.

Specifically, every data governance program requires people in three critical roles, each of which must be filled with qualified individuals who understand their specific responsibilities and how they contribute to the bigger picture. These roles include:

Chief Data Officer

The Chief Data Officer is the data governance leader. This person is responsible for overseeing the entire program, including enforcing and implementing all policies and procedures and leading the data committee and data stewards.

Data Committee

The data committee is a group of individuals that sets data governance policies and procedures, including rules for how data gets used and who can access it. They also resolve any disputes that arise regarding data usage or its role within the organization. The committee’s purpose is to promote data quality and ensure that data owners and data stewards have what they need at every point in the data lifecycle to do their jobs effectively.

Data Stewards

The data stewards are responsible for carrying out the data governance policies set by the data committee. They oversee data, making sure everything adheres to policies throughout the entire data lifecycle from creation to archival. The data stewards also train new staff on policies. 

In some cases, data stewards might also be the data owners. In other cases, those might be two separate groups. Either way, the data owners are the people who manage the systems that create and house data.

What are Data Governance Best Practices?

When it comes to getting data governance off the ground (or improving what your organization already has in place) there are several best practices to consider:

Get Buy-In from the Top

As with any initiative, buy-in for data governance needs to start at the top. This top-down buy-in is important to make sure that everyone in the organization adheres to data governance policies and that those who are in a position to influence that acceptance understand the importance of your work. 

To achieve this buy-in, share with executives how your data governance plan can help them realize their strategic objectives. The more you can highlight the advantages of the program and how it relates to their work, the easier it will be.

Communicate Often

Communication beyond top-level executives is essential to effective data governance. To ensure everyone is aware of what your team is doing around data governance and why it matters, make a list of everyone in the organization who has a stake in or would be affected by that work. 

Then establish regular communications to share updates about program changes, roadblocks, and successes. That way, everyone knows where to go for updates and can stay informed on a regular basis.

Combine Long-Term Goals with Short-Term Gains

When it comes to data governance, you won’t be able to tackle everything at once. Instead, it should be a continuous effort to support data-driven decision-making and open up new opportunities for people throughout the organization. 

As a result, your long-term plan needs to include smaller, short-term initiatives that you can weave into the day-to-day operations of your company for immediate wins. This approach ensures that you see progress quickly and can help uncover any potential roadblocks faster. It also opens the door to new ideas that can even improve your long-term plan.

Assign Clear Responsibility – and Train People Accordingly

You can’t simply assign someone the role of data steward and hope for the best. You need to make sure that anyone playing a role in your data governance program takes their part seriously, and that means you need to take their responsibilities just as seriously. 

This means being clear about the responsibilities that data stewards and data committee members take on and offering training to support those people in their data governance roles. This training should cover everything from why data governance is so important to what’s expected of people in different roles.

Audit Process Adoption

A big part of data governance involves developing processes for how the company will handle data, especially when it comes to sensitive information. Auditing how these processes actually operate in your organization and how well people are adopting them can be extremely informative as you continue to make program improvements.

That’s because even the best processes won’t do your organization any good if no one adheres to them.

Regularly Measure Progress and Keep an Eye Toward Improvements

Finally, remember that data governance is not a one-and-done effort. It’s a program that must continuously evolve based on factors like adoption and changing business needs. 

As a result, it’s important to regularly check in on how policies are faring and the impact on data quality. The more you can measure that progress, the better you can manage the situation and identify what’s working well and what needs to be improved.


The Data Value Chain: Data Observability’s Missing Link

Databand
2022-05-18 13:18:01

Data observability is an exploding category. It seems like there is always news of another data observability tool receiving funding, an existing tool announcing expanded functionality, or a new product in the category being dreamt up. After a bit of poking around, you’ll notice that many of them claim to do the same thing: end-to-end data observability. But what does that really mean, and what’s a data value chain?

For data analysts, end-to-end data observability feels like having monitoring capabilities for their warehouse tables  — and if they’re lucky, they have some monitoring for the pipelines that move the data to and from their warehouse as well.

The story is a lot more complicated for many other organizations that are more heavily skewed towards data engineering. For them, that isn’t end-to-end data observability. That’s “The End” data observability. Meaning: this level of observability only gives visibility into the very end of the data’s lifecycle. This is where the data value chain becomes an important concept.

For many data products, data quality is determined from the very beginning: when data is first extracted and enters your system. Therefore, shifting data observability left of the warehouse is the best way to move your data operations from a reactive data quality management framework to a proactive one.


What is the Data Value Chain?

When people think of data, they often think of it as a static object: a point on a chart, a number in a dashboard, or a value in a table. But the truth is data is constantly changing and transforming throughout its lifecycle. And that means what you define as “good data quality” is different for each stage of that lifecycle.

“Good” data quality in a warehouse might be defined by its uptime. Going to the preceding stage in the life cycle, that definition changes. Data quality might be defined by its freshness and format. Therefore, your data’s quality isn’t some static binary. It’s highly dependent on whether things went as expected in the preceding step of its lifecycle.

Shani Keynan, our Product Director, calls this concept the data value chain.

“From the time data is ingested, it’s moving and transforming. So, only looking at the data tables in your warehouse or your data’s source, or only looking at your data pipelines, it just doesn’t make a lot of sense. Looking only at one of those, you don’t have any context.

You need to look at the data’s entire journey. The thing is, when you’re a data-intensive company who’s using lots of external APIs and data sources, that’s a large part of the journey. The more external sources you have, the more vulnerable you are to changes you can’t predict or control. Covering the hard ground first, at the data’s extraction, makes it easier to catch and resolve problems faster since everything downstream depends on those deliveries.”

The question of whether data will drive value for your business is defined by a series of if-then statements (sketched in code after the list):

  1. If data has been ingested correctly from our data sources, then our data will be delivered to our lake as expected.
  2. If data is delivered & grouped in our lake as expected, then our data will be able to be aggregated & delivered to our data warehouse as expected.
  3. If data is aggregated & delivered to our data warehouse as expected, then the data in our warehouse can be transformed.
  4. If data in our warehouse can be transformed correctly, then our data will be able to be queried and will provide value for the business.
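
A minimal sketch of that if-then chain as sequential checkpoints, where each check function is a hypothetical stand-in for a real validation:

```python
# Each lambda below stands in for a real validation at that stage of the value chain.
def checks_pass(checks) -> bool:
    """Walk the value chain in order and stop at the first broken link."""
    for name, check in checks:
        if not check():
            print(f"Value chain broken at: {name}")
            return False
    return True

value_chain = [
    ("ingestion from sources", lambda: True),        # data landed in the lake as expected?
    ("grouping in the lake", lambda: True),           # grouped/partitioned as expected?
    ("aggregation into the warehouse", lambda: True), # aggregated & delivered as expected?
    ("warehouse transformations", lambda: True),      # transformed correctly?
]

if checks_pass(value_chain):
    print("Data is ready to be queried and can provide value downstream.")
```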

Let us be clear: this is an oversimplification of the data’s life cycle. That said, it illustrates how having observability only for the tables in your warehouse & the downstream pipelines leaves you in a position of blind faith.

In the ideal world, you would be able to set up monitoring capabilities & data health checkpoints everywhere in your system. This is no small project for most data-intensive organizations; some would even argue it’s impractical.

Realistically, one of the best places to start your observability initiative is at the beginning of the data value chain: the data extraction layer.

Data Value Chain + Shift-left Data Observability

If you are one of these data-driven organizations, how do you set your data team up for success?

While it’s important to have observability of the critical “checkpoints” within your system, the most important checkpoint you can have is at the data collection process. There are two reasons for that:

#1 – Ingesting data from external sources is one of the most vulnerable stages in your data model.

As a data engineer, you have some degree of control over your data & your architecture. But what you don’t control is your external data sources. When you have a data product that depends on external data arriving on time in order to function, that lack of control is an extremely painful experience.

This is best highlighted in an example. Let’s say you are running a large real estate platform called Willow. Willow is a marketplace where users can search for homes and apartments to buy & rent across the United States.

Willow’s goal is to give users all the information they need to make a buying decision; things like listing price, walkability scores, square footage, traffic scores, crime & safety ratings, school system ratings, etc.

In order to calculate “Traffic Score” for just one state in the US, Willow might need to ingest data from 3 external data sources. There are 50 states, so that means you suddenly have 150 external data sources you need to manage. And that’s just for one of your metrics.

Here’s where the pain comes in: You don’t control these sources. You don’t get a say whether they decide to change their API to better fit their data model. You don’t get to decide whether they drop a column from your dataset. You can’t control if they miss one of their data deliveries and leave you hanging.

All of these factors put your carefully crafted data model at risk. All of them can break your pipelines downstream that follow strictly coded logic. And there’s really nothing you can do about it except catching it as early as you can.

Having data observability in your data warehouse doesn’t do much to solve this problem. It might alert you that there is bad data in your warehouse, but by that point, it’s already too late.

This brings us to our next point…

#2 – It makes the most sense for your operational flow.

In many large data organizations, data in the warehouse is automatically fed into business processes. If something breaks your data collection processes, bad data gets populated into your product dashboards and analytics, and stakeholders have no way of knowing that the data they are being served is no good.

This can lead to some tangible losses. Imagine if there was a problem calculating a Comparative Analysis of home sale prices in the area. Users may lose trust in your data and stop using your product.

In this situation, what does your operational flow for incident management look like?

You receive some complaints from business stakeholders or customers, you have to invest a lot of engineering hours to perform root cause analysis, fix the issue, and backfill the data. All the while consumer trust has gone down, and SLAs have already been missed. DataOps is in a reactive position.


When you have data observability for your ingestion layer, there’s still a problem in this situation, but the way DataOps can handle this situation is very different:

  • You know that there will be a problem.     
  • You know exactly which data source is causing the problem.
  • You can project how this will affect downstream processes. You can make sure everyone downstream knows that there will be a problem so you can prevent the bad data from being used in the first place.
  • Most importantly, you can get started resolving the problem early & begin working on a way to prevent that from happening again.

You cannot achieve that level of prevention when your data observability starts at your warehouse.
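
One way to picture prevention at the ingestion layer: a minimal sketch that validates an external delivery before it is loaded, assuming hypothetical expected columns, a minimum row count, and a 24-hour delivery SLA.

```python
from datetime import datetime, timedelta

# Hypothetical expectations for one external traffic-score source.
EXPECTED_COLUMNS = {"listing_id", "traffic_score", "updated_at"}
MIN_ROWS = 1_000
MAX_AGE = timedelta(hours=24)

def validate_delivery(rows: list, delivered_at: datetime) -> list:
    """Return a list of problems found in an external delivery before loading it."""
    problems = []
    if datetime.utcnow() - delivered_at > MAX_AGE:
        problems.append("delivery is late (older than the 24h SLA)")
    if len(rows) < MIN_ROWS:
        problems.append(f"only {len(rows)} rows delivered, expected at least {MIN_ROWS}")
    if rows:
        missing = EXPECTED_COLUMNS - rows[0].keys()  # rows are dicts keyed by column name
        if missing:
            problems.append(f"schema change: missing columns {sorted(missing)}")
    return problems

# If the returned list is non-empty, alert and stop downstream tasks
# instead of loading bad data into the warehouse.
```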

Bottom Line: Time To Shift Left

DataOps is learning many of the same, hard lessons as DevOps has. Just as application observability is the most effective when shifted left, the same applies to data operations. It saves money; it saves time; it saves headaches. If you’re ingesting data from many external data sources, your organization cannot afford to focus all its efforts on the warehouse. You need real end-to-end data observability. And luckily, there’s a great data observability platform made to do just that.




Data Replication: The Basics, Risks, and Best Practices

Databand
2022-04-27 13:20:30

Data-driven organizations are poised for success. They can make more efficient and accurate decisions and their employees are not impeded by organizational silos or lack of information. Data replication enables leveraging data to its full extent. But how can organizations maximize the potential of data replication and make sure it helps them meet their goals? Read on for all the answers.

What is Data Replication?

Data replication is the process of copying or replicating data from the main organizational server or cloud instance to other cloud or on-premises instances at different locations. Thanks to data replication, organizational users can access the data they need for their work quickly and easily, wherever they are in the world. In addition, data replication ensures organizations have backups of their data, which is essential in case of an outage or disaster. In other words, data replication creates data availability at low latency.

Data replication can take place either synchronously or asynchronously. Synchronous replication means the data is copied to the main server and all replica servers at the same time. Asynchronous replication means data is first copied to the main server and only then copied to the replica servers, often at scheduled intervals.

Why Data Replication is Necessary

Data replication ensures that organizational data is always available to all stakeholders. By replicating data across instances, organizations can ensure:

Scalability

Data scalability is the ability to handle changing demands by continuously adapting resources. Replication of data across multiple servers builds scalability and ensures the availability of consistent data to all users at all times.

Disaster Protection

Electrical outages, cybersecurity attacks and natural disasters can cause systems and instances to crash and no longer be available. By replicating data across multiple instances, data is backed up and always accessible to any stakeholder. This ensures system robustness, organizational reliability and security.

Speed / Latency

Data that has to travel across the globe creates latency. This creates a poor user experience, which can be felt especially in real-time based applications like gaming or recommendation systems, or resource-heavy systems like design tools. By distributing the data globally it travels a shorter distance to the end user, which results in increased speed and performance.

Test System Performance

Distributing and synchronizing data across multiple test systems makes data more accessible, which improves test system performance.

An Example of Data Replication

Organizations that have multiple branch offices across a number of continents can benefit from data replication. If organizational data only resides on servers in Europe, users in Asia, North America, and South America will experience latency when attempting to read the data. But by replicating data across instances in San Francisco, São Paulo, New York, London, Berlin, Prague, Tel Aviv, Hyderabad, Singapore, and Melbourne, for example, access times improve significantly for all users.

Data Replication Variations

Types of Data Replication

Replication systems vary. Therefore, it is important to distinguish which type is a good fit for your organizational infrastructure needs and business goals. There are three main types of data replication systems:

Transactional Replication

Transactional replication consists of databases being copied in their entirety from the primary server (the publisher) to secondary servers (subscribers). Any data changes are consistently and continuously updated. Transactional consistency is ensured, which means that data is replicated in real time and sent from the primary server to the secondary servers in the order of its occurrence. As a result, transactional replication makes it easy to track changes and any lost data. This type of replication is commonly used in server-to-server environments.

Snapshot Replication

In the snapshot replication type, a snapshot of the database is distributed from the primary server to the secondary servers. Instead of continuous updates, data is sent as it exists at the time of the snapshot. It is recommended to use this type of replication when there are not many data changes or at the initial synchronization between the publisher and subscriber.

Merge Replication

A merge replication consists of two databases being combined into a single database. As a result, any changes to data can be updated from the publisher to the subscribers. This is a complex type of replication since both parties (the primary server and the secondary servers) can make changes to the data. It is recommended to use this type of replication in a server-to-client environment.

Comparison Table: Transactional Replication vs. Snapshot replication vs. Merge Replication

[Table: comparison of transactional, snapshot, and merge replication]

Schemes of Replication

Replication schemes are the operations and tasks required to perform replication. There are three main replication schemes organizations can choose from:

Full Replication

Full replication occurs when the entire database is copied to every site in the distributed system. This scheme improves data availability and accessibility through database redundancy. In addition, performance improves because the global distribution of data reduces latency and accelerates query execution. On the other hand, concurrency is difficult to achieve and update processes are slow.


Partial Replication

In a partial replication scheme, some sections of the database are replicated across some or all of the sites. The description of these fragments can be found in the replication schema. Partial replication enables prioritizing which data is important and should be replicated as well as distributing resources according to the needs of the field.


No Replication

In this scheme, data is stored on one site only. This enables easily recovering data and achieving concurrency. On the other hand, it negatively impacts availability and performance.


Techniques of Replication

Replicating data can take place through different techniques. These include:

Full-table Replication

In a full-table replication, all data is copied from the source to the destination. This includes new data, as well as existing data. It is recommended to use this technique if records are regularly deleted or if other techniques are technically impossible. On the other hand, this technique requires more processing and network resources and the cost is higher.

Key-based Replication

In key-based replication, only data that has been added since the previous update is copied. This technique is more efficient since fewer rows are copied. On the other hand, it does not capture records that were hard-deleted since the previous update.
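
A minimal sketch of key-based replication using an increasing id as the watermark; the table, columns, and use of sqlite3 as a stand-in for both databases are assumptions for illustration.

```python
import sqlite3

def replicate_incremental(source: sqlite3.Connection, replica: sqlite3.Connection,
                          last_key: int) -> int:
    """Copy only rows added since the previous update, identified by an increasing id."""
    rows = source.execute(
        "SELECT id, name, updated_at FROM customers WHERE id > ? ORDER BY id", (last_key,)
    ).fetchall()
    replica.executemany(
        "INSERT INTO customers (id, name, updated_at) VALUES (?, ?, ?)", rows
    )
    replica.commit()
    # Note: rows hard-deleted at the source since the last run are not captured.
    # Return the new watermark so the next run starts where this one stopped.
    return rows[-1][0] if rows else last_key
```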

Log-based Replication

Log-based replication replicates changes to the database from the database log file. It applies only to database sources and must be supported by the source database. This technique is recommended when the source database structure is static; otherwise, it might become a very resource-intensive process.

Cloud Migration + Data Replication

When organizations digitally transform their infrastructure and migrate to the cloud, data can be replicated to cloud instances. By replicating data to the cloud, organizations can enjoy its benefits: scalability, global accessibility, data availability and easier maintenance. This means organizational users benefit from data that is more accessible, usable and reliable, which eliminates internal silos and increases business agility.

Data Risks in the Replication Process

When replicating data to the cloud, it is important to monitor the process. The growing complexity of data systems as well as the increased physical distance between servers within a system could pose some risks.

These risks include:

Inconsistency

Data schema and data profiling anomalies, like null counts, type changes and skew.

Data Loss

Data not being fully migrated from the sources to the target instances.

Delays

Data not being successfully migrated on time.

Data Replication Management + Observability

By implementing a management system to oversee and monitor the replication process, organizations can significantly reduce the risks involved in data replication. A data observability platform will ensure:

  • Data is successfully replicated to other instances, including cloud instances
  • Replication and migration pipelines are performing as expected
  • Alerts are raised on broken pipelines or irregular data volumes so they can be fixed
  • Data is delivered on time 
  • Delivered data is reliable, so organizational stakeholders can use it for analytics

Monitoring

By monitoring the data pipelines that take part in the replication process, organizations and their DataOps engineer can ensure the data propagated through the pipeline is accurate, complete and reliable. This ensures data replicated to all instances can be reliably used by stakeholders. An effective monitoring system will be:

  • Granular – specifically indicating where the issue is
  • Persistent – following lineage to understand where errors began
  • Automated – reducing manual errors and enabling the use of thresholds
  • Ubiquitous – covering the pipeline end-to-end
  • Timely – enabling catching errors on time before they have an impact

Learn more about data monitoring here.

Tracking

Tracking pipelines enables systematic troubleshooting, so that any errors are identified and fixed on time. This ensures users constantly benefit from updated, reliable and healthy data in their analyses. There are various types of metadata that can be tracked, like task duration, task status, when data was updated, and more. By tracking and alerting (see below) in case of anomalies, DataOps engineers ensure data health.

Alerting

Alerting about data and data pipeline anomalies is an essential step that closes the observability loop. Alerting DataOps engineers gives them the opportunity to fix any data health issues that might affect data replication across various instances.

Within existing data systems, data engineers can trigger alerts for the following (a minimal wiring sketch follows this list):

  • Missed data deliveries
  • Schema changes that are unexpected
  • SLA misses
  • Anomalies in column-level statistics like nulls and distributions
  • Irregular data volumes and sizes
  • Pipeline failures, inefficiencies, and errors
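
As a rough illustration of how such alerts might be wired up, the sketch below evaluates a few simple rules against per-run dataset stats and posts any findings to a hypothetical webhook endpoint.

```python
import json
import urllib.request

ALERT_WEBHOOK = "https://hooks.example.com/data-alerts"  # hypothetical endpoint

def evaluate_alerts(stats: dict) -> list:
    """Turn per-run dataset stats into alert messages based on simple rules."""
    alerts = []
    if stats["row_count"] < stats["expected_min_rows"]:
        alerts.append(f"Irregular volume: only {stats['row_count']} rows written")
    if stats["missed_sla"]:
        alerts.append("SLA miss: data was not delivered on time")
    if stats["schema_changed"]:
        alerts.append("Unexpected schema change detected")
    return alerts

def send(alert: str) -> None:
    """Post one alert to the webhook (Slack, PagerDuty, etc. behind it)."""
    body = json.dumps({"text": alert}).encode()
    req = urllib.request.Request(ALERT_WEBHOOK, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
```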

By proactively setting up alerts and monitoring them through dashboards and other tools of your choice (Slack, Pagerduty, etc.), organizations can truly maximize the potential of data replication for their business.

Conclusion

Data replication holds great promise for organizations. By replicating data to multiple instances, they can ensure data availability and improved performance, as well as internal “insurance” in case of a disaster. This page covers the basics for any business or data engineer getting started with data replication: the variations, schemes and techniques, as well as more advanced content for monitoring the process to gain observability and reduce the potential risk.

Wherever you are on your data replication journey, we recommend auditing your pipelines to ensure data health. If you need help finding and fixing data health issues fast, click here.

The Top Data Quality Metrics You Need to Know (With Examples)

Databand
2022-04-20 14:17:41

Data quality metrics can be a touchy subject, especially within the focus of data observability.

A quick google search will show that data quality metrics involve all sorts of categories. 

For example, completeness, consistency, conformity, accuracy, integrity, timeliness, continuity, availability, reliability, reproducibility, searchability, comparability, and probably ten other categories I forgot to mention all relate to data quality. 

So what are the right metrics to track? Well, we’re glad you asked. 🙂 

We’ve compiled a list of the top data quality metrics that you can use to measure the quality of the data in your environment. Plus, we’ve added a few screenshots that highlight each data quality metric you can view in Databand’s observability platform.

Take a look and let us know what other metrics you think we need to add!


The Top 9 Data Quality Metrics

Metric 1: # of Nulls in Different Columns 

Who’s it for? 

  • Data engineers
  • Data analysts

How to track it? 

Calculate the number of nulls, non-null counts, and null percentages per column so users can set an alert on those metrics.
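
A minimal sketch of how these per-column null metrics might be computed with pandas; the DataFrame and the 25% alert threshold are hypothetical.

```python
import pandas as pd

# Hypothetical data with some missing traffic scores.
df = pd.DataFrame({"listing_id": [1, 2, 3, 4], "traffic_score": [0.7, None, None, 0.9]})

null_counts = df.isna().sum()
null_pct = df.isna().mean() * 100

report = pd.DataFrame({
    "null_count": null_counts,
    "non_null_count": df.notna().sum(),
    "null_pct": null_pct.round(1),
})
print(report)

# Alert when a column crosses a null-percentage threshold (e.g. 25%).
breaches = null_pct[null_pct > 25]
if not breaches.empty:
    print(f"Null-percentage alert for columns: {list(breaches.index)}")
```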

Why is it important?

Since a null is the absence of value, you want to be aware of any nulls that pass through your data workflows. 

For example, downstream processes might be damaged if the data used is now “null” instead of actual data.

Dropped columns

The values of a column might be “dropped” by mistake when the data processes are not performing as expected. 

This might cause the entire column to disappear, which would make the issue easier to see. But sometimes, all of its values will be null.

Data drift

The data of a column might slowly drift into “nullness.” 

This is more difficult to detect than the above since the change is more gradual. Monitoring anomalies in the percentage of nulls across different columns should make it easier to see.

What’s it look like?

[Screenshot: null count metrics in Databand]

Metric 2: Frequency of Schema Changes

Who’s it for?

  • Data engineers
  • Data scientists
  • Data analysts

How to track it? 

Tracking all changes in the schema for all the datasets related to a certain job.
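
A rough sketch of schema-change tracking: diff the current column types against the last recorded snapshot. Both schema dictionaries here are hypothetical.

```python
# Previously recorded schema snapshot vs. the schema observed on the latest run.
previous = {"listing_id": "bigint", "price": "double", "city": "varchar"}
current = {"listing_id": "bigint", "price": "varchar", "zip_code": "varchar"}

added = current.keys() - previous.keys()
removed = previous.keys() - current.keys()
retyped = {c for c in current.keys() & previous.keys() if current[c] != previous[c]}

for change, cols in [("new columns", added), ("removed columns", removed),
                     ("column type changes", retyped)]:
    if cols:
        print(f"Schema change - {change}: {sorted(cols)}")
# Schema change - new columns: ['zip_code']
# Schema change - removed columns: ['city']
# Schema change - column type changes: ['price']
```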

Why is it important?

Schema changes are key signals of bad quality data. 

In a healthy situation, schema changes are communicated in advance and are not frequent since many processes rely on the number of columns and their type in each table to be stable. 

Frequent changes might indicate an unreliable data source and problematic DataOps practices, resulting in downstream data issues.

Examples of changes in the schema can be: 

  • Column type changes
  • New columns 
  • Removed columns

Go beyond having a good understanding of what changed in the schema and evaluate the effect this change will have on downstream pipelines and datasets.

What’s it look like?

[Screenshots: schema change tracking and schema change alert in Databand]

Metric 3: Data Lineage, Affected Processes Downstream

Who’s it for? 

  • Data engineers
  • Data analysts

How to track it? 

Track the data lineage to see which assets appear downstream of a dataset with an issue. This includes datasets and pipelines that consume the upstream dataset’s data.

Why is it important?

The more damaged data assets (datasets or pipelines) there are downstream, the bigger the issue’s impact. This metric helps the data engineer understand the severity of the issue and how quickly it should be fixed.

It is also an important metric for data analysts because most downstream datasets make up their company’s BI reports.

What’s it look like?

[Screenshot: dataset lineage view in Databand]

Metric 4: # of Pipeline Failures 

Who’s it for? 

  • Data engineers
  • Data executives

How to track it? 

Track the number of failed pipelines over time. 

Use tooling to understand why a pipeline failed: root cause analysis through the error widget and logs, plus the ability to dive into all the tasks the DAG contains.

Why is it important?

The more pipelines fail, the more data health issues you’ll have.

Each pipeline failure can cause problems like missing data operations, schema changes, and data freshness issues.

If you’re experiencing many failures, this indicates severe problems at the root that need to be addressed.

What’s it look like?

[Screenshot: pipeline error widget, pipeline, and tasks in Databand]

Metric 5: Pipeline Duration

Who’s it for? 

  • Data engineers

How to track it? 

The team can track this with the Airflow syncer, which reports on the total duration of a DAG run, or by using our tracking context as part of the Databand SDK.

Why is it important?

Pipelines that work in complex data processes are usually expected to have similar duration across different runs. 

In these complex environments, downstream pipelines depend on upstream pipelines processing the data within certain SLAs.

The effect of extreme changes in the pipeline’s duration can be anywhere between the processing of stale data and a failure of downstream processes.
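
A minimal sketch of flagging anomalous run durations against recent history using a simple z-score; the duration values and the threshold of 3 are made up for illustration.

```python
from statistics import mean, stdev

history = [42, 45, 40, 44, 43, 41, 46]   # previous run durations, in minutes
latest = 95                              # duration of today's run

mu, sigma = mean(history), stdev(history)
z = (latest - mu) / sigma
if abs(z) > 3:
    print(f"Pipeline duration anomaly: {latest} min vs typical {mu:.0f} min")
```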

What’s it look like?

[Screenshot: pipeline duration tracking in Databand]

Metric 6: Missing Data Operations

Who’s it for? 

  • Data engineers
  • Data scientists
  • Data analysts
  • Data executives

How to track it? 

Tracking all the operations related to a particular dataset.

A data operation is a combination of a task in a specific pipeline that reads or writes to a table. 

Why is it important?

When a certain data operation is missing, it triggers a chain of issues in your data stack: failed pipelines, schema changes, and delays.

Also, the downstream consumers of this data will be affected by the data that didn’t arrive.  

A few examples include: 

  • The data analyst who is using this data for analysis 
  • The ML models used by the data scientist
  • The data engineers in charge of the data.

What’s it look like?

[Screenshots: missing dataset operation and alert in Databand]

Metric 7: Record Count in a Run

Who’s it for? 

  • Data engineers
  • Data analysts

How to track it? 

Track the number of rows written to a dataset.

Why is it important?

A sudden change in the expected number of table rows signals that too much or too little data is being written.

Using anomaly detection in the number of rows in a dataset provides a good way of checking that nothing suspicious has happened.

What’s it look like?

[Screenshot: record count in a run in Databand]

Metric 8: # of Tasks Read From Dataset

Who’s it for? 

  • Data engineers

How to track it? 

Track how many tasks read from a certain dataset. The more tasks read from it, the more central and important the dataset is.

Why is it important?

Understanding the importance of the dataset is crucial for impact analysis and realizing how fast you should deal with the issue you have.

What’s it look like?

[Screenshot: tasks reading from a dataset in Databand]

Metric 9: Data Freshness (SLA alert)

Who’s it for? 

  • Data engineers
  • Data scientists
  • Data analysts

How to track it? 

Track the pipelines that are scheduled to write to a certain dataset.

Why is it important?

Stale, outdated data can feed downstream reports with wrong information that then gets consumed.

A good way of staying on top of data freshness is to monitor your SLA and get notified of delays in the pipelines that write to the dataset.
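
A minimal sketch of such a freshness check: compare the dataset’s last successful write against the allowed delivery window. The timestamps and the six-hour SLA are hypothetical.

```python
from datetime import datetime, timedelta, timezone

SLA = timedelta(hours=6)  # the dataset should be refreshed at least every 6 hours
last_write = datetime(2022, 4, 20, 3, 0, tzinfo=timezone.utc)  # last successful write

lag = datetime.now(timezone.utc) - last_write
if lag > SLA:
    print(f"Freshness SLA missed: last write was {lag} ago (allowed {SLA})")
```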

What’s it look like?

[Screenshot: SLA / data freshness alert in Databand]

Wrapping it up

And that’s a quick look at some of the top data quality metrics you need to know to deliver more trustworthy data to the business. 

Check out how you can build all these metrics in Databand today.

What is a Data Catalog? Overview and Top Tools to Know

Databand
2022-04-14 12:01:00

Intro to Data Catalogs

A data catalog is an inventory of all of an organization’s data assets, including machine learning models, structured data, unstructured data, data reports, and more. By leveraging data management tools, data analysts, data scientists, and other data users can search through the catalog, find the organizational data they need, and access it.

Governance of data assets in a data catalog is enabled through metadata. The metadata is used for mapping, describing, tagging, and organizing the data assets. As a result, it can be leveraged to enable data consumers to efficiently search through assets and get information on how to use the data. Metadata can also be used for augmenting data management, by enabling onboarding automation, anomalies alerts, auto-scaling, and more.
In addition to indexing the assets, a data catalog usually includes data access and data searching capabilities, as well as tools for enriching the metadata, both manually and automatically. It also provides capabilities for ensuring compliance with privacy regulations and security standards.
In modern organizations, data catalogs have become essential for leveraging the large amounts of data generated. Efficient data analysis and consumption can help organizations make better decisions, so they can optimize operations, build better models, increase sales, and more.

Data Catalog Benefits (Why Do You Need a Data Catalog?)

A data catalog provides multiple benefits to data professionals, business analysts, and organizations. These include:

User Autonomy

Data professionals and other data consumers can find data, evaluate it and understand how to use it – all on their own. With a data catalog, they no longer have to rely on IT or other professional personnel. Instead, they can immediately search for the data they need and use it. This speed and independence enable injecting data into more business operations. It also improves employee morale.

Improved Data Context and Quality

The metadata and comments on the data from other data citizens can help data consumers better understand how to use it. This additional information creates context, improves data quality, and encourages data usage, innovation, and new business ideas.

Organizational Efficiency

Accessible data reduces operational friction and bottlenecks, like back and forth emails, which optimizes the use of organizational resources. Available data also accelerates internal processes. When data consumers get the data and understand how to use it faster, data analysis and implementation take place faster as well, benefiting the business.

Compliance and Security 

Data catalogs ensure data assets comply with privacy standards and security regulations, reducing the risk of data breaches, cyberattacks, and legal fiascos.

New Business Opportunities

When data citizens are given new information they can incorporate into their work and decision-making, they find new ways to answer work challenges and achieve their business goals. This can open up new business opportunities across all departments.

Better Decision Making

Lack of data visibility makes organizations rely on tribal knowledge, rely on data they are already familiar with, or recreate assets that already exist. This creates organizational data silos, which impede productivity. Enabling data access to everyone improves the ability to find and use data consistently and continuously across the organization.

What Does a Data Catalog Contain?

Different data catalogs offer somewhat different features. However, to enable data governance and advanced analysis, they should all provide the following to data consumers:

Metadata

Technical Metadata

The data that describes the structure of the objects, like tables, schemas, columns, rows, file names, etc.

Business Metadata

Data about the business value of the data, like its purpose, compliance info, rating, classification, etc.

Process Metadata

Data about the asset creation process and lineage, like who changed it and when, permissions, latest update time, etc.
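To make this concrete, here’s a minimal, hypothetical sketch of a single catalog entry combining technical, business, and process metadata (the asset name and values are invented; real catalogs store far richer, standardized metadata):

    catalog_entry = {
        "asset": "analytics.orders",
        "technical_metadata": {
            "type": "table",
            "schema": {"order_id": "BIGINT", "amount": "DECIMAL(10,2)", "created_at": "TIMESTAMP"},
            "row_count": 1_204_332,
        },
        "business_metadata": {
            "purpose": "Cleaned orders used for revenue reporting",
            "classification": "internal",
            "rating": 4.5,
        },
        "process_metadata": {
            "created_by": "daily_etl",
            "last_updated": "2022-04-14T02:00:00Z",
            "upstream": ["raw.orders"],
            "permissions": ["analysts", "data-science"],
        },
    }

    # A trivial "search": find assets whose business purpose mentions revenue.
    if "revenue" in catalog_entry["business_metadata"]["purpose"].lower():
        print(f"{catalog_entry['asset']} matches the search")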

Search Capabilities

Searching, browsing, and filtering options to enable data consumers to easily find the relevant data assets.

Metadata Enrichment

The ability to automatically enrich metadata through mappings and connections, as well as letting data citizens manually contribute to the metadata.

Compliance Capabilities

Embedded capabilities that ensure data can be trusted and no sensitive data is exposed. This is important for complying with regulations, standards, and policies. 

Asset Connectivity

The ability to connect to and automatically map all types of data sources your organization uses, wherever they reside.

In addition, technologically advanced and enterprise data catalogs implement AI and machine learning.

Data Catalog Use Cases

Data catalogs can and should be consumed by all people in the organization. Some popular use cases include:

  • Optimizing the data pipeline
  • Data lake modernization
  • Self-service analytics
  • Cloud spend management
  • Advanced analytics
  • Reducing fraud risk
  • Compliance audits
  • And more

Who Uses a Data Catalog?

A data catalog can be used by data-savvy citizens, like data analysts, data scientists and data engineers. But all business employees – product, marketing, sales, customer success, etc – can work with data and benefit from a data catalog. Data catalogs are managed by data stewards.

Top 10 Data Catalog Tools

Here are the top 10 data catalog tools according to G2, as of Q1 2022:

1. AWS

  • Product Name: AWS Glue
  • Product Description: AWS Glue is a serverless data integration service for discovering, preparing, and combining data for analytics, machine learning and application development. Data engineers and ETL developers can visually create, run, and monitor ETL workflows. Data analysts and data scientists can enrich, clean, and normalize data without writing code. Application developers can use familiar Structured Query Language (SQL) to combine and replicate data across different data stores.

2. Aginity

  • Product Name: Aginity
  • Product Description: Aginity provides a SQL coding solution for data analysts, data engineers, and data scientists so they can find, manage, govern, share and re-use SQL rather than recode it.

3. Alation

  • Product Name: Alation Data Catalog
  • Product Description: ​​Alation’s data catalog indexes a wide variety of data sources, including relational databases, cloud data lakes, and file systems using machine learning. Alation enables company-wide access to data and also surfaces recommendations, flags, and policies as data consumers query in a built-in SQL editor or search using natural language. Alation connects to a wide range of popular data sources and BI tools through APIs and an Open Connector SDK to streamline analytics.

4. Collibra

  • Product Name: Collibra Data Catalog
  • Product Description: Collibra ensures teams can quickly find, understand and access data across sources, business applications, BI, and data science tools in one central location. Features include out-of-the-box integrations for common data sources, business applications, BI and data science tools; machine learning-powered automation capabilities; automated relationship mapping; and data governance and privacy capabilities.

5. IBM

  • Product Name: IBM Watson Knowledge Catalog
  • Product Description:  A data catalog tool based on self-service discovery of data, models and more. The cloud-based enterprise metadata repository activates information for AI, machine learning (ML), and deep learning. IBM’s data catalog enables stakeholders to access, curate, categorize and share data, knowledge assets and their relationships, wherever they reside.

6. Appen

  • Product Name: Appen
  • Product Description: Appen provides a licensable data annotation platform for training data use cases in computer vision and natural language processing. In order to create training data, Appen collects and labels images, text, speech, audio, video, and other data. Its Smart Labeling and Pre-Labeling features, which use machine learning, ease human annotation.

7. Denodo

  • Product Name: Denodo
  • Product Description: Denodo provides data virtualization that enables access to the cloud, big data, and unstructured data sources in their original repositories. Denodo enables the building of customized data models for customers and supports multiple viewing formats.

8. Oracle

  • Product Name: Oracle Enterprise Metadata Management 
  • Product Description: Oracle Enterprise Metadata Management harvests metadata from Oracle and third-party data integrations, business intelligence, ETL, big data, database, and data warehousing technologies. It enables business reporting, versioning, and comparison of metadata models, metadata search and browsing, and data lineage and impact analysis reports.

9. Unifi

  • Product Name: Unifi Data Catalog
  • Product Description: A standalone Data Catalog with intuitive natural language search powered by AI, collaboration capabilities for crowd-sourced data quality, views of trusted data, and all fully governed by IT. The Unifi Data Catalog offers data source cataloging, search and discovery capabilities throughout all data locations and structures, auto-generated recommendations to view and explore data sets and similar data sets, integration to catalog Tableau metadata, and the ability to deconstruct TWBX files and see the full lineage of a data source to see how data sets were transformed.

10. BMC

  • Product Name: Catalog Manager for IMS
  • Product Description: A system database that stores metadata about databases and applications. Catalog Manager for IMS enables viewing IMS catalog content, reporting on the control block information in the IMS catalog, and creating jobs to do DBDGENs, PSBGENs, and ACBGENs to populate the catalog.

Data Lakes and Data Catalogs

A data catalog can organize and govern data that reside in repositories, data lakes, data warehouses, or other locations. A data catalog can help organize the unstructured data in the data lake, preventing it from turning into a “data swamp”. As a result, data scientists and data analysts can easily pull data from the lake, evaluate it and use it.

A Data Catalog and Databand

Databand is a proactive observability platform for monitoring and controlling data quality, as early as ingestion. By integrating Databand with your data catalog, you can gain extended lineage, and visualize and observe the data from its source and as it flows through the pipelines all the way to the assets the data catalog maps and governs. As a result, data scientists, engineers and other data professionals can see and understand the complete flow of data, end-to-end.

In addition, by integrating Databand with your data catalog, you can get proactive alerts any time your data quality is affected to increase governance and robustness. This is enabled through Databand’s data quality identification capabilities, combined with how data catalogs map assets to owners. Databand will communicate any data quality issues to the relevant data owners.

What is a Data Mesh Architecture?

Databand
2022-04-12 11:10:00

Intro to Data Mesh

A data mesh is a form of platform architecture. 

The goal of the data mesh in organizing a business’ platforms is to maximize the value of analytical data. This is done by minimizing the time needed to access quality data. A well-designed data mesh delivers cutting-edge efficiency, allowing researchers to quickly access data from any accessible data source within the data mesh system. The data mesh model may replace data lakes as the most popular way to store and retrieve data.

Three components support data mesh architecture: domain-supported data pipeline, data sources, and data infrastructure. There are layers of observability, data governance, and universal interoperability. 

Data mesh systems are useful for businesses with multiple data domains. 

Many companies have data stored in different databases and formats, causing research and analytics problems. Some companies have attempted to resolve these problems by creating a single data warehouse or central data lake and downloading all data to it. This approach has problems of its own, such as working with an inaccurate copy of the original data or with outdated information.

Data mesh can be quite useful for organizations that are expanding quickly and need scalability for their data storage.   

Data mesh architecture allows data access from a number of locations rather than one central data warehouse or data lake. 

(It should be noted that there are situations where it is completely appropriate to build a central data lake as an additional part of the data mesh system.)

The Data Mesh Philosophy

The primary goal of data mesh is to create a system that maximizes the value of analytical data. The data mesh philosophy embraces a constantly changing data landscape, including increasing sources of data, the ability to transform data from one format to another, and improving the response time to change. 

Four principles support the data mesh model:

  1. Federated computational governance.
  2. Domain-oriented and decentralized data ownership, as well as architecture. 
  3. A self-serve platform as part of the data infrastructure.
  4. Data-as-a-product rather than a by-product. 

Governance

Data mesh uses a system called federated computational governance. A federated model includes a cross-domain agreement describing which parts of the governance are managed by the data domains and which are handled by the provider. It is an autonomous system that is normally built and maintained by independent data teams for each domain. (Independent data teams can be made up of in-house staff or outside contractors). To get the maximum value, interoperability between data domains is a necessity. 

The “federation” is a group of people made up of domain owners and the data mesh provider. Using a framework of globalized rules, they decide how best to govern the data mesh system.

Ideally, the governance federation will establish a data governance program that is common for all the domain owners. Domain owners can still develop their own data governance program, but an agreement providing a base level of data quality for the group as a whole will provide more trustworthy distributed data.

Decentralized Data Ownership

The concept of decentralized data ownership describes an architectural model in which data is not owned by a specific domain (department or business partner) but is freely shared with other business domains. 

In the data mesh model, data is not owned or controlled by the people storing it – rather, it is stored and managed by the department or business partner, understanding that the data is meant to be shared. 

The goal of the department or partner storing the data should be to offer it in a way that is easy to access and easy to work with.

The Self-Service Platform

The data mesh self-service platform, part of the architectural design, supports functionality from storage and processing to the data catalog. The self-service platform is an essential feature. The host or provider should supply a development platform that domain engineers can use for integrating the platform into their domain. 

The model supports the use of autonomous domains. A “network” is a group of computers capable of communicating with each other and is needed to create a domain. A domain describes workstations, devices, computers, and database servers sharing data by way of network resources. 

The self-service platform must be domain-agnostic (capable of working with multiple data domains) for the system to work. This allows each domain to be customized as needed. Additionally, the domain’s data engineering teams have the freedom to develop and design solutions for their specific issues. This design provides both flexibility and efficiency.

According to Zhamak Dehghani, the creator of the data mesh model, useful features for the data catalog include:

  • Data governance and standardization
  • Encryption for the data, both at rest and in motion
  • Data discovery, catalog registration, and publishing
  • Data schema
  • Data production lineage
  • Data versioning
  • Data quality metrics
  • Data monitoring, alerting, and logging

Monolithic Data Architectures vs Data Mesh Architecture

A good example of monolithic data architectures is a relational database management system (RDBMS) using a SQL database. The word monolithic means “all in one piece” rather than “too large and unable to be changed.” The phrase ‘monolithic data architectures’ describes a database management system using a variety of integrated software programs that work together to process data. With this design, data is not typically available for sharing with other organizations.

On the other hand, data mesh promotes data democratization and data sharing by allowing data-driven consumers to access data across all associated organizations. This results in more businesses making a profit from the same data. 

A data mesh is decentralized and supports data owners sharing their data, being responsible for their own domains, and handling their own data products and pipelines. Sharing in the data mesh includes making data available in a user-friendly, easily consumable form.

The data mesh supports near-real-time data sharing because the data transmitted between domains uses a “change data capture” (CDC) mechanism.

Data-as-a-Product

The data-as-a-product principle is an important foundation of the data mesh model and is philosophically opposed to data silos. The data mesh philosophy supports sharing data, and the purpose of a data silo is to isolate data. Data silos can be avoided through the use of cross-domain governance (per the federation) and semantic linking of data.

Data-as-a-product (as opposed to data-as-a-service) is used for decision making, developing personalized products, and fraud detection. Data-as-a-service tends to focus more on insights and strategy. Features such as trustworthiness, discoverability, and understandability are necessary for data to be treated as a product.

Preventing Data Silos

Data mesh systems eliminate the use of data silos. Data silos are data collections within an organization that have become isolated. The data they contain is typically available to one department but cannot be accessed by other parts of the business. This undermines good decision-making.

Silos are dangerous because they limit management’s understanding of the business, effectively blocking useful information.

Improved Data Analytics

In the last decade, the use of data analytics has increased steadily. Consequently, businesses are continuously attempting to improve the quality of their data. The data mesh model offers improved data collection and a remarkably efficient way of storing and managing data. It offers clean, accurate data for data analytics.

Data Pipelines

Data pipelines are an important part of the data mesh architectural model. As organizations take on increasingly complex analytic projects, data pipelines can assist in supplying quality data.

The data mesh model supports the total customization of data pipelines.

A data pipeline is made up of a data source, a series of processing steps, and a destination. If the desired data is not located within the data platform, then it is collected at the beginning of the pipeline. After the collection, a number of steps are taken, with each step delivering an output that becomes the input for the next step. 

A data pipeline processes data between the initial ingestion source and the final destination (a toy sketch of such a pipeline appears below). Steps that are common in a data pipeline include:

  • Data transformation 
  • Filtering
  • Augmentation
  • Enrichment
  • Aggregating
  • Grouping
  • Running of algorithms against that data

These pipeline steps can be performed in parallel or in a time-sliced fashion.
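Here’s a toy Python sketch of those steps (the records are invented): it chains filtering, enrichment, and aggregation, with each step’s output becoming the next step’s input:

    raw_events = [
        {"user": "a", "country": "US", "amount": 10},
        {"user": "b", "country": None, "amount": 25},
        {"user": "a", "country": "US", "amount": 15},
    ]

    def filter_valid(rows):
        """Drop rows with missing fields."""
        return [r for r in rows if r["country"] is not None]

    def enrich(rows):
        """Add a derived field."""
        return [{**r, "amount_cents": r["amount"] * 100} for r in rows]

    def aggregate(rows):
        """Group and sum per user."""
        totals = {}
        for r in rows:
            totals[r["user"]] = totals.get(r["user"], 0) + r["amount_cents"]
        return totals

    # Each step's output feeds the next, like stages in a pipeline.
    print(aggregate(enrich(filter_valid(raw_events))))  # {'a': 2500}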

Data Catalogs

A data catalog is the organized inventory of data for an organization. Metadata is used to help businesses organize and manage their data. The data catalog also uses metadata to help with data discovery and data governance. Data catalogs scan metadata automatically, allowing the catalog’s data consumers to seek and find their data. This includes information about the data’s availability, quality, and freshness.

Part of a data catalog’s function is to serve different end-users (data analysts, data scientists, business analysts, etcetera) who probably have different goals. A good data catalog will be user-friendly and flexible enough to adapt to its end-user’s needs. 

As with the data pipeline, the data catalog supports data governance, offering a more thorough process. Data catalogs use a bottom-up approach to create an agile data governance program. People can use data catalogs to document legal obligations and track the life cycle of data.

Data Observability

Another benefit is data observability. It is a part of the data mesh architecture and part of its strategy. Data observability provides a pulse check on the data’s health and is also considered a best practice for businesses. Data observability uses various tools designed to manage and track an organization’s data reliability and quality.

Databand offers a proactive data observability platform that integrates into the data mesh architecture. The platform allows users to identify anomalies and see trends in the pipeline metadata. It can profile column statistics and explain the causes of unreliable data and its impact.

What is a Modern Data Platform? Understanding the Key Components

Databand
2022-04-06 13:48:03

A modern data platform should provide a complete solution for the processing, analyzing, and presentation of data. It is built as a cloud-first, cloud-native platform, and, normally, can be set up within a few hours. A modern data platform is supported not only by technology, but also by the Agile, DevOps, and DataOps philosophies.

Currently, data lakes and data warehouses are popular storage systems, but each comes with some limitations.

Data lakehouses and data mesh storage systems are two new systems attempting to overcome those limitations, and are showing signs of gaining popularity.

The modern data platform typically includes six foundational layers guided by principles of elasticity and availability.

The Philosophies

DevOps and DataOps have two entirely different purposes, but both are similar to the Agile philosophy, which is designed to accelerate project work cycles.

DevOps is focused on product development, while DataOps focuses on creating and maintaining a distributed data architecture system with the goal of creating business value from data.

Agile is a philosophy for software development that promotes speed and efficiency, but without eliminating the “human” factor. It places an emphasis on face-to-face conversations as a way to maximize communications and emphasizes automation as a way to minimize errors.

Data Ingestion

The process of placing data into a storage system for future use is called data ingestion. In simple terms, data ingestion means moving data taken from other sources to a central location. From there the data can be used for record-keeping purposes, or for further processing and analysis. Both analytics systems and downstream reporting rely on accessible, consistent, and accurate data.

Organizations make business decisions using the data from their analytics infrastructure. The value of their data is dependent on how well it is ingested and integrated. If there are problems during the ingestion process, such as missing data, every step of the analytics process will suffer.

Batch processing vs stream processing

Ingesting data can be done in different ways, and the way a particular data ingestion layer is designed can be based on different processing models. Data can come from a variety of distinct sources, ranging from SaaS platforms to the internet of things to mobile devices. A good ingestion model acts as a foundation for an efficient data strategy, and organizations normally choose the model best-suited for the circumstances.

Batch processing is the most common form of data ingestion, but it is not designed to serve data consumers in real time. Instead, it collects and groups source data into batches, which are sent to the destination.

Batch processing may be initiated using a simple schedule, or it may be activated when certain conditions are met. It is often used when real-time data is not needed, as it is usually easier and less expensive than streaming ingestion.

Real-time processing (also referred to as streaming or stream processing) does not group data. Instead, data is obtained, transformed, and loaded as soon as it is recognized. Real-time processing is more expensive because it requires constant monitoring of data sources and accepting new information automatically.

Data Pipelines

Until recently, data ingestion models used an ETL (extract, transform, load) procedure to take data from its source, reformat it, and then transport it to its destination. This made sense when businesses had to use expensive in-house analytics systems, and doing the prep work, including transformations, before delivery lowered costs.

That situation has changed, and modern cloud data warehouses (Snowflake, Google BigQuery, Microsoft Azure, and others) can now cost-effectively scale their computing and storage resources. These improvements allow the preload transformation steps to be dropped, with raw data being delivered to the data warehouse.

At this point, transformations can be expressed in SQL and run within the data warehouse at analysis time. This new processing arrangement has changed ETL to ELT (extract, load, transform).

Instead of extracting the data and then transforming it, with ELT data is transformed “after” it is in the cloud’s data warehouse.
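The sketch below illustrates the ELT ordering, with an in-memory SQLite database standing in for a cloud warehouse (purely for illustration): raw rows are loaded as-is, and the transformation runs afterwards as SQL inside the “warehouse”:

    import sqlite3

    conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse

    # Extract + Load: land the raw data without reshaping it first.
    conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, status TEXT)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?)",
        [(1, 1000, "paid"), (2, 2500, "refunded"), (3, 1800, "paid")],
    )

    # Transform: run SQL inside the warehouse, after loading.
    conn.execute("""
        CREATE TABLE orders_clean AS
        SELECT order_id, amount_cents / 100.0 AS amount_usd
        FROM raw_orders
        WHERE status = 'paid'
    """)

    print(conn.execute("SELECT * FROM orders_clean").fetchall())  # [(1, 10.0), (3, 18.0)]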

Data Transformation

Data transformation deals with changing the values, structure, and format of data. This is often necessary for data analytics projects. Data can be transformed during one of two stages when using a data pipeline, before arriving at its storage destination, or after. Organizations still using on-premises data warehouses will normally use an ETL process.

Today, many organizations are using cloud-based data warehouses. These can scale computing and storage resources as needed. The ability of the cloud to scale allows businesses to bypass the preload transformations and send raw data into the data warehouse. The data is transformed after arriving, using an ELT process, typically when answering a query. 

There are various advantages to transforming data:

  • Usability – Too many organizations sit on piles of unusable, unanalyzed data. Standardizing data and putting it under the right structure allows your data team to generate business value from it.
  • Data quality – Raw data often contains missing values, poorly formatted variables, null rows, etcetera; data transformation can be used to improve data quality.
  • Better organization – Transformed data is easier to process, for both people and computers.

Data Storage and Processing

Currently, the two most popular storage formats are data warehouses and data lakes, and two more formats are gaining in popularity: the data lakehouse and the data mesh. Modern data storage systems are focused on using data efficiently.

The Data Warehouse

Cloud-based data warehouses have been the preferred data storage system for a number of years because they can optimize computing power and processing speeds. They were developed much earlier than data lakes and can be traced back to the 1990s when databases were used for storage. The early versions of data warehouses were in-house and had very limited storage capacity. In 2013, many data warehouses shifted to the cloud and gained scalable storage. 

The Data Lake

Data lakes were originally built on Hadoop, were scalable, and were designed for on-premises use. In January of 2008, Yahoo released Hadoop as an open-source project to the Apache Software Foundation. Unfortunately, the Hadoop ecosystem is extremely complex and difficult to work with. Data lakes began shifting to the cloud around 2015, making them much less expensive and much more user-friendly.

Using a combination of data lakes and data warehouses to minimize their limitations has become a common practice. 

The Data Lakehouse 

Data lakes have problems with “parsing data.” They were originally designed to collect data in its natural format, without enforcing schemas, so that researchers could gain more insights from a broad range of data. Unfortunately, data lakes can become data swamps, full of old, inaccurate, and useless information, which makes them much less effective.

Data warehouses are designed for managing structured data with clear and defined use cases. 

For the data warehouse to function properly, the data must be collected, reformatted, cleaned, and uploaded to the warehouse. Some data, which cannot be reformatted, may be lost. 

The data lakehouse has been designed to merge the strengths of data warehouses and lakes. 

Data lakehouses are a new form of data management architecture. They merge the flexibility, cost-efficiency, and scaling abilities of data lakes with the ACID transactions and data management features of data warehouses. 

Data lakehouses support business intelligence and machine learning. One of the data lakehouse’s strengths is its use of metadata layers. It also uses a new query engine, designed for high-performance SQL searches.

Data Mesh

Data mesh can be quite useful for organizations that are expanding quickly and need scalability for their data storage. 

Data mesh, unlike data warehouses, lakes, and lakehouses, is “decentralized.” Decentralized data ownership is an architectural model where a specific domain (business partners or other departments) does not own their data, but shares data freely with other domains. 

In the data mesh model, data is not owned by the people storing it, but they are responsible for it. The data is stored and organized by the business partner or department with the understanding that it is meant to be shared. This means all data within the data mesh system should maintain a uniform format.

Data mesh systems can be useful for businesses supporting multiple data domains. Within the data mesh design, there is a data governance layer and a layer of observability. There is also a universal interoperability layer. 

Data Observability

Data observability has recently become a hot topic. Data observability describes the ability to watch and observe the state of data and its health. It covers a number of activities and technologies that, when combined, allow the user to identify and resolve data difficulties in near real-time.

Data observability platforms can be used with data warehouses, data lakes, data lakehouses, and data mesh. 

It should be noted Databand has developed what is called a proactive data observability platform capable of catching bad data before it causes damage. 

Observability allows teams to answer specific questions about what is taking place behind the scenes in highly distributed systems. Observability can show where data is moving slowly and what is broken.

Managers and/or teams can be sent alerts about potential problems and pro-actively solve them. (While the predictability feature can be helpful, it will not catch all problems, nor should it be expected to. Think of problem predictions as helpful, but not a guarantee.) 

To make data observability useful, it needs to include these features:

  • SLA Tracking – This feature measures pipeline metadata and data quality against pre-defined standards (a simple example follows this list).
  • Monitoring – A dashboard is provided, showing the operations of your system or pipeline.
  • Logging – Historical records (tracking, comparisons, analysis) of events are kept for comparison with newly discovered anomalies.
  • Alerting – Warnings are sent out for both anomalies and expected events.
  • Analysis – An automated detection process that adapts to your system.
  • Tracking –  Offers the ability to track specific events.
  • Comparisons – Provides a historical background, and anomaly alerts.
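As a toy illustration of the SLA tracking and alerting pieces, the Python sketch below checks observed pipeline metadata against pre-defined standards and emits alerts; the metric names and thresholds are invented, and real observability platforms layer logging, analysis, and historical comparisons on top:

    # Pre-defined standards (SLAs) and the latest observed pipeline metadata.
    slas = {
        "pipeline_duration_sec": {"max": 900},
        "null_ratio_orders_amount": {"max": 0.01},
        "rows_written_orders": {"min": 5_000},
    }

    observed = {
        "pipeline_duration_sec": 1_250,
        "null_ratio_orders_amount": 0.002,
        "rows_written_orders": 12_400,
    }

    def evaluate(slas, observed):
        """Yield an alert message for every metric that violates its SLA."""
        for metric, bounds in slas.items():
            value = observed.get(metric)
            if value is None:
                yield f"ALERT: no value reported for {metric}"
            elif "max" in bounds and value > bounds["max"]:
                yield f"ALERT: {metric}={value} exceeds max {bounds['max']}"
            elif "min" in bounds and value < bounds["min"]:
                yield f"ALERT: {metric}={value} is below min {bounds['min']}"

    for alert in evaluate(slas, observed):
        print(alert)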

For many organizations, observability is siloed, meaning only certain departments can access the data. (This “should not” happen in a data mesh system, which philosophically requires the data to be shared, and is generally discouraged in most storage and processing systems.) Teams collect metadata on the pipelines they own. 

Business Intelligence & Analytics

In 1865, the phrase ‘Business Intelligence’ was used in the Cyclopædia of Commercial and Business Anecdotes. This described how Sir Henry Furnese (who was a banker) profited from the information he gathered, and how he used it before his competition.

Currently, a great deal of business information is gathered from business analytics, as well as data analytics. Analytics is used to generate business intelligence by transforming data into understandable insights which can help to make tactical and strategic business decisions. Business intelligence tools can be used to access and analyze data, providing researchers with detailed intelligence.

Data Discovery

Data discovery involves collecting and evaluating data from different sources. It is often used to gain an understanding of the trends and patterns found in the data. Data discovery is sometimes associated with business intelligence because it can bring together siloed data for analysis. 

Data discovery includes connecting a variety of data sources. It can clean and prepare data, and perform analytics. Inaccessible data is essentially useless data, and data discovery makes it useful. 

Data discovery is about exploring data with visual tools which can help business leaders detect new patterns and anomalies.

What’s Coming Next?

If you search for “Modern Data Platform Trends” in Google, you’ll see many articles discussing trends on what’s next for the data platform. Topics like metadata management, building a metrics layer, and reverse ETL are getting a lot of focus.

However, the trend of data observability seems universally pervasive in all these articles. Data-driven companies can’t afford to constantly question whether or not the data they consume is reliable and trustworthy.

How Google (GCP) Ensures Delivery Velocity in their Data Stack

Databand
2022-04-05 14:53:38

Data stacks enable data integration throughout the entire data pipeline for trustworthy consumption. But how can companies ensure their data stack is both modern and reliable? In this blog post, we discuss these issues, as well as how GCP manages its data stack. 

This blog post is based on a podcast where we hosted Sudhir Hasbe, senior director of product management for all data analytics services at Google Cloud. 

You can listen to the entire episode below or here.

What is a Data Stack?

In today’s world, we have the capacity and ability to track almost any piece of data. But attempting to find relevant information in such huge volumes of data is not always so easy to do. A data stack is a tool suite for data integration. It transforms or loads data into a data warehouse, enables transformation through an engine running on top and provides visibility for building applications.

As companies evolve, they often move towards a modern data stack that is based on a cloud-based warehouse. Such a stack enables real-time, personalized experiences or predictions on SaaS applications by supporting real-time events and decision-making.

How do Modern Data Stacks Support Real-Time Events?

To support real-time events, modern data stacks include the following components:

  • A system for collecting events, like Kafka (see the sketch after this list)
  • A processing system, like a streaming analytics system or a Spark Streaming solution
  • A serving layer
  • A data lake or staging environment where raw data can be pushed and transformed before it gets loaded into a data warehouse
  • A data warehouse for structured data that enables creating machine learning models
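As a small illustration of the event-collection piece, the sketch below uses the kafka-python client to read events from a hypothetical “events” topic on a local broker and drops malformed records before they reach staging; the broker address, topic name, and quality rule are assumptions, not a GCP recommendation:

    import json

    from kafka import KafkaConsumer  # pip install kafka-python

    # Hypothetical local broker and topic.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )

    def is_valid(event):
        """A simple quality gate: require the fields downstream jobs depend on."""
        return isinstance(event, dict) and "user_id" in event and "event_type" in event

    staging = []  # stand-in for a data lake / staging environment
    for message in consumer:
        event = message.value
        if is_valid(event):
            staging.append(event)  # later loaded into the warehouse
        else:
            print(f"Dropping bad event: {event!r}")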

Such an environment is very complex and requires controls to ensure high data quality. Otherwise, bad data will be pulled into different systems and create a poor customer experience. 

Therefore, it is important to ensure data quality is taken into consideration as early as the design phase of the data stack.

How Can You Ensure Data Quality in the Data Stack?

As companies rely on more and more data sources, managing them becomes more complex. Therefore, to ensure data quality throughout the entire pipeline, it becomes important to understand the source of issues, i.e., where they originally come from.

Data engineers who only look at the tables or the dashboards downstream will waste a lot of time trying to find where issues are coming from. They might be able to catch a problem there, but by tracing it back to the source they will be able to debug and solve the issue much more quickly.

By shifting left observability requirements, data engineers can ensure data quality as early as ingestion, and enable much higher delivery velocity.

How Google (GCP) Ensures Delivery Velocity in the Data Stack

One of the main pain points data teams have when building and operationalizing a data stack is how to ensure delivery velocity. 

This is true for both on-prem and cloud-native stacks, but becomes more pressing when companies are required to support real-time events both quickly and with high data quality.

To ensure delivery velocity at GCP, the team implements the following solutions.

End-to-end Pipelines

At GCP, one of the most critical components in the data stack is end-to-end pipelines. Thanks to these pipelines, they can ensure real-time events from various sources are available in their data warehouse, BigQuery. To cover streaming analytics use cases, data is seamlessly connected to BigTable and DataFlow.

Consistent Storage

At GCP, all storage capabilities are unified and consistent in a single data lake, enabling different types of processing on top of it. Thus, each persona can use their own skills and tools to consume it. 

For example, a data engineer could use Java or Python, a data scientist could use notebooks or TensorFlow and an analyst could use other tools to analyze the data.

The Future of Data Management in the Data Stack

Here are three interesting predictions regarding the future of data stacks.

Leverage AI and ML to Improve Data Quality

One of the most interesting ideas when discussing the future of data management is about implementing AI and ML to improve data quality. Often, machine learning is used to improve business metrics. 

At GCP, the team is implementing ML on BigQuery to identify failures. They find it is the only way to detect issues at scale. While this practice hasn’t been widely adopted by many companies yet, it is expected to be in the future.

Less Manual, More Automation 

Automation is predicted to be widely adopted, as a means for managing the huge volumes of data of the future. 

Today, manual management of legacy data platforms is complex, since it is based on components like Hadoop and Spark running on a data warehouse, with manually defined rules and integrations with Jira to enable more personas to run queries.

The result is often hundreds of tickets with false alarms. In the future, automation will cover: 

  • Metric collection
  • Alerts (without having to manually define rules)
  • New data sources and products

It will include automatic data discovery through automated cataloging of all information, including centralized metadata, automated lineage tracking across all systems and metadata lineage. 

This will reduce the number of errors and streamline the process.

Persona Changes are Coming

Finally, we predict that the personas who process and consume the data will change as well. Data consumption will not be limited to data engineers, data scientists and analysts, but will be open to all business employees. 

As a result, in the future, storage will just be a price-performance discussion for customers rather than capability differentiation. 

The future sounds bright! Databand provides data engineers with observability into data sources straight from the source. To get a free trial, click here.

Why data engineers need a single pane of glass for data observability

Databand
2022-03-11 15:13:43

Data engineers manage data from multiple sources and throughout pipelines. But what happens when a data deficiency occurs? Data observability provides engineers with the information and recommendations needed to fix data issues, without them having to comb through huge piles of data. Read on to learn what data observability is and the best way to implement it.

If you’re interested in learning more, you can listen to the podcast this blog post is based on below or here.

What is Data Observability?

In today’s world, we have the capacity and ability to track almost any piece of data. But attempting to find relevant information in such huge volumes of data is not always easy. Data observability is the set of techniques and methodologies that bring the right, relevant level of information about data to data engineers at the right time, so they can understand problems and solve them.

Data observability provides data engineers with metrics and recommendations that help them understand how the system is operating. Through observability, data engineers can better set up systems and pipelines, observe the data as it flows through the pipeline, and investigate how it affects data that is already in their warehouse. In other words, data observability makes it easier for engineers to access their data and act upon any issues that occur. 

With data observability, data engineers can answer questions like:

  • Are my pipelines running with the correct data?
  • What happens to the data as it flows through the pipelines?
  • What does my data look like once it’s in the warehouse, data lake, or lakehouse?

Why We Need Data Observability

Achieving observability is never easy, but ingesting data from multiple sources makes it even harder. Enterprises often work with hundreds of sources, and even nimble startups rely on a considerable number of data sources for their products. Yet, today’s data engineering teams aren’t equipped with tools and resources to manage all that complexity.

As a result, engineers find it difficult to ensure the reliability and quality of the data that is coming in and flowing through the pipelines. Schema changes, missing data, null records, failed pipelines, and more all impact how the business can use data. If engineers can’t identify and fix data deficiencies before they make a business impact, the business can’t rely on the data.

Achieving Data Observability with a Single Pane of Glass

The data ecosystem is fairly new and it is constantly changing. New open source and commercial solutions emerge all the time. As a result, the modern data stack is made up of multiple point solutions for data engineers. These include tools for ETLs, operational analytics, data warehouses, dbt, extraction and loading tools, and more. This fragmentation makes it hard for organizations to manage and monitor their data pipeline.

A recent customer in the cryptocurrency industry said this:

“We spend a lot of time fixing operational issues due to fragmentation of our data stack. Tracking data quality, lineage, and schema changes becomes a nightmare.”

The one solution missing from this stack is a single, overarching operating system for orchestrating, integrating, and monitoring the stack, i.e., a single tool for data observability. A single pane of glass for observability would enable engineers to look at various sources of data in a single place and see what has changed. They could identify changed schemas or faulty columns. Then, they could build automated checks to ensure errors don’t recur.

For engineers, this is a huge time saver. For organizations, this means they can use their data for making decisions.

As we see the data ecosystem flow from fragmentation to consolidation, here are a few features a data observability system should provide data engineers with:

  • Visualization – enabling data engineers to see data reads, writes and lineage throughout the pipeline and the impact  of new data on warehouse data.
  • Supporting all data sources – showing data from all sources, and showing it as early as ingestion
  • Supporting all environments – observing all environments, pre-prod and prod, existing and new
  • Alerts – notifying data engineers about any anomalies, missed data deliveries, irregular volumes, pipeline failures or schema changes and providing recommendations for fixing issues
  • Continuous testing – running through data source, tables and pipelines multiple times a day, and even in real-time if your business case requires it (like in healthcare or gaming)

Databand provides unified visibility for data engineers across all data sources. Learn more here.