
The Impact of Bad Data and Why Observability is Now Imperative

Ryan Yackel
2022-06-02 15:33:05


Think the impact of bad data is just a minor inconvenience? Think again.

Bad data cost Unity, a publicly-traded video game software development company, $110 million.

And that’s only the tip of the iceberg.

The Impact of Bad Data: A Case Study on Unity

Unity stock dropped 37% on May 11, 2022, after the company announced its first-quarter earnings, despite strong revenue growth, decent margins, good customer growth, and continued high performance in dollar-based net expansion.

But there was one data point in Unity’s earnings that was not as positive.

The company also shared that its Operate revenue growth was still up but had slowed due to a fault in its platform that reduced the accuracy of its Audience Pinpointer tool.

The fault in Unity’s platform? Bad data.

Unity ingested bad data from a large customer into its machine learning algorithm, which helps place ads and allows users to monetize their games. This not only resulted in decreased growth but also ruined the algorithm, forcing the company to fix it to remedy the problem going forward.

The company’s management estimated the impact on the business at approximately $110 million in 2022.

Unity Isn’t Alone: The Impact of Bad Data is Everywhere

Unity isn’t the only company that has felt the impact of bad data deeply.

Take Twitter.

On April 25, 2022, Twitter accepted a deal to be purchased by Tesla and SpaceX founder Elon Musk. A mere 18 days later, Musk shared that the deal was “on hold” while he sought to confirm the number of fake accounts and bots on the platform.

What ensued demonstrates the deep impact of bad data on this extremely high-profile deal for one of the world’s most widely-used speech platforms. Notably, Twitter has battled this data problem for years. In 2017, Twitter admitted to overstating its user base for several years, and in 2016 a troll farm used more than 50,000 bots to try to sway the US presidential election. Twitter first acknowledged fake accounts during its 2013 IPO.

Now, this data issue is coming to a head, with Musk investigating Twitter’s claim that fake accounts represent less than 5% of the company’s user base and angling to reduce the previously agreed upon purchase price as a result.

Twitter, like Unity, is another high-profile example of the impact of bad data, but examples like this are everywhere – and it costs companies millions of dollars.

Gartner estimates that bad data costs companies nearly $13 million per year, although many don’t even realize the extent of the impact. Meanwhile, Harvard Business Review finds that knowledge workers spend about half of their time fixing data issues. Just imagine how much effort they could devote elsewhere if issues weren’t so prevalent.

Overall, bad data can lead to missed revenue opportunities, inefficient operations, and poor customer experiences, among other issues that add up to that multi-million dollar price tag.

Why Observability is Now Imperative for the C-Suite

The fact that bad data costs companies millions of dollars each year is bad enough. The fact that many companies don’t even realize this because they don’t measure the impact is potentially even worse. After all, how can you ever fix something of which you’re not fully aware?

Getting ahead of bad data issues requires data observability, which encompasses the ability to understand the health of data in your systems. Data observability is the only way that organizations can truly understand not only the impact of any bad data but also the causes of it – both of which are imperative to fixing the situation and stemming the impact.

It’s also important to embed data observability at every point possible with the goal of finding issues sooner in the pipeline rather than later because the further those issues progress, the more difficult (and more expensive) they become to fix.

Critically, this observability must be an imperative for C-suite leaders, as bad data can have a serious impact on company revenue (just ask Unity and Twitter). Making data observability a priority for the C-suite will help the entire organization – not just data teams – rally around this all-important initiative and make sure it becomes everyone’s responsibility.

This focus on end-to-end data observability can ultimately help:

  • Identify data issues earlier on in the data pipeline to stem their impact on other areas of the platform and/or business
  • Pinpoint data issues more quickly after they pop up to help arrive at solutions faster
  • Understand the extent of data issues that exist to get a complete picture of the business impact

In turn, this visibility can help companies recover more revenue faster by taking the necessary steps to mitigate bad data. Hopefully, the end result is a fix before the issues end up costing millions of dollars. And the only way to make that happen is if everyone, starting with the C-suite, prioritizes data observability.


The Data Value Chain: Data Observability’s Missing Link

Databand
2022-05-18 13:18:01


Data observability is an exploding category. It seems like there is news of another data observability tool receiving funding, an existing tool is announcing expanded functionality, and many new products in the category are being dreamt up. After a bit of poking around, you’ll notice that many of them claim to do the same thing: end-to-end data observability. But what does that really mean and what’s a data value chain?

For data analysts, end-to-end data observability feels like having monitoring capabilities for their warehouse tables  — and if they’re lucky, they have some monitoring for the pipelines that move the data to and from their warehouse as well.

The story is a lot more complicated for many other organizations that are more heavily skewed towards data engineering. For them, that isn’t end-to-end data observability. That’s “The End” data observability. Meaning: this level of observability only gives visibility into the very end of the data’s lifecycle. This is where the data value chain becomes an important concept.

For many data products, data quality is determined at the very beginning, when data is first extracted and enters your system. Therefore, shifting data observability left of the warehouse is the best way to move your data operations from a reactive data quality management framework to a proactive one.


What is the Data Value Chain?

When people think of data, they often think of it as a static object; a point on a chart, a number in a dashboard, or a value in a table. But the truth is data is constantly changing and transforming throughout its lifecycle. And that means what you define as “good data quality” is different for each stage of that lifecycle.

“Good” data quality in a warehouse might be defined by its uptime. Going to the preceding stage in the life cycle, that definition changes. Data quality might be defined by its freshness and format. Therefore, your data’s quality isn’t some static binary. It’s highly dependent on whether things went as expected in the preceding step of its lifecycle.

Shani Keynan, our Product Director, calls this concept the data value chain.

“From the time data is ingested, it’s moving and transforming. So, only looking at the data tables in your warehouse or your data’s source, or only looking at your data pipelines, it just doesn’t make a lot of sense. Looking only at one of those, you don’t have any context.

You need to look at the data’s entire journey. The thing is, when you’re a data-intensive company who’s using lots of external APIs and data sources, that’s a large part of the journey. The more external sources you have, the more vulnerable you are to changes you can’t predict or control. Covering the hard ground first, at the data’s extraction, makes it easier to catch and resolve problems faster since everything downstream depends on those deliveries.”

The question of whether data will drive value for your business is defined by a series of If-Then statements:

  1. If data has been ingested correctly from our data sources, then our data will be delivered to our lake as expected.
  2. If data is delivered & grouped in our lake as expected, then our data will be able to be aggregated & delivered to our data warehouse as expected.
  3. If data is aggregated & delivered to our data warehouse as expected, then the data in our warehouse can be transformed.
  4. If data in our warehouse can be transformed correctly, then our data will be able to be queried and will provide value for the business.
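
To make the chain above concrete, here is a minimal sketch (in Python, with entirely hypothetical check functions) of how those If-Then dependencies can be read as sequential checkpoints, where a failure at any stage invalidates everything after it:

```python
# Hypothetical sketch of the data value chain as dependent checkpoints.
# The stage names and check functions are illustrative, not a real API.
from typing import Callable

def check_ingestion() -> bool:
    return True   # e.g. verify the expected files or API payloads arrived from each source

def check_lake_delivery() -> bool:
    return True   # e.g. verify row counts and partitions landed in the lake as expected

def check_warehouse_load() -> bool:
    return True   # e.g. verify the aggregation job wrote to the warehouse tables

def check_transformations() -> bool:
    return True   # e.g. verify transformed tables pass schema and freshness tests

# Each stage only has meaning if every stage before it succeeded.
value_chain: list[tuple[str, Callable[[], bool]]] = [
    ("ingestion", check_ingestion),
    ("lake delivery", check_lake_delivery),
    ("warehouse load", check_warehouse_load),
    ("transformation", check_transformations),
]

for stage, check in value_chain:
    if not check():
        print(f"Value chain broken at: {stage}. Everything downstream is suspect.")
        break
else:
    print("All stages healthy: data is ready to be queried for business value.")
```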

Let us be clear: this is an oversimplification of the data’s life cycle. That said, it illustrates how having observability only for the tables in your warehouse & the downstream pipelines leaves you in a position of blind faith.

In the ideal world, you would be able to set up monitoring capabilities & data health checkpoints everywhere in your system. This is no small project for most data-intensive organizations; some would even argue it’s impractical.

Realistically, one of the best places to start your observability initiative is at the beginning of the data value chain: the data extraction layer.

Data Value Chain + Shift-left Data Observability

If you are one of these data-driven organizations, how do you set your data team up for success?

While it’s important to have observability of the critical “checkpoints” within your system, the most important checkpoint you can have is at the data collection process. There are two reasons for that:

#1 – Ingesting data from external sources is one of the most vulnerable stages in your data model.

As a data engineer, you have some degree of control over your data & your architecture. But what you don’t control is your external data sources. When you have a data product that depends on external data arriving on time in order to function, that lack of control is an extremely painful experience.

This is best highlighted in an example. Let’s say you are running a large real estate platform called Willow. Willow is a marketplace where users can search for homes and apartments to buy & rent across the United States.

Willow’s goal is to give users all the information they need to make a buying decision; things like listing price, walkability scores, square footage, traffic scores, crime & safety ratings, school system ratings, etc.

In order to calculate “Traffic Score” for just one state in the US, Willow might need to ingest data from 3 external data sources. There are 50 states, so that means you suddenly have 150 external data sources you need to manage. And that’s just for one of your metrics.

Here’s where the pain comes in: You don’t control these sources. You don’t get a say whether they decide to change their API to better fit their data model. You don’t get to decide whether they drop a column from your dataset. You can’t control if they miss one of their data deliveries and leave you hanging.

All of these factors put your carefully crafted data model at risk. All of them can break the downstream pipelines that follow strictly coded logic. And there’s really nothing you can do about it except catch it as early as you can.
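
As an illustration of catching such changes at extraction time, here is a minimal, hypothetical sketch. The expected column set and the fetch_listings function are assumptions made up for the Willow example, not any real source’s API:

```python
# Hypothetical extraction-time check for one external source feeding the
# "Traffic Score" metric. Column names and the fetch function are made up.
EXPECTED_COLUMNS = {"listing_id", "state", "avg_commute_minutes", "congestion_index"}

def fetch_listings() -> list[dict]:
    # Stand-in for a call to the external traffic-data API.
    return [
        {"listing_id": 1, "state": "CA", "avg_commute_minutes": 34, "congestion_index": 0.7},
        {"listing_id": 2, "state": "CA", "avg_commute_minutes": 28},  # source dropped a column
    ]

def validate_extraction(records: list[dict]) -> list[str]:
    problems = []
    if not records:
        problems.append("No records delivered - possible missed delivery.")
    for i, record in enumerate(records):
        missing = EXPECTED_COLUMNS - record.keys()
        if missing:
            problems.append(f"Record {i} is missing columns: {sorted(missing)}")
    return problems

for issue in validate_extraction(fetch_listings()):
    print("EXTRACTION ALERT:", issue)  # in practice, route this to your alerting tool
```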

Having data observability in your data warehouse doesn’t do much to solve this problem. It might alert you that there is bad data in your warehouse, but by that point, it’s already too late.

This brings us to our next point…

#2 – It makes the most sense for your operational flow.

In many large data organizations, data in the warehouse is automatically used in business processes. If something breaks your data collection processes, bad data is populated into your product dashboards and analytics, and the people consuming them have no way of knowing that the data they are being served is no good.

This can lead to some tangible losses. Imagine if there was a problem calculating a Comparative Analysis of home sale prices in the area. Users may lose trust in your data and stop using your product.

In this situation, what does your operational flow for incident management look like?

You receive some complaints from business stakeholders or customers, and then you have to invest a lot of engineering hours to perform root cause analysis, fix the issue, and backfill the data. All the while, consumer trust has gone down and SLAs have already been missed. DataOps is in a reactive position.


When you have data observability for your ingestion layer, there’s still a problem in this situation, but the way DataOps can handle this situation is very different:

  • You know that there will be a problem.
  • You know exactly which data source is causing the problem.
  • You can project how this will affect downstream processes. You can make sure everyone downstream knows that there will be a problem so you can prevent the bad data from being used in the first place.
  • Most importantly, you can get started resolving the problem early & begin working on a way to prevent that from happening again.

You cannot achieve that level of prevention when your data observability starts at your warehouse.

Bottom Line: Time To Shift Left

DataOps is learning many of the same hard lessons that DevOps did. Just as application observability is most effective when shifted left, the same applies to data operations: it saves money, it saves time, and it saves headaches. If you’re ingesting data from many external data sources, your organization cannot afford to focus all its efforts on the warehouse. You need real end-to-end data observability. And luckily, there’s a great data observability platform made to do just that.


Data Replication: The Basics, Risks, and Best Practices

Databand
2022-04-27 13:20:30


Data-driven organizations are poised for success. They can make more efficient and accurate decisions and their employees are not impeded by organizational silos or lack of information. Data replication enables leveraging data to its full extent. But how can organizations maximize the potential of data replication and make sure it helps them meet their goals? Read on for all the answers.

What is Data Replication?

Data replication is the process of copying or replicating data from the main organizational server or cloud instance to other cloud or on-premises instances at different locations. Thanks to data replication, organizational users can access the data they need for their work quickly and easily, wherever they are in the world. In addition, data replication ensures organizations have backups of their data, which is essential in case of an outage or disaster. In other words, data replication creates data availability at low latency.

Data replication can take place either synchronously or asynchronously. Synchronous replication means the data is copied to the main server and all replica servers at the same time. Asynchronous replication means that data is first copied to the main server and only then copied to the replica servers, often at scheduled intervals.

Why Data Replication is Necessary

Data replication ensures that organizational data is always available to all stakeholders. By replicating data across instances, organizations can ensure:

Scalability

Data scalability is the ability to handle changing demands by continuously adapting resources. Replication of data across multiple servers builds scalability and ensures the availability of consistent data to all users at all times.

Disaster Protection

Electrical outages, cybersecurity attacks and natural disasters can cause systems and instances to crash and no longer be available. By replicating data across multiple instances, data is backed up and always accessible to any stakeholder. This ensures system robustness, organizational reliability and security.

Speed / Latency

Data that has to travel across the globe creates latency. This creates a poor user experience, which can be felt especially in real-time based applications like gaming or recommendation systems, or resource-heavy systems like design tools. By distributing the data globally it travels a shorter distance to the end user, which results in increased speed and performance.

Test System Performance

By distributing and synchronizing data across multiple test systems, data becomes more accessible. This availability improves the performance of those test systems.

An Example of Data Replication

Organizations that have multiple branch offices across a number of continents can benefit from data replication. If organizational data only resides in servers in Europe, users from Asia, North America and South America will experience latency when attempting to read the data. But by replicating data across instances in San Francisco, São Paulo, New York, London, Berlin, Prague, Tel Aviv, Hyderabad, Singapore and Melbourne, for example, access times improve significantly for all users.

Data Replication Variations

Types of Data Replication

Replication systems vary. Therefore, it is important to distinguish which type is a good fit for your organizational infrastructure needs and business goals. There are three main types of data replication systems:

Transactional Replication

Transactional replication consists of databases being copied in their entirety from the primary server (the publisher) and sent to secondary servers (subscribers). Any data changes are consistently and continuously updated. Transactional consistency is ensured, which means that data is replicated in real-time and sent from the primary server to secondary servers in the order of their occurrence. As a result, transactional replication makes it easy to track changes and any lost data. This type of replication is commonly used in server-to-server environments.

Snapshot Replication

In the snapshot replication type, a snapshot of the database is distributed from the primary server to the secondary servers. Instead of continuous updates, data is sent as it exists at the time of the snapshot. It is recommended to use this type of replication when there are not many data changes or at the initial synchronization between the publisher and subscriber.

Merge Replication

A merge replication consists of two databases being combined into a single database. As a result, any changes to data can be updated from the publisher to the subscribers. This is a complex type of replication since both parties (the primary server and the secondary servers) can make changes to the data. It is recommended to use this type of replication in a server-to-client environment.

Comparison Table: Transactional Replication vs. Snapshot Replication vs. Merge Replication

Data Replication Table

Schemes of Replication

Replication schemes are the operations and tasks required to perform replication. There are three main replication schemes organizations can choose from:

Full Replication

Full replication occurs when the entire database is copied in its entirety to every site in the distributed system. This scheme improves data availability and accessibility through database redundancy. In addition, performance is improved because global distribution of data reduces latency and accelerates query execution. On the other hand, it is difficult to achieve concurrency and update processes are slow.

Data Replication - Full

Partial Replication

In a partial replication scheme, some sections of the database are replicated across some or all of the sites. The description of these fragments can be found in the replication schema. Partial replication makes it possible to prioritize which data is important and should be replicated, and to distribute resources according to where they are needed.

Data Replication - Partial

No Replication

In this scheme, data is stored on one site only. This enables easily recovering data and achieving concurrency. On the other hand, it negatively impacts availability and performance.

No Data Replication

Techniques of Replication

Replicating data can take place through different techniques. These include:

Full-table Replication

In a full-table replication, all data is copied from the source to the destination. This includes new data, as well as existing data. It is recommended to use this technique if records are regularly deleted or if other techniques are technically impossible. On the other hand, this technique requires more processing and network resources and the cost is higher.

Key-based Replication

In key-based replication, only data that has been added or changed since the previous update is copied. This technique is more efficient since fewer rows are copied. On the other hand, it cannot capture hard deletes: rows removed at the source since the previous update are simply never seen.
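
As a rough sketch of what key-based replication boils down to, the example below copies only rows whose update key has advanced past a saved bookmark. SQLite, the table name, and the bookmark value are stand-ins for illustration; real replication tools manage this for you:

```python
# Minimal sketch of key-based (incremental) replication, using SQLite purely
# for illustration; table, column, and bookmark values are made up.
import sqlite3

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE listings (id INTEGER, price INTEGER, updated_at TEXT)")
source.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [(1, 500000, "2022-04-01"), (2, 350000, "2022-04-10"), (3, 420000, "2022-04-20")],
)
target.execute("CREATE TABLE listings (id INTEGER, price INTEGER, updated_at TEXT)")

last_replicated = "2022-04-05"  # bookmark saved by the previous run

# Only rows changed since the bookmark are copied: fewer rows than a full-table copy,
# but rows hard-deleted at the source are never noticed.
rows = source.execute(
    "SELECT id, price, updated_at FROM listings WHERE updated_at > ?", (last_replicated,)
).fetchall()
target.executemany("INSERT INTO listings VALUES (?, ?, ?)", rows)

print(f"Replicated {len(rows)} new or updated rows")  # -> 2
```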

Log-based Replication

Log-based replication replicates changes to the database by reading the database’s log file. It applies only to database sources and has to be supported by the source database. This technique is recommended when the source database structure is static; otherwise it can become a very resource-intensive process.

Cloud Migration + Data Replication

When organizations digitally transform their infrastructure and migrate to the cloud, data can be replicated to cloud instances. By replicating data to the cloud, organizations can enjoy its benefits: scalability, global accessibility, data availability and easier maintenance. This means organizational users benefit from data that is more accessible, usable and reliable, which eliminates internal silos and increases business agility.

Data Risks in the Replication Process

When replicating data to the cloud, it is important to monitor the process. The growing complexity of data systems as well as the increased physical distance between servers within a system could pose some risks.

These risks include:

Inconsistency

Data schema and data profiling anomalies, like null counts, type changes and skew.

Data Loss

Ensuring all data has been migrated from the sources to the instances.

Delays

Data not being successfully migrated on time.

Data Replication Management + Observability

By implementing a management system to oversee and monitor the replication process, organizations can significantly reduce the risks involved. A data observability platform will ensure:

  • Data is successfully replicated to other instances, including cloud instances
  • Replication and migration pipelines are performing as expected
  • Alerts are raised on broken pipelines or irregular data volumes so they can be fixed
  • Data is delivered on time
  • Delivered data is reliable, so organizational stakeholders can use it for analytics

Monitoring

By monitoring the data pipelines that take part in the replication process, organizations and their DataOps engineer can ensure the data propagated through the pipeline is accurate, complete and reliable. This ensures data replicated to all instances can be reliably used by stakeholders. An effective monitoring system will be:

  • Granular – specifically indicating where the issue is
  • Persistent – following lineage to understand where errors began
  • Automated – reducing manual errors and enabling the use of thresholds
  • Ubiquitous – covering the pipeline end-to-end
  • Timely – enabling catching errors on time before they have an impact

Learn more about data monitoring here.

Tracking

Tracking pipelines enables systematic troubleshooting, so that any errors are identified and fixed on time. This ensures users constantly benefit from updated, reliable and healthy data in their analyses. There are various types of metadata that can be tracked, like task duration, task status, when data was updated, and more. By tracking and alerting (see below) in case of anomalies, DataOps engineers ensure data health.

Alerting

Alerting on data and data pipeline anomalies is an essential step that closes the observability loop. Alerting DataOps engineers gives them the opportunity to fix any data health issues that might affect data replication across various instances.

Within existing data systems, data engineers can trigger alerts for:

  • Missed data deliveries
  • Schema changes that are unexpected
  • SLA misses
  • Anomalies in column-level statistics like nulls and distributions
  • Irregular data volumes and sizes
  • Pipeline failures, inefficiencies, and errors

By proactively setting up alerts and monitoring them through dashboards and other tools of your choice (Slack, Pagerduty, etc.), organizations can truly maximize the potential of data replication for their business.
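
As a tool-agnostic illustration, alert rules like these can be thought of as simple predicates over run and dataset metrics. The metric names and thresholds below are assumptions for the example, not any specific product’s configuration:

```python
# Illustrative alert rules over pipeline and dataset metrics; the metric names
# and thresholds are assumptions, not any specific product's configuration.
run_metrics = {
    "rows_written": 120,
    "null_ratio_price": 0.35,
    "schema_changed": True,
    "finished_minutes_late": 95,
}

alert_rules = [
    ("Irregular data volume", lambda m: m["rows_written"] < 1000),
    ("Column null anomaly", lambda m: m["null_ratio_price"] > 0.10),
    ("Unexpected schema change", lambda m: m["schema_changed"]),
    ("SLA miss", lambda m: m["finished_minutes_late"] > 60),
]

for name, rule in alert_rules:
    if rule(run_metrics):
        print(f"ALERT: {name}")  # in practice, push to Slack, PagerDuty, a dashboard, etc.
```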

Conclusion

Data replication holds great promise for organizations. By replicating data to multiple instances, they can ensure data availability and improved performance, as well as internal “insurance” in case of a disaster. This page covers the basics for any business or data engineer getting started with data replication: the variations, schemes and techniques, as well as more advanced content for monitoring the process to gain observability and reduce the potential risk.

Wherever you are on your data replication journey, we recommend auditing your pipelines to ensure data health. If you need help finding and fixing data health issues fast, click here.

The Top Data Quality Metrics You Need to Know (With Examples)

Databand
2022-04-20 14:17:41


Data quality metrics can be a touchy subject, especially within the focus of data observability.

A quick Google search will show that data quality metrics involve all sorts of categories.

For example, completeness, consistency, conformity, accuracy, integrity, timeliness, continuity, availability, reliability, reproducibility, searchability, comparability, and probably ten other categories I forgot to mention all relate to data quality.

So what are the right metrics to track? Well, we’re glad you asked. 🙂

We’ve compiled a list of the top data quality metrics that you can use to measure the quality of the data in your environment. Plus, we’ve added a few screenshots that highlight each data quality metric you can view in Databand’s observability platform.

Take a look and let us know what other metrics you think we need to add!


The Top 9 Data Quality Metrics

Metric 1: # of Nulls in Different Columns

Who’s it for?

  • Data engineers
  • Data analysts

How to track it?

Calculate the number of nulls, non-null counts, and null percentages per column so users can set an alert on those metrics.
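
For instance, if the batch being profiled is available as a pandas DataFrame, all three numbers can be computed per column in a few lines (the DataFrame here is a made-up stand-in for whatever table you are checking):

```python
# Per-column null count, non-null count, and null percentage with pandas.
import pandas as pd

df = pd.DataFrame(
    {"price": [500000, None, 420000, None], "state": ["CA", "NY", None, "TX"]}
)

profile = pd.DataFrame(
    {
        "null_count": df.isna().sum(),
        "non_null_count": df.notna().sum(),
        "null_pct": (df.isna().mean() * 100).round(1),
    }
)
print(profile)
#        null_count  non_null_count  null_pct
# price           2               2      50.0
# state           1               3      25.0
```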

Why it’s important?

Since a null is the absence of value, you want to be aware of any nulls that pass through your data workflows.

For example, downstream processes might be damaged if the data used is now “null” instead of actual data.

Dropped columns

The values of a column might be “dropped” by mistake when the data processes are not performing as expected.

This might cause the entire column to disappear, which would make the issue easier to see. But sometimes, all of its values will be null.

Data drift

The data of a column might slowly drift into “nullness.”

This is more difficult to detect than the above since the change is more gradual. Monitoring anomalies in the percentage of nulls across different columns should make it easier to see.

What’s it look like?

Data Quality Metrics Null Count

Metric 2: Frequency of Schema Changes

Who’s it for?

  • Data engineers
  • Data scientists
  • Data analysts

How to track it?

Track all changes in the schema for all the datasets related to a certain job.
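
One minimal way to picture this: keep a snapshot of each dataset’s column-to-type mapping and diff it against the previous run. The hard-coded snapshots below are purely illustrative:

```python
# Compare the previous and current schema snapshots of a dataset.
previous_schema = {"id": "INTEGER", "price": "INTEGER", "state": "VARCHAR"}
current_schema = {"id": "INTEGER", "price": "FLOAT", "city": "VARCHAR"}

added = current_schema.keys() - previous_schema.keys()
removed = previous_schema.keys() - current_schema.keys()
type_changes = {
    col: (previous_schema[col], current_schema[col])
    for col in previous_schema.keys() & current_schema.keys()
    if previous_schema[col] != current_schema[col]
}

if added or removed or type_changes:
    print("Schema change detected:")
    print("  new columns:", sorted(added))        # ['city']
    print("  removed columns:", sorted(removed))  # ['state']
    print("  type changes:", type_changes)        # {'price': ('INTEGER', 'FLOAT')}
```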

Why it’s important?

Schema changes are key signals of bad quality data.

In a healthy situation, schema changes are communicated in advance and are not frequent since many processes rely on the number of columns and their type in each table to be stable.

Frequent changes might indicate an unreliable data source and problematic DataOps practices, resulting in downstream data issues.

Examples of changes in the schema can be:

  • Column type changes
  • New columns
  • Removed columns

Go beyond having a good understanding of what changed in the schema and evaluate the effect this change will have on downstream pipelines and datasets.

What’s it look like?

Data Quality Metrics Schema change

Metric 3: Data Lineage, Affected Processes Downstream

Who’s it for?

  • Data engineers
  • Data analysts

How to track it?

Track the data lineage to see which assets appear downstream from a dataset with an issue. This includes datasets and pipelines that consume the upstream dataset’s data.
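
A simple way to reason about this metric is to treat lineage as a graph of datasets and pipelines and walk everything downstream of the asset that has the issue. The graph below is a made-up example, not a real lineage store:

```python
# Walk a lineage graph to find every asset downstream of a dataset with an issue.
from collections import deque

# Edges point from an upstream asset to the assets that consume it (made-up example).
lineage = {
    "raw_listings": ["clean_listings_pipeline"],
    "clean_listings_pipeline": ["listings_table"],
    "listings_table": ["traffic_score_pipeline", "bi_sales_report"],
    "traffic_score_pipeline": ["traffic_score_table"],
}

def downstream_assets(start: str) -> set[str]:
    affected, queue = set(), deque([start])
    while queue:
        for consumer in lineage.get(queue.popleft(), []):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected

print(downstream_assets("raw_listings"))
# e.g. {'clean_listings_pipeline', 'listings_table', 'bi_sales_report',
#       'traffic_score_pipeline', 'traffic_score_table'} (set order varies)
```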

Why it’s important?

The more damaged data assets (datasets or pipelines) downstream, the bigger the issue’s impact. This metric helps the data engineer understand the severity of the issue and how quickly it should be fixed.

It is also an important metric for data analysts because most downstream datasets make up their company’s BI reports.

What’s it look like?

Data Quality Metrics Lineage

Metric 4: # of Pipeline Failures

Who’s it for?

  • Data engineers
  • Data executives

How to track it?

Track the number of failed pipelines over time.

Use tools to understand why a pipeline failed: root cause analysis through the error widget and logs, plus the ability to dive into all the tasks that the DAG contains.
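
As a small, tool-agnostic illustration, the failure count can be rolled up from run records and compared against a threshold per pipeline. The run history and the threshold below are invented for the example:

```python
# Count failed runs per pipeline and flag pipelines with a high failure rate.
from collections import Counter

runs = [  # illustrative run history: (pipeline, final state)
    ("ingest_listings", "success"), ("ingest_listings", "failed"),
    ("ingest_listings", "failed"), ("traffic_score", "success"),
    ("traffic_score", "success"), ("traffic_score", "failed"),
]

total = Counter(pipeline for pipeline, _ in runs)
failed = Counter(pipeline for pipeline, state in runs if state == "failed")

for pipeline in total:
    rate = failed[pipeline] / total[pipeline]
    if rate > 0.25:  # arbitrary threshold for the example
        print(f"{pipeline}: {failed[pipeline]}/{total[pipeline]} runs failed ({rate:.0%})")
```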

Why it’s important?

The more pipelines fail, the more data health issues you’ll have.

Each pipeline failure causes issues like missing data operations, schema changes, and data freshness issues.

If you’re experiencing many failures, this indicates severe problems at the root that needs to be addressed.

What’s it look like?

Data Quality Metrics Error widget, pipeline, tasks

Metric 5: Pipeline Duration

Who’s it for?

  • Data engineers

How to track it?

The team can track this with the Airflow syncer, which reports on the total duration of a DAG run, or by using our tracking context as part of the Databand SDK.
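
Independent of any particular syncer or SDK, duration tracking comes down to comparing each run’s duration against that pipeline’s own history. A minimal sketch with made-up durations and a simple z-score check (the threshold is arbitrary):

```python
# Flag a DAG run whose duration deviates strongly from that pipeline's history.
from statistics import mean, stdev

past_durations_min = [42, 45, 40, 44, 43, 41]   # previous runs, in minutes (made up)
latest_duration_min = 78

mu, sigma = mean(past_durations_min), stdev(past_durations_min)
z_score = (latest_duration_min - mu) / sigma

if abs(z_score) > 3:  # a common, if arbitrary, threshold
    print(f"Duration anomaly: {latest_duration_min} min vs ~{mu:.0f} min typical (z={z_score:.1f})")
```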

Why it’s important?

Pipelines that work in complex data processes are usually expected to have similar duration across different runs.

In these complex environments, pipelines downstream depend on upstream pipelines processing the data in certain SLAs.

The effect of extreme changes in the pipeline’s duration can be anywhere between the processing of stale data and a failure of downstream processes.

What’s it look like?

Data Quality Metrics Pipeline duration

Metric 6: Missing Data Operations

Who’s it for?

  • Data engineers
  • Data scientists
  • Data analysts
  • Data executives

How to track it?

Track all the operations related to a particular dataset.

A data operation is a combination of a task in a specific pipeline that reads or writes to a table.

Why it’s important?

When a certain data operation is missing, a chain of issues is triggered in your data stack. It can cause pipeline failures, schema changes, and delays.

Also, the downstream consumers of this data will be affected by the data that didn’t arrive.

A few examples include:

  • The data analyst who is using this data for analysis
  • The ML models used by the data scientist
  • The data engineers in charge of the data.

What’s it look like?

Data Quality Metrics Missing dataset
Data Quality Metrics dbnd alert

Metric 7: Record Count in a Run

Who’s it for?

  • Data engineers
  • Data analysts

How to track it?

Track the number of rows written to a dataset.

Why it’s important?

A sudden change in the expected number of table rows signals that too much or too little data is being written.

Using anomaly detection on the number of rows written to a dataset provides a good way of checking that nothing suspicious has happened.
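
A minimal sketch of that check, assuming you keep the row counts of recent runs (all numbers below are made up for the example):

```python
# Flag a run whose written-row count falls far outside the recent range.
from statistics import median

recent_row_counts = [98000, 101500, 99700, 102300, 100800]  # previous runs (made up)
latest_row_count = 12400

baseline = median(recent_row_counts)
deviation = abs(latest_row_count - baseline) / baseline

if deviation > 0.5:  # more than 50% off the recent median (arbitrary for the example)
    print(f"Row count anomaly: wrote {latest_row_count:,} rows vs a typical ~{int(baseline):,}")
```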

What’s it look like?

Data Quality Metrics Record count in a run

Metric 8: # of Tasks Read From Dataset

Who’s it for?

  • Data engineers

How to track it?

Track how many tasks read from a certain dataset. The more tasks read from it, the more central and important that dataset is.

Why it’s important?

Understanding the importance of a dataset is crucial for impact analysis and for knowing how quickly you should deal with an issue when it appears.

What’s it look like?

Data Quality Metrics - Tasks Read from Dataset

Metric 9: Data Freshness (SLA alert)

Who’s it for?

  • Data engineers
  • Data scientists
  • Data analysts

How to track it?

Track the scheduled pipelines that write to a certain dataset and whether they deliver on time.

Why it’s important?

Stale, outdated data can feed downstream reports with wrong information that then gets consumed by the business.

A good way of tracking data freshness is to monitor your SLA and get notified of delays in the pipelines that should write to the dataset.
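
In its simplest form, the check compares the dataset’s last successful update against the agreed delivery window. A hedged sketch, with the SLA and timestamp invented for the example:

```python
# Check whether a dataset's last successful update is within its freshness SLA.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=6)  # agreed delivery window (made up)
last_updated = datetime(2022, 4, 20, 3, 0, tzinfo=timezone.utc)  # from pipeline metadata

age = datetime.now(timezone.utc) - last_updated
if age > FRESHNESS_SLA:
    print(f"SLA alert: dataset was last updated {age} ago, allowed window is {FRESHNESS_SLA}")
```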

What’s it look like?

Data Quality Metrics SLA alert

Wrapping it up

And that’s a quick look at some of the top data quality metrics you need to know to deliver more trustworthy data to the business.

Check out how you can build all these metrics in Databand today.

What is a Data Catalog? Overview and Top Tools to Know

Databand
2022-04-14 12:01:00


Intro to Data Catalogs

A data catalog is an inventory of all of an organization’s data assets. A data catalog includes assets like machine learning models, structured data, unstructured data, data reports, and more. By leveraging data management tools, data analysts, data scientists, and other data users can search through the catalog, find the organizational data they need, and access it.

Governance of data assets in a data catalog is enabled through metadata. The metadata is used for mapping, describing, tagging, and organizing the data assets. As a result, it can be leveraged to enable data consumers to efficiently search through assets and get information on how to use the data. Metadata can also be used for augmenting data management by enabling onboarding automation, anomaly alerts, auto-scaling, and more.

In addition to indexing the assets, a data catalog usually includes data access and data searching capabilities, as well as tools for enriching the metadata, both manually and automatically. It also provides capabilities for ensuring compliance with privacy regulations and security standards.

In modern organizations, data catalogs have become essential for leveraging the large amounts of data generated. Efficient data analysis and consumption can help organizations make better decisions, so they can optimize operations, build better models, increase sales, and more.

Data Catalog Benefits (Why Do You Need a Data Catalog?)

A data catalog provides multiple benefits to data professionals, business analysts, and organizations. These include:

User Autonomy

Data professionals and other data consumers can find data, evaluate it and understand how to use it – all on their own. With a data catalog, they no longer have to rely on IT or other professional personnel. Instead, they can immediately search for the data they need and use it. This speed and independence enable injecting data into more business operations. It also improves employee morale.

Improved Data Context and Quality

The metadata and comments on the data from other data citizens can help data consumers better understand how to use it. This additional information creates context, improves data quality, and encourages data usage, innovation, and new business ideas.

Organizational Efficiency

Accessible data reduces operational friction and bottlenecks, like back and forth emails, which optimizes the use of organizational resources. Available data also accelerates internal processes. When data consumers get the data and understand how to use it faster, data analysis and implementation take place faster as well, benefiting the business.

Compliance and Security

Data catalogs ensure data assets comply with privacy standards and security regulations, reducing the risk of data breaches, cyberattacks, and legal fiascos.

New Business Opportunities

When data citizens are given new information they can incorporate into their work and decision-making, they find new ways to answer work challenges and achieve their business goals. This can open up new business opportunities across all departments.

Better Decision Making

Lack of data visibility makes organizations rely on tribal knowledge, rely on data they are already familiar with, or recreate assets that already exist. This creates organizational data silos, which impede productivity. Enabling data access to everyone improves the ability to find and use data consistently and continuously across the organization.

What Does a Data Catalog Contain?

Different data catalogs offer somewhat different features. However, to enable data governance and advanced analysis, they should all provide the following to data consumers:

Metadata

Technical Metadata

The data that describes the structure of the objects, like tables, schemas, columns, rows, file names, etc.

Business Metadata

Data about the business value of the data, like its purpose, compliance info, rating, classification, etc.

Process Metadata

Data about the asset creation process and lineage, like who changed it and when, permissions, latest update time, etc.
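
To make the three flavors of metadata concrete, here is an illustrative, product-agnostic sketch of a single catalog entry that carries technical, business, and process metadata together (all field names and values are made up):

```python
# One illustrative catalog entry combining the three metadata flavors.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    # Technical metadata: structure of the asset
    name: str
    columns: dict[str, str]
    # Business metadata: purpose, classification, compliance notes
    description: str
    classification: str
    # Process metadata: lineage and change history
    owner: str
    last_updated: str
    upstream_sources: list[str] = field(default_factory=list)

entry = CatalogEntry(
    name="sales.daily_orders",
    columns={"order_id": "BIGINT", "amount": "DECIMAL(10,2)", "region": "VARCHAR"},
    description="Daily order totals used by the revenue dashboard",
    classification="internal",
    owner="data-platform-team",
    last_updated="2022-04-14T08:00:00Z",
    upstream_sources=["raw.orders", "raw.refunds"],
)
print(entry.name, "->", sorted(entry.columns))
```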

Search Capabilities

Searching, browsing, and filtering options to enable data consumers to easily find the relevant data assets.

Metadata Enrichment

The ability to automatically enrich metadata through mappings and connections, as well as letting data citizens manually contribute to the metadata.

Compliance Capabilities

Embedded capabilities that ensure data can be trusted and no sensitive data is exposed. This is important for complying with regulations, standards, and policies.

Asset Connectivity

The ability to connect to and automatically map all types of data sources your organization uses, at the locations they reside at.

In addition, in technologically advanced and enterprise data catalogs, AI and machine learning are implemented.

Data Catalog Use Cases

Data catalogs can and should be consumed by all people in the organization. Some popular use cases include:

  • Optimizing the data pipeline
  • Data lake modernization
  • Self-service analytics
  • Cloud spend management
  • Advanced analytics
  • Reducing fraud risk
  • Compliance audits
  • And more

Who Uses a Data Catalog?

A data catalog can be used by data-savvy citizens, like data analysts, data scientists and data engineers. But all business employees – product, marketing, sales, customer success, etc – can work with data and benefit from a data catalog. Data catalogs are managed by data stewards.

Top 10 Data Catalog Tools

Here are the top 10 data catalog tools according to G2, as of Q1 2022:

1. AWS

  • Product Name: AWS Glue
  • Product Description: AWS Glue is a serverless data integration service for discovering, preparing, and combining data for analytics, machine learning and application development. Data engineers and ETL developers can visually create, run, and monitor ETL workflows. Data analysts and data scientists can enrich, clean, and normalize data without writing code. Application developers can use familiar Structured Query Language (SQL) to combine and replicate data across different data stores.

2. Aginity

  • Product Name: Aginity
  • Product Description: Aginity provides a SQL coding solution for data analysts, data engineers, and data scientists so they can find, manage, govern, share and re-use SQL rather than recode it.

3. Alation

  • Product Name: Alation Data Catalog
  • Product Description: ​​Alation’s data catalog indexes a wide variety of data sources, including relational databases, cloud data lakes, and file systems using machine learning. Alation enables company-wide access to data and also surfaces recommendations, flags, and policies as data consumers query in a built-in SQL editor or search using natural language. Alation connects to a wide range of popular data sources and BI tools through APIs and an Open Connector SDK to streamline analytics.

4. Collibra

  • Product Name: Collibra Data Catalog
  • Product Description: Collibra ensures teams can quickly find, understand and access data across sources, business applications, BI, and data science tools in one central location. Features include out-of-the-box integrations for common data sources, business applications, BI and data science tools; machine learning-powered automation capabilities; automated relationship mapping; and data governance and privacy capabilities.

5. IBM

  • Product Name: IBM Watson Knowledge Catalog
  • Product Description:  A data catalog tool based on self-service discovery of data, models and more. The cloud-based enterprise metadata repository activates information for AI, machine learning (ML), and deep learning. IBM’s data catalog enables stakeholders to access, curate, categorize and share data, knowledge assets and their relationships, wherever they reside.

6. Appen

  • Product Name: Appen
  • Product Description: Appen provides a licensable data annotation platform for training data use cases in computer vision and natural language processing. In order to create training data, Appen collects and labels images, text, speech, audio, video, and other data. Its Smart Labeling and Pre-Labeling features that use Machine Learning ease human annotations.

7. Denodo

  • Product Name: Denodo
  • Product Description: Denodo provides data virtualization that enables access to the cloud, big data, and unstructured data sources in their original repositories. Denodo enables the building of customized data models for customers and supports multiple viewing formats.

8. Oracle

  • Product Name: Oracle Enterprise Metadata Management
  • Product Description: Oracle Enterprise Metadata Management harvests metadata from Oracle and third-party data integrations, business intelligence, ETL, big data, database, and data warehousing technologies. It enables business reporting, versioning, and comparison of metadata models, metadata search and browsing, and data lineage and impact analysis reports.

9. Unifi

  • Product Name: Unifi Data Catalog
  • Product Description: A standalone Data Catalog with intuitive natural language search powered by AI, collaboration capabilities for crowd-sourced data quality, views of trusted data, and all fully governed by IT. The Unifi Data Catalog offers data source cataloging, search and discovery capabilities throughout all data locations and structures, auto-generated recommendations to view and explore data sets and similar data sets, integration to catalog Tableau metadata, and the ability to deconstruct TWBX files and see the full lineage of a data source to see how data sets were transformed.

10. BMC

  • Product Name: Catalog Manager for IMS
  • Product Description: A system database that stores metadata about databases and applications. Catalog Manager for IMS enables viewing IMS catalog content, reporting on the control block information in the IMS catalog, and creating jobs to do DBDGENs, PSBGENs, and ACBGENs to populate the catalog.

Data Lakes and Data Catalogs

A data catalog can organize and govern data that reside in repositories, data lakes, data warehouses, or other locations. A data catalog can help organize the unstructured data in the data lake, preventing it from turning into a “data swamp”. As a result, data scientists and data analysts can easily pull data from the lake, evaluate it and use it.

A Data Catalog and Databand

Databand is a proactive observability platform for monitoring and controlling data quality, as early as ingestion. By integrating Databand with your data catalog, you can gain extended lineage, and visualize and observe the data from its source and as it flows through the pipelines all the way to the assets the data catalog maps and governs. As a result, data scientists, engineers and other data professionals can see and understand the complete flow of data, end-to-end.

In addition, by integrating Databand with your data catalog, you can get proactive alerts any time your data quality is affected to increase governance and robustness. This is enabled through Databand’s data quality identification capabilities, combined with how data catalogs map assets to owners. Databand will communicate any data quality issues to the relevant data owners.

What is a Data Mesh Architecture?

Databand
2022-04-12 11:10:00


Intro to Data Mesh

A data mesh is a form of platform architecture.

The goal of the data mesh in organizing a business’ platforms is to maximize the value of analytical data. This is done by minimizing the time needed to access quality data. A well-designed data mesh delivers cutting-edge efficiency, allowing researchers to quickly access data from any data accessible source within the data mesh system. The data mesh model may replace data lakes as the most popular way to store and retrieve data.

Three components support data mesh architecture: domain-supported data pipeline, data sources, and data infrastructure. There are layers of observability, data governance, and universal interoperability.

Data mesh systems are useful for businesses with multiple data domains.

Many companies have data stored in different databases and formats, causing research and analytics problems. Some companies have attempted to resolve these problems by creating a single data warehouse or central data lake and downloading all data to it. This approach creates its own problems, such as working from an inaccurate copy of the original data and from outdated information.

Data mesh can be quite useful for organizations that are expanding quickly and need scalability for their data storage.

Data mesh architecture allows data access from a number of locations rather than one central data warehouse or data lake.

(It should be noted that there are situations where it is completely appropriate to build a central data lake as an additional part of the data mesh system.)

The Data Mesh Philosophy

The primary goal of data mesh is to create a system that maximizes the value of analytical data. The data mesh philosophy embraces a constantly changing data landscape, including increasing sources of data, the ability to transform data from one format to another, and improving the response time to change.

Four principles support the data mesh model:

  1. Federated computational governance.
  2. Domain-oriented and decentralized data ownership, as well as architecture.
  3. A self-serve platform as part of the data infrastructure.
  4. Data-as-a-product rather than a by-product.

Governance

Data mesh uses a system called federated computational governance. A federated model includes a cross-domain agreement describing which parts of the governance are managed by the data domains and which are handled by the provider. It is an autonomous system that is normally built and maintained by independent data teams for each domain. (Independent data teams can be made up of in-house staff or outside contractors). To get the maximum value, interoperability between data domains is a necessity.

The “federation” is a group of people made up of domain owners and the data mesh provider. Using a framework of global rules, they decide how best to govern the data mesh system.

Ideally, the governance federation will establish a data governance program that is common for all the domain owners. Domain owners can still develop their own data governance program, but an agreement providing a base level of data quality for the group as a whole will provide more trustworthy distributed data.

Decentralized Data Ownership

The concept of decentralized data ownership describes an architectural model in which data is not owned by a specific domain (department or business partner) but is freely shared with other business domains.

In the data mesh model, data is not owned or controlled by the people storing it – rather, it is stored and managed by the department or business partner, understanding that the data is meant to be shared.

The goal of the department or partner storing the data should be to offer it in a way that is easy to access and easy to work with.

The Self-Service Platform

The data mesh self-service platform, part of the architectural design, supports functionality from storage and processing to the data catalog. The self-service platform is an essential feature. The host or provider should supply a development platform that domain engineers can use for integrating the platform into their domain.

The model supports the use of autonomous domains. A “network” is a group of computers capable of communicating with each other and is needed to create a domain. A domain describes workstations, devices, computers, and database servers sharing data by way of network resources.

The self-service platform must be domain-agnostic (capable of working with multiple data domains) for the system to work. This allows each domain to be customized as needed. Additionally, the domain’s data engineering teams have the freedom to develop and design solutions for their specific issues. This design provides both flexibility and efficiency.

According to Zhamak Dehghani, the creator of the data mesh model, useful features for the data catalog include:

  • Data governance and standardization
  • Encryption for the data, both at rest and in motion
  • Data discovery, catalog registration, and publishing
  • Data schema
  • Data production lineage
  • Data versioning
  • Data quality metrics
  • Data monitoring, alerting, and logging

Monolithic Data Architectures vs Data Mesh Architecture

A good example of monolithic data architectures is a relational database management system (RDBMS) using a SQL database. The word monolithic means “all in one piece” rather than “too large and unable to be changed.” The phrase ‘monolithic data architectures’ describes a database management system using a variety of integrated software programs that work together to process data. With this design, data is not typically available for sharing with other organizations.

On the other hand, data mesh promotes data democratization and data sharing by allowing data-driven consumers to access data across all associated organizations. This results in more businesses making a profit from the same data.

A data mesh is decentralized and supports data owners sharing their data, being responsible for their own domains, and handling their own data products and pipelines. Sharing in the data mesh includes making data available in a user-friendly, easily consumable form.

The data mesh supports near-real-time data sharing because the data transmitted between domains uses a “change data capture” (CDC) mechanism.

Data-as-a-Product

The data-as-a-product principle is an important foundation of the data mesh model and is philosophically opposed to data silos. The data mesh philosophy supports sharing data, and the purpose of a data silo is to isolate data. Data silos can be avoided through the use of cross-domain governance (per the federation) and semantic linking of data.

Data-as-a-product (as opposed to data-as-a-service) is used for decision making, developing personalized products, and fraud detection. Data-as-a-service tends to focus more on insights and strategy. Features such as trustworthiness, discoverability, and understandability are necessary for data to be treated as a product.

Preventing Data Silos

Data mesh systems eliminate the use of data silos. Data silos are data collections within an organization that have become isolated. The data they contain is typically available to one department but cannot be accessed by other parts of the business. This gets in the way of good decision-making.

Silos are dangerous because they limit management’s understanding of the business, effectively blocking useful information.

Improved Data Analytics

In the last decade, the use of data analytics has increased steadily. Consequently, businesses are continuously attempting to improve the quality of their data. The data mesh model offers improved data collection and a remarkably efficient way of storing and managing data. It offers clean, accurate data for data analytics.

Data Pipelines

Data pipelines are an important part of the data mesh architectural model. As organizations take on increasingly complex analytic projects, data pipelines can assist in supplying quality data.

The data mesh model supports the total customization of data pipelines.

A data pipeline is made up of a data source, a series of processing steps, and a destination. If the desired data is not located within the data platform, then it is collected at the beginning of the pipeline. After the collection, a number of steps are taken, with each step delivering an output that becomes the input for the next step.

A data pipeline processes data between the initial ingestion source and the final destination. Steps that are common in a data pipeline include:

  • Data transformation
  • Filtering
  • Augmentation
  • Enrichment
  • Aggregating
  • Grouping
  • Running of algorithms against that data

These pipeline steps can be performed in parallel or in a time-sliced fashion.
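
As a toy illustration of those steps, the sketch below chains filtering, enrichment, and aggregation so that each step’s output becomes the next step’s input (the records and steps are made up):

```python
# Toy pipeline: each step's output becomes the next step's input.
records = [
    {"city": "Austin", "price": 450000},
    {"city": "Austin", "price": 510000},
    {"city": "Boise", "price": 380000},
    {"city": "Boise", "price": None},
]

def filter_valid(rows):            # filtering
    return [r for r in rows if r["price"] is not None]

def enrich_with_tier(rows):        # enrichment / augmentation
    return [{**r, "tier": "high" if r["price"] > 400000 else "standard"} for r in rows]

def aggregate_by_city(rows):       # aggregating / grouping
    totals = {}
    for r in rows:
        totals.setdefault(r["city"], []).append(r["price"])
    return {city: sum(prices) / len(prices) for city, prices in totals.items()}

result = aggregate_by_city(enrich_with_tier(filter_valid(records)))
print(result)  # {'Austin': 480000.0, 'Boise': 380000.0}
```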

Data Catalogs

A data catalog is the organized inventory of data for an organization. Metadata is used to help businesses organize and manage their data. The data catalog also uses metadata to help with data discovery and data governance. Data catalogs scan metadata automatically, allowing the catalog’s data consumers to seek and find their data. This includes information about the data’s availability, quality, and freshness.

Part of a data catalog’s function is to serve different end-users (data analysts, data scientists, business analysts, etcetera) who probably have different goals. A good data catalog will be user-friendly and flexible enough to adapt to its end-user’s needs.

As with the data pipeline, the data catalog supports data governance, offering a more thorough process. Data catalogs use a bottom-up approach to create an agile data governance program. People can use data catalogs to document legal obligations and track the life cycle of data.

Data Observability

Another benefit is data observability. It is a part of the data mesh architecture and part of its strategy. Data observability provides a pulse check on the data’s health and is also considered a best practice for businesses. Data observability uses various tools designed to manage and track an organization’s data reliability and quality.

Databand offers a proactive data observability platform that integrates into the data mesh architecture. The platform allows users to identify anomalies and see trends in the pipeline metadata. It can profile column statistics and explain the causes of unreliable data and its impact.

What is a Modern Data Platform? Understanding the Key Components

Databand
2022-04-06 13:48:03


A modern data platform should provide a complete solution for the processing, analyzing, and presentation of data. It is built as a cloud-first, cloud-native platform, and, normally, can be set up within a few hours. A modern data platform is supported not only by technology, but also by the Agile, DevOps, and DataOps philosophies.

Currently, data lakes and data warehouses are popular storage systems, but each comes with some limitations.

Data lakehouses and data mesh storage systems are two new systems attempting to overcome those limitations, and are showing signs of gaining popularity.

The modern data platform typically includes six foundational layers guided by principles of elasticity and availability.

The Philosophies

DevOps and DataOps have two entirely different purposes, but both are similar to the Agile philosophy, which is designed to accelerate project work cycles.

DevOps is focused on product development, while DataOps focuses on creating and maintaining a distributed data architecture system with the goal of creating business value from data.

Agile is a philosophy for software development that promotes speed and efficiency, but without eliminating the “human” factor. It places an emphasis on face-to-face conversations as a way to maximize communications and emphasizes automation as a way to minimize errors.

Data Ingestion

The process of placing data into a storage system for future use is called data ingestion. In simple terms, data ingestion means moving data taken from other sources to a central location. From there the data can be used for record-keeping purposes, or for further processing and analysis. Both analytics systems and downstream reporting rely on accessible, consistent, and accurate data.

Organizations make business decisions using the data from their analytics infrastructure. The value of their data is dependent on how well it is ingested and integrated. If there are problems during the ingestion process, such as missing data, every step of the analytics process will suffer.

Batch processing vs stream processing

Ingesting data can be done in different ways, and the way a particular data ingestion layer is designed can be based on different processing models. Data can come from a variety of distinct sources, ranging from SaaS platforms to the internet of things to mobile devices. A good ingestion model acts as a foundation for an efficient data strategy, and organizations normally choose the model best-suited for the circumstances.

Batch processing is the most common form of data ingestion, but it is not designed to handle data in real time. Instead, it collects and groups source data into batches, which are then sent to the destination.

Batch processing may be initiated on a simple schedule, or it may be activated when certain conditions are met. It is often used when real-time data is not needed, because it is usually easier and less expensive than streaming ingestion.

Real-time processing (also referred to as streaming or stream processing) does not group data. Instead, data is obtained, transformed, and loaded as soon as it is recognized. Real-time processing is more expensive because it requires the system to constantly monitor data sources and accept new information automatically.
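
A purely illustrative Python sketch of the difference: the batch path collects records and processes them as a group, while the streaming path handles each record as soon as it arrives. The functions and sample data are hypothetical.

    def process(record):
        # Stand-in for the transformation and loading work.
        print("processed", record)

    # Batch ingestion: collect source records, then process them as one group,
    # typically triggered by a schedule or by a size/time condition.
    def batch_ingest(source_records, batch_size=3):
        batch = []
        for record in source_records:
            batch.append(record)
            if len(batch) >= batch_size:
                for r in batch:
                    process(r)
                batch.clear()
        for r in batch:                    # flush any remainder
            process(r)

    # Streaming ingestion: process each record as soon as it is recognized.
    def stream_ingest(source_stream):
        for record in source_stream:
            process(record)                # no grouping, continuous consumption

    batch_ingest(range(5))
    stream_ingest(range(3))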

Data Pipelines

Until recently, data ingestion models used an ETL (extract, transform, load) procedure to take data from its source, reformat it, and then transport it to its destination. This made sense when businesses had to use expensive in-house analytics systems, since doing the prep work, including transformations, before delivery lowered costs.

That situation has changed. Modern cloud data warehouses (Snowflake, Google BigQuery, Microsoft Azure, and others) can now scale their computing and storage resources cost-effectively. These improvements allow the preload transformation steps to be dropped, with raw data delivered straight to the data warehouse.

At this point, the data can be transformed with SQL run inside the data warehouse at analysis time. This new processing arrangement has changed ETL into ELT (extract, load, transform).

Instead of extracting the data and then transforming it, ELT transforms the data after it is already in the cloud data warehouse.
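
As a small, self-contained illustration of that ordering (load raw data first, transform inside the warehouse with SQL), here is a sketch that uses Python's built-in SQLite as a stand-in for a cloud data warehouse. The table and column names are hypothetical.

    import sqlite3

    conn = sqlite3.connect(":memory:")     # stand-in for a cloud data warehouse

    # Extract + Load: raw data lands in the warehouse with no preload transformation.
    conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, country TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                     [(1, 1250, "us"), (2, 990, "US"), (3, 4400, "de")])

    # Transform: expressed in SQL and run inside the warehouse, after loading.
    rows = conn.execute("""
        SELECT UPPER(country) AS country,
               SUM(amount_cents) / 100.0 AS revenue
        FROM raw_orders
        GROUP BY UPPER(country)
        ORDER BY country
    """).fetchall()

    print(rows)                            # [('DE', 44.0), ('US', 22.4)]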

Data Transformation

Data transformation deals with changing the values, structure, and format of data, which is often necessary for data analytics projects. Data can be transformed at one of two stages in a data pipeline: before it arrives at its storage destination, or after. Organizations still using on-premises data warehouses will normally use an ETL process.

Today, many organizations are using cloud-based data warehouses. These can scale computing and storage resources as needed. The ability of the cloud to scale allows businesses to bypass the preload transformations and send raw data into the data warehouse. The data is transformed after arriving, using an ELT process, typically when answering a query.

There are various advantages to transforming data:

  • Usability – Too many organizations sit on large amounts of unusable, unanalyzed data. Standardizing data and putting it into the right structure allows your data team to generate business value from it.
  • Data quality – Raw data often arrives with missing values, poorly formatted variables, null rows, etcetera. Transformation is an opportunity to catch and correct these defects, improving data quality.
  • Better organization – Transformed data is easier to process for both people and computers.
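
To make these advantages concrete, here is a minimal sketch of the kind of cleanup a transformation step might perform, written with pandas. The column names, formats, and rules are hypothetical.

    import pandas as pd

    # Hypothetical raw extract: inconsistent values, nulls, unusable as-is.
    raw = pd.DataFrame({
        "signup_date": ["2022-01-03", "2022-01-05", None],
        "country": ["us", "US ", "de"],
        "revenue": ["10.5", None, "7"],
    })

    transformed = (
        raw
        .assign(
            country=raw["country"].str.strip().str.upper(),              # standardize values
            revenue=pd.to_numeric(raw["revenue"], errors="coerce"),       # enforce a numeric type
            signup_date=pd.to_datetime(raw["signup_date"], errors="coerce"),
        )
        .dropna(subset=["signup_date"])                                   # drop rows that cannot be parsed
    )

    print(transformed)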

Data Storage and Processing

Currently, the two most popular storage formats are data warehouses and data lakes. Two newer formats, the data lakehouse and the data mesh, are gaining in popularity. Modern data storage systems are focused on using data efficiently.

The Data Warehouse

Cloud-based data warehouses have been the preferred data storage system for a number of years because they can optimize computing power and processing speeds. They were developed much earlier than data lakes and can be traced back to the 1990s when databases were used for storage. The early versions of data warehouses were in-house and had very limited storage capacity. In 2013, many data warehouses shifted to the cloud and gained scalable storage.

The Data Lake

Data lakes were originally built on Hadoop, were scalable, and were designed for on-premises use. Hadoop, which Yahoo helped develop and release as open source, became a top-level Apache Software Foundation project in January 2008. Unfortunately, the Hadoop ecosystem is complex and difficult to work with. Data lakes began shifting to the cloud around 2015, making them much less expensive and much more user-friendly.

Using a combination of data lakes and data warehouses to minimize their limitations has become a common practice.

The Data Lakehouse

Data lakes have problems with parsing data. They were originally designed to collect data in its natural format, without enforcing a schema, so that researchers could gain more insights from a broad range of data. Unfortunately, data lakes can become data swamps filled with old, inaccurate, and useless information, making them much less effective.

Data warehouses are designed for managing structured data with clear and defined use cases.

For the data warehouse to function properly, the data must be collected, reformatted, cleaned, and uploaded to the warehouse. Some data, which cannot be reformatted, may be lost.

The data lakehouse has been designed to merge the strengths of data warehouses and lakes.

Data lakehouses are a new form of data management architecture. They merge the flexibility, cost-efficiency, and scaling abilities of data lakes with the ACID transactions and data management features of data warehouses.

Data lakehouses support business intelligence and machine learning. One of the data lakehouse’s strengths is its use of metadata layers. It also uses a new query engine, designed for high-performance SQL searches.

Data Mesh

Data mesh can be quite useful for organizations that are expanding quickly and need scalability for their data storage.

Data mesh, unlike data warehouses, lakes, and lakehouses, is decentralized. Decentralized data ownership is an architectural model in which each domain (a business partner or department) owns and is responsible for the data it produces, rather than handing it off to a central team, and shares that data freely with other domains.

In the data mesh model, the domain that stores the data is responsible for it. The data is stored and organized by the business partner or department with the understanding that it will be shared, which means all data within the data mesh system should be exposed in a uniform, interoperable format.

Data mesh systems can be useful for businesses supporting multiple data domains. Within the data mesh design, there is a data governance layer and a layer of observability. There is also a universal interoperability layer.
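
Here is a minimal, hypothetical Python sketch of that idea: each domain owns and serves its own data, but every domain exposes it through the same interface, which is what keeps the mesh interoperable. The domain names and fields are illustrative only.

    # Each domain owns its data but exposes it through one uniform interface,
    # which is what makes domain-owned data products interoperable across the mesh.
    class DataProduct:
        def __init__(self, domain, records):
            self.domain = domain           # the owning domain is responsible for this data
            self._records = records

        def read(self):
            # Uniform contract: every domain serves its records as a list of dicts.
            return list(self._records)

    mesh = {
        "sales": DataProduct("sales", [{"order_id": 1, "amount": 99.0}]),
        "marketing": DataProduct("marketing", [{"campaign": "spring", "clicks": 1200}]),
    }

    # A consumer in any domain reads other domains' data products the same way.
    for name, product in mesh.items():
        print(name, product.read())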

Data Observability

Data observability has recently become a hot topic. It describes the ability to observe the state and health of data, and it covers a number of activities and technologies that, combined, allow users to identify and resolve data issues in near real time.

Data observability platforms can be used with data warehouses, data lakes, data lakehouses, and data mesh.

It should be noted that Databand has developed a proactive data observability platform capable of catching bad data before it causes damage.

Observability allows teams to answer specific questions about what is taking place behind the scenes in highly distributed systems. Observability can show where data is moving slowly and what is broken.

Managers and teams can be sent alerts about potential problems so they can proactively solve them. (While problem prediction can be helpful, it will not catch every issue, nor should it be expected to. Think of problem predictions as helpful, but not a guarantee.)

To be useful, data observability needs to include the following features (a minimal sketch of such a check follows the list):

  • SLA Tracking – This feature measures pipeline metadata and data quality against pre-defined standards.
  • Monitoring – A dashboard is provided, showing the operations of your system or pipeline.
  • Logging – Historical records (tracking, comparisons, analysis) of events are kept for comparison with newly discovered anomalies.
  • Alerting – Warnings are sent out for both anomalies and expected events.
  • Analysis – An automated detection process that adapts to your system.
  • Tracking – Offers the ability to track specific events.
  • Comparisons – Provides historical context along with anomaly alerts.
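
As a minimal sketch of what SLA tracking and anomaly alerting can look like in practice, here is a hypothetical check that compares one pipeline run's metadata against a baseline built from previous runs. The metric names, thresholds, and history are assumptions, not a description of any specific tool.

    from statistics import mean, stdev

    # Hypothetical metadata logged from previous pipeline runs.
    history = [{"row_count": 1000, "duration_s": 60},
               {"row_count": 1020, "duration_s": 62},
               {"row_count": 990,  "duration_s": 58}]

    def check_run(current, history, sla_duration_s=120, z_threshold=3.0):
        alerts = []
        # SLA tracking: measure the run against a pre-defined standard.
        if current["duration_s"] > sla_duration_s:
            alerts.append(f"SLA breach: run took {current['duration_s']}s")
        # Anomaly detection: compare the row count against the historical baseline.
        counts = [h["row_count"] for h in history]
        mu, sigma = mean(counts), stdev(counts)
        if sigma and abs(current["row_count"] - mu) > z_threshold * sigma:
            alerts.append(f"Anomalous row count: {current['row_count']} vs baseline ~{mu:.0f}")
        return alerts

    print(check_run({"row_count": 40, "duration_s": 61}, history))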

For many organizations, observability is siloed, meaning only certain departments can access the data. (This should not happen in a data mesh system, which philosophically requires data to be shared, and siloing is generally discouraged in most storage and processing systems.) Teams collect metadata on the pipelines they own.

Business Intelligence & Analytics

The phrase “business intelligence” was used in 1865 in the Cyclopædia of Commercial and Business Anecdotes, which described how Sir Henry Furnese, a banker, profited from the information he gathered by acting on it before his competition could.

Currently, a great deal of business information is gathered from business analytics, as well as data analytics. Analytics is used to generate business intelligence by transforming data into understandable insights which can help to make tactical and strategic business decisions. Business intelligence tools can be used to access and analyze data, providing researchers with detailed intelligence.

Data Discovery

Data discovery involves collecting and evaluating data from different sources. It is often used to gain an understanding of the trends and patterns found in the data. Data discovery is sometimes associated with business intelligence because it can bring together siloed data for analysis.

Data discovery includes connecting a variety of data sources. It can clean and prepare data, and perform analytics. Inaccessible data is essentially useless data, and data discovery makes it useful.

Data discovery is about exploring data with visual tools which can help business leaders detect new patterns and anomalies.

What’s Coming Next?

If you search for “Modern Data Platform Trends” in Google, you’ll see many articles discussing trends on what’s next for the data platform. Topics like metadata management, building a metrics layer, and reverse ETL are getting a lot of focus.

However, data observability comes up in virtually all of these articles. Data-driven companies can’t afford to constantly question whether the data they consume is reliable and trustworthy.

How Google (GCP) Ensures Delivery Velocity in their Data Stack

Databand
2022-04-05 14:53:38

How Google (GCP) Ensures Delivery Velocity in their Data Stack

Data stacks enable data integration throughout the entire data pipeline for trustworthy consumption. But how can companies ensure their data stack is both modern and reliable? In this blog post, we discuss these issues, as well as how GCP manages its data stack.

This blog post is based on a podcast where we hosted Sudhir Hasbe, senior director of product management for all data analytics services at Google Cloud.

You can listen to the entire episode below or here.

What is a Data Stack?

In today’s world, we have the capacity and ability to track almost any piece of data. But attempting to find relevant information in such huge volumes of data is not always so easy to do. A data stack is a tool suite for data integration. It transforms or loads data into a data warehouse, enables transformation through an engine running on top and provides visibility for building applications.

As companies evolve, they often move towards a modern data stack that is based on a cloud-based warehouse. Such a stack enables real-time, personalized experiences or predictions on SaaS applications by supporting real-time events and decision-making.

How do Modern Data Stacks Support Real-Time Events?

To support real-time events, modern data stacks include the following components (a simplified sketch follows the list):

  • A system for collecting events, like Kafka
  • A processing system, like a streaming analytics system or a Spark streaming solution
  • A serving layer
  • A data lake or staging environment where raw data can be pushed and transformed before it gets loaded into a data warehouse
  • A data warehouse for structured data that enables creating machine learning models
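
The sketch below strings these components together in the simplest possible way, with an in-memory queue standing in for an event collector such as Kafka and plain lists standing in for the data lake and warehouse. It is illustrative only, not a reference architecture.

    from queue import Queue

    event_bus = Queue()        # stand-in for an event collection system such as Kafka
    data_lake = []             # staging area for raw events
    warehouse = []             # structured, queryable destination

    # 1. Collect events.
    for i in range(3):
        event_bus.put({"user": i, "action": "click"})

    # 2. Processing layer: consume, stage, and transform raw events.
    while not event_bus.empty():
        event = event_bus.get()
        data_lake.append(event)                                    # raw data pushed to the lake
        warehouse.append({"user_id": event["user"], "clicks": 1})  # structured load

    # 3. Serving layer: answer a simple question from the warehouse in near real time.
    total_clicks = sum(row["clicks"] for row in warehouse)
    print("events staged:", len(data_lake), "| total clicks:", total_clicks)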

Such an environment is very complex and requires controls to ensure high data quality. Otherwise, bad data will be pulled into different systems and create a poor customer experience.

Therefore, it is important to ensure data quality is taken into consideration as early as the design phase of the data stack.

How Can You Ensure Data Quality in the Data Stack?

As companies rely on more and more data sources, managing them becomes more complex. Therefore, to ensure data quality throughout the entire pipeline, it becomes important to understand the source of issues, i.e., where they originally come from.

Data engineers who only look at the tables or dashboards downstream will waste a lot of time trying to find where issues are coming from. They might be able to catch a problem, but by tracing it back to the source they will be able to debug and solve the issue much more quickly.

By shifting observability requirements left, data engineers can ensure data quality as early as ingestion and enable much higher delivery velocity.
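
A minimal, hypothetical example of shifting left: validate schema and basic quality at ingestion time, before records ever reach downstream tables and dashboards. The expected schema and rules here are assumptions.

    # Validate records at ingestion ("shift left"), before they reach downstream tables.
    EXPECTED_SCHEMA = {"user_id": int, "event": str, "ts": float}

    def validate_at_ingestion(record):
        errors = []
        for field, expected_type in EXPECTED_SCHEMA.items():
            if field not in record:
                errors.append(f"missing field: {field}")
            elif not isinstance(record[field], expected_type):
                errors.append(f"bad type for {field}: {type(record[field]).__name__}")
        return errors

    good = {"user_id": 1, "event": "login", "ts": 1648742400.0}
    bad = {"user_id": "1", "event": "login"}

    print(validate_at_ingestion(good))   # []
    print(validate_at_ingestion(bad))    # ['bad type for user_id: str', 'missing field: ts']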

How Google (GCP) Ensures Delivery Velocity in the Data Stack

One of the main pain points data teams have when building and operationalizing a data stack is how to ensure delivery velocity.

This is true for both on-prem and cloud-native stacks, but becomes more pressing when companies are required to support real-time events both quickly and with high data quality.

To ensure delivery velocity at GCP, the team implements the following solutions.

End-to-end Pipelines

At GCP, one of the most critical components in the data stack is end-to-end pipelines. Thanks to these pipelines, they can ensure real-time events from various sources are available in their data warehouse, BigQuery. To cover streaming analytics use cases, data is seamlessly connected to BigTable and DataFlow.

Consistent Storage

At GCP, all storage capabilities are unified and consistent in a single data lake, enabling different types of processing on top of it. Thus, each persona can use their own skills and tools to consume it.

For example, a data engineer could use Java or Python, a data scientist could use notebooks or TensorFlow and an analyst could use other tools to analyze the data.

The Future of Data Management in the Data Stack

Here are three interesting predictions regarding the future of data stacks.

Leverage AI and ML to Improve Data Quality

One of the most interesting ideas when discussing the future of data management is about implementing AI and ML to improve data quality. Often, machine learning is used to improve business metrics.

At GCP, the team is implementing ML on BigQuery to identify failures, finding it is the only way to detect issues at scale. While this practice hasn’t yet been widely adopted by companies, it is expected to be in the future.

Less Manual, More Automation

Automation is predicted to be widely adopted, as a means for managing the huge volumes of data of the future.

Today, manual management of legacy data platforms is complex, since it is based on components like Hadoop and Spark running alongside a data warehouse, with manually defined rules and Jira integrations to enable more personas to run queries.

The result is often hundreds of tickets with false alarms. In the future, automation will cover:

  • Metric collection
  • Alerts (without having to manually define rules)
  • New data sources and products

It will include automatic data discovery through automated cataloging of all information, including centralized metadata and automated lineage tracking across all systems.

This will reduce the number of errors and streamline the process.

Persona Changes are Coming

Finally, we predict that the personas who process and consume the data will change as well. Data consumption will not be limited to data engineers, data scientists and analysts, but will be open to all business employees.

As a result, in the future, storage will just be a price-performance discussion for customers rather than capability differentiation.

The future sounds bright! Databand provides data engineers with observability into data sources straight from the source. To get a free trial, click here.

Why data engineers need a single pane of glass for data observability

Databand
2022-03-11 15:13:43

Why data engineers need a single pane of glass for data observability

Data engineers manage data from multiple sources and throughout pipelines. But what happens when a data deficiency occurs? Data observability provides engineers with the information and recommendations needed to fix data issues, without them having to comb through huge piles of data. Read on about what is observability and the best way to implement it.

If you’re interested in learning more, you can listen to the podcast this blog post is based on below or here.

What is Data Observability?

In today’s world, we have the capacity and ability to track almost any piece of data. But attempting to find relevant information in such huge volumes of data is not always easy. Data observability is the set of techniques and methodologies that bring the right, relevant level of information about data to data engineers at the right time, so they can understand problems and solve them.

Data observability provides data engineers with metrics and recommendations that help them understand how the system is operating. Through observability, data engineers can better set up systems and pipelines, observe the data as it flows through the pipeline, and investigate how it affects data that is already in their warehouse. In other words, data observability makes it easier for engineers to access their data and act upon any issues that occur.

With data observability, data engineers can answer questions like:

  • Are my pipelines running with the correct data?
  • What happens to the data as it flows through the pipelines?
  • What does my data look like once it’s in the warehouse, data lake, or lakehouse?

Why We Need Data Observability

Achieving observability is never easy, but ingesting data from multiple sources makes it even harder. Enterprises often work with hundreds of sources, and even nimble startups rely on a considerable number of data sources for their products. Yet, today’s data engineering teams aren’t equipped with tools and resources to manage all that complexity.

As a result, engineers are finding it difficult to ensure the reliability and quality of the data that is coming in and flowing through the pipelines. Schema changes, missing data, null records, failed pipelines, and more – all impact how the business can use data. If engineers can’t identify and fix data deficiencies before they make a business impact, the business can’t rely on the data.

Achieving Data Observability with a Single Pane of Glass

The data ecosystem is fairly new and it is constantly changing. New open source and commercial solutions emerge all the time. As a result, the modern data stack is made up of multiple point solutions for data engineers. These include tools for ETLs, operational analytics, data warehouses, dbt, extraction and loading tools, and more. This fragmentation makes it hard for organizations to manage and monitor their data pipeline.

A recent customer in the cryptocurrency industry said this:

“We spend a lot of time fixing operational issues due to fragmentation of our data stack. Tracking data quality, lineage, and schema changes becomes a nightmare.”

The one solution missing from this stack is a single overarching operating system for orchestrating, integrating, and monitoring the stack, i.e., a single tool for data observability. A single pane of glass for observability would enable engineers to look at various sources of data in a single place and see what has changed. They could identify changed schemas or faulty columns, and then build automated checks to ensure errors don’t recur.

For engineers, this is a huge time saver. For organizations, this means they can use their data for making decisions.

As we see the data ecosystem move from fragmentation to consolidation, here are a few features a data observability system should provide data engineers with (a minimal sketch follows the list):

  • Visualization – enabling data engineers to see data reads, writes and lineage throughout the pipeline and the impact of new data on warehouse data.
  • Supporting all data sources – showing data from all sources, and showing it as early as ingestion
  • Supporting all environments – observing all environments, pre-prod and prod, existing and new
  • Alerts – notifying data engineers about any anomalies, missed data deliveries, irregular volumes, pipeline failures or schema changes and providing recommendations for fixing issues
  • Continuous testing – running through data source, tables and pipelines multiple times a day, and even in real-time if your business case requires it (like in healthcare or gaming)
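
To illustrate the single-pane-of-glass idea, here is a minimal, hypothetical sketch of one runner that applies the same checks across every source and environment and collects the results in one place. The sources, metrics, and thresholds are assumptions.

    # One loop, one report: the same checks run across all sources and environments,
    # and the results land in a single place. Sources and thresholds are hypothetical.
    sources = {
        ("prod", "orders"):     {"row_count": 10_250, "null_rate": 0.01},
        ("prod", "web_events"): {"row_count": 0,      "null_rate": 0.00},
        ("staging", "orders"):  {"row_count": 9_800,  "null_rate": 0.20},
    }

    def run_checks(stats):
        alerts = []
        if stats["row_count"] == 0:
            alerts.append("missed data delivery")
        if stats["null_rate"] > 0.05:
            alerts.append(f"high null rate: {stats['null_rate']:.0%}")
        return alerts

    report = {key: run_checks(stats) for key, stats in sources.items()}
    for (env, source), alerts in report.items():
        print(env, source, alerts or "OK")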

Databand provides unified visibility for data engineers across all data sources. Learn more here.

Monitoring for data quality issues as early as ingestion: here’s why

Databand
2022-01-24 15:17:13

Monitoring for data quality issues as early as ingestion: here’s why

Maintaining data quality is challenging. Data is often unreliable and inconsistent, especially when it flows from multiple data sources. To deal with quality issues and prevent them from impacting your product and decision-making, you need to monitor your data flows. Monitoring helps identify schema changes, discover missing data, catch unusual levels of null records, fix failed pipelines, and more. In this blog post, we will explain why we recommend monitoring data starting at the source (ingestion), list three potential use cases for ingestion monitoring, and finish off with three best practices for data engineers to get started.

This blog post is based on the first episode of our podcast, “Why Data Quality Begins at the Source”, which you can listen to below or here.

Where Should You Monitor Data Quality?

Data quality monitoring is a fairly new practice, and different tools offer different monitoring capabilities across the data pipeline. While some tools monitor data quality where the data rests, i.e., at the data warehouse, at Databand we think it’s critical to monitor data quality as early as ingestion, not just at the warehouse level. Let’s look at a few reasons why.

4 Reasons to Monitor Data Quality Early in the Data Pipeline

1. Higher Probability of Identifying Issues

Erroneous or abnormal data affects all the other data and analytics downstream. Once corrupt data has been ingested and flows into the data lake or warehouse, it might already be mixed with healthy data and used in analyses. This makes it much more difficult to identify the errors and their source, because the dirty data can be “washed out” in the higher volumes of data sitting at rest.

In fact, the ability to identify issues is based on an engineer or analyst knowing what the expected data results should be, recognizing a problematic anomaly, and diagnosing that this anomaly is the result of data and not a business change. When the corrupt data is such a small percentage of the entire data lake, this becomes even harder.

Why take the chance of overlooking errors and problems that could impact the product, your users, and decision making? By monitoring early in the pipeline, many issues can be avoided because you are monitoring a more targeted sample of data, and therefore able to create a more precise baseline for when data looks unusual.

2. Creating Confidence in the Warehouse

Analysts and additional stakeholders rely on the warehouse to make a wide variety of business and product decisions. Trusting warehouse data is essential for business agility and for making the right decisions. If data in the warehouse is “known” to have issues, stakeholders will not use it or trust it. This means the organization is not leveraging data to the full extent.

If the data warehouse is the heart of the customer-facing product, i.e the product relies almost entirely on data, then corrupt data could jeopardize the entire product’s adoption in the market.

By assuring data quality before it arrives at the warehouse and the main analytical system, teams can improve confidence in that “trusted layer.”

3. Ability to Fix Issues Faster

By identifying data issues faster, data engineers have more time to react. They can identify causality and lineage, and fix the data or source to prevent any harmful impact that corrupt data could have. Trying to identify and fix full-blown issues in the product or after decision-making is much harder to do.

4. Enabling Data Source Governance

By analyzing, monitoring and identifying the point of inception, data engineers can identify a malfunctioning source and act to fix it. This provides better governance over sources, in real-time and in the long-run.

When Should You Monitor Data Quality from Ingestion?

We recommend monitoring data quality across the pipeline, both at ingestion and at rest. However, you need to start somewhere… Here are the top three use cases for prioritizing monitoring at ingestion:

  • Frequent Source Changes – When your business relies on data sources or APIs where data structure frequently changes, it is recommended to continuously monitor them. For example, in the case of a transportation application that pulls data from the constantly changing APIs of location data, user tracking information, etc.
  • Multiple External Data Sources – When your business’s output depends on analyzing data from dozens or hundreds of sources. For example, a real-estate app that provides listings based on data from offices, municipalities, schools, etc.
  • Data-Driven Products – When your product is based on data and each data source has a direct impact on the product. For example, navigation applications that pull data about roads, weather, transportation, etc.

Getting Started with Data Quality Monitoring

As mentioned before, data quality monitoring is a relatively new practice. Therefore, it makes sense to implement it gradually. Here are three recommended best practices:

1. Determine Quality Layers

Data changes across the pipeline, and so does its quality. Divide your data pipeline into steps, e.g., the warehouse layer, the transformation layer, and the ingestion layer. Understand that data quality means different things at each of these stages, and prioritize the layers that have the most impact on your business.

2. Monitor Different Quality Depths

When monitoring data, there are different quality aspects to review. Start by reviewing metadata to ensure the data structure is correct and that all the data arrived. Once the metadata has been verified, move on to the explicit business-related aspects of the data, which rely on domain knowledge.
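
A brief, hypothetical sketch of those two depths: a metadata-level check on structure and completeness first, followed by a business-level check that needs domain knowledge. The fields and rules are assumptions.

    # First depth: metadata checks (structure and completeness).
    def metadata_checks(records, required_fields=("order_id", "amount")):
        issues = []
        if not records:
            issues.append("no data arrived")
        for i, r in enumerate(records):
            missing = [f for f in required_fields if f not in r]
            if missing:
                issues.append(f"record {i} missing fields: {missing}")
        return issues

    # Second depth: business-rule checks that require domain knowledge.
    def business_checks(records):
        return [f"record {i} has a non-positive amount"
                for i, r in enumerate(records) if r.get("amount", 0) <= 0]

    batch = [{"order_id": 1, "amount": 25.0}, {"order_id": 2, "amount": -3.0}]

    issues = metadata_checks(batch)
    if not issues:                         # go deeper only once the structure is verified
        issues = business_checks(batch)
    print(issues)                          # ['record 1 has a non-positive amount']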

3. See Demos of Different Data Monitoring Tools

Once you’ve mapped out your priorities and pain points, it’s time to find a tool that can automate this process for you. Don’t hesitate to see demos of different tools and ask the hard questions about data quality assurance. For example, to see a demo of Databand, click here. To learn more about data quality and hear the entire episode this blog post was based on, visit our podcast, here.