Intro to Data Mesh
A data mesh is a form of platform architecture.
The goal of the data mesh in organizing a business’s platforms is to maximize the value of analytical data. This is done by minimizing the time needed to access quality data. A well-designed data mesh lets researchers quickly access data from any source connected to the data mesh system. The data mesh model may replace data lakes as the most popular way to store and retrieve data.
Three components support data mesh architecture: domain-supported data pipelines, data sources, and data infrastructure. These components are overlaid with layers of observability, data governance, and universal interoperability.
Data mesh systems are useful for businesses with multiple data domains.
Many companies have data stored in different databases and formats, causing research and analytics problems. Some companies have attempted to resolve these problems by creating a single data warehouse or central data lake and copying all data into it. This approach has its own problems, such as working from an inaccurate copy of the original data or from outdated information.
Data mesh can be quite useful for organizations that are expanding quickly and need scalability for their data storage.
Data mesh architecture allows data access from a number of locations rather than one central data warehouse or data lake.
(It should be noted that there are situations where it is completely appropriate to build a central data lake as an additional part of the data mesh system.)
The Data Mesh Philosophy
The primary goal of data mesh is to create a system that maximizes the value of analytical data. The data mesh philosophy embraces a constantly changing data landscape, including a growing number of data sources, the need to transform data from one format to another, and the need to respond quickly to change.
Four principles support the data mesh model:
- Federated computational governance.
- Domain-oriented and decentralized data ownership, as well as architecture.
- A self-serve platform as part of the data infrastructure.
- Data-as-a-product rather than a by-product.
Data mesh uses a system called federated computational governance. A federated model includes a cross-domain agreement describing which parts of the governance are managed by the data domains and which are handled by the provider. It is an autonomous system that is normally built and maintained by independent data teams for each domain. (Independent data teams can be made up of in-house staff or outside contractors.) To get the maximum value, interoperability between data domains is a necessity.
The “federation” is a group of people made up of domain owners and the data mesh provider. Working within a framework of global rules, they decide how best to govern the data mesh system.
Ideally, the governance federation will establish a data governance program that is common for all the domain owners. Domain owners can still develop their own data governance program, but an agreement providing a base level of data quality for the group as a whole will provide more trustworthy distributed data.
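One way to picture the relationship between the shared agreement and a domain's own program is as a baseline that each domain may tighten but not loosen. The following is a minimal sketch under that assumption. The rule names, thresholds, and the DomainPolicy class are hypothetical and are not taken from any data mesh product.

```python
from dataclasses import dataclass, field

# Hypothetical baseline agreed by the federation. The rule names and
# thresholds are illustrative assumptions only.
GLOBAL_BASELINE = {"max_null_rate": 0.05, "require_owner": True}


@dataclass
class DomainPolicy:
    domain: str
    overrides: dict = field(default_factory=dict)

    def effective(self) -> dict:
        """Merge domain overrides onto the baseline without loosening it."""
        policy = dict(GLOBAL_BASELINE)
        for rule, value in self.overrides.items():
            if rule not in policy:
                policy[rule] = value  # extra rule specific to this domain
            elif isinstance(value, bool):
                policy[rule] = policy[rule] or value  # a required rule stays on
            else:
                policy[rule] = min(policy[rule], value)  # only stricter limits win
        return policy
```

A domain that asks for a stricter null-rate limit gets it, while a domain that tries to relax the baseline simply ends up with the baseline.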
Decentralized Data Ownership
The concept of decentralized data ownership describes an architectural model in which data is not held by a single central team. Instead, each domain (a department or business partner) is responsible for its own data and shares it freely with other business domains.
In the data mesh model, data is not hoarded or locked away by the people storing it. Rather, it is stored and managed by the department or business partner with the understanding that the data is meant to be shared.
The goal of the department or partner storing the data should be to offer it in a way that is easy to access and easy to work with.
The Self-Service Platform
The data mesh self-service platform, part of the architectural design, supports functionality from storage and processing to the data catalog. The self-service platform is an essential feature. The host or provider should supply a development platform that domain engineers can use for integrating the platform into their domain.
The model supports the use of autonomous domains. A “network” is a group of computers capable of communicating with each other and is needed to create a domain. A domain describes workstations, devices, computers, and database servers sharing data by way of network resources.
The self-service platform must be domain-agnostic (capable of working with multiple data domains) for the system to work. This allows each domain to be customized as needed. Additionally, the domain’s data engineering teams have the freedom to develop and design solutions for their specific issues. This design provides both flexibility and efficiency.
According to Zhamak Dehghani, the creator of the data mesh model, useful features for the data catalog include:
- Data governance and standardization
- Encryption for the data, both at rest and in motion
- Data discovery, catalog registration, and publishing
- Data schema
- Data production lineage
- Data versioning
- Data quality metrics
- Data monitoring, alerting, and logging
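Several of the features listed above (catalog registration, schema, lineage, versioning, and quality metrics) can be pictured as fields on a data product record. The sketch below is an illustration only. The DataProduct and Catalog classes and their field names are hypothetical, and a real platform would persist this metadata rather than hold it in memory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    name: str
    domain: str
    version: str
    schema: dict            # column name -> type name
    upstream: tuple = ()    # lineage: products this one is derived from
    quality: dict = field(default_factory=dict)  # e.g. {"null_rate": 0.01}


class Catalog:
    """In-memory registry keyed by (name, version)."""

    def __init__(self):
        self._products = {}

    def publish(self, product: DataProduct) -> None:
        key = (product.name, product.version)
        if key in self._products:
            # Published versions are immutable; changes need a new version.
            raise ValueError(f"{product.name} {product.version} is already published")
        self._products[key] = product

    def versions(self, name: str) -> list:
        # Plain string sort; adequate only for this illustration.
        return sorted(v for (n, v) in self._products if n == name)

    def lineage(self, name: str, version: str) -> tuple:
        return self._products[(name, version)].upstream
```

Refusing to overwrite a published version is a design choice made here so that consumers can rely on a given version never changing underneath them.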
Monolithic Data Architectures vs Data Mesh Architecture
A good example of a monolithic data architecture is a relational database management system (RDBMS) using a SQL database. The word monolithic means “all in one piece” rather than “too large and unable to be changed.” The phrase “monolithic data architectures” describes a database management system using a variety of integrated software programs that work together to process data. With this design, data is not typically available for sharing with other organizations.
On the other hand, data mesh promotes data democratization and data sharing by allowing data-driven consumers to access data across all associated organizations. This results in more businesses making a profit from the same data.
A data mesh is decentralized and supports data owners sharing their data, being responsible for their own domains, and handling their own data products and pipelines. Sharing in the data mesh includes making their data available in a user-friendly and easily consumable form.
The data mesh supports near-real-time data sharing because data is transmitted between domains using a “change data capture” (CDC) mechanism, which passes along each change made at the source rather than recopying entire datasets.
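The idea behind change data capture can be sketched as a stream of change events that a consuming domain replays against its own copy of the data. The event format below (seq, op, key, row) is a hypothetical one chosen for illustration. Real CDC tools each define their own formats.

```python
def apply_change(replica: dict, event: dict) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def replay(replica: dict, events: list) -> dict:
    """Apply events in sequence order so the replica converges on the source."""
    for event in sorted(events, key=lambda e: e["seq"]):
        apply_change(replica, event)
    return replica
```

Because only the changes travel between domains, the consumer's copy can stay close to the source without repeated full reloads.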
The data-as-a-product principle is an important foundation of the data mesh model and is philosophically opposed to data silos. The data mesh philosophy supports sharing data, and the purpose of a data silo is to isolate data. Data silos can be avoided through the use of cross-domain governance (per the federation) and semantic linking of data.
Data-as-a-product (as opposed to data-as-a-service) is used for decision making, developing personalized products, and fraud detection. Data-as-a-service tends to focus more on insights and strategy. Features such as trustworthiness, discoverability, and understandability are necessary for data to be treated as a product.
Preventing Data Silos
Data mesh systems eliminate the use of data silos. Data silos are data collections within an organization that have become isolated. The data they contain is typically available to one department but cannot be accessed by other parts of the business. This undermines good decision-making.
Silos are dangerous because they limit management’s understanding of the business, effectively blocking useful information.
Improved Data Analytics
In the last decade, the use of data analytics has increased steadily. Consequently, businesses are continuously attempting to improve the quality of their data. The data mesh model offers improved data collection and a remarkably efficient way of storing and managing data. It offers clean, accurate data for data analytics.
Data pipelines are an important part of the data mesh architectural model. As organizations take on increasingly complex analytic projects, data pipelines can assist in supplying quality data.
The data mesh model supports the total customization of data pipelines.
A data pipeline is made up of a data source, a series of processing steps, and a destination. If the desired data is not located within the data platform, then it is collected at the beginning of the pipeline. After the collection, a number of steps are taken, with each step delivering an output that becomes the input for the next step.
A data pipeline processes data between the initial ingestion source and the final destination. Steps that are common in a data pipeline include:
- Data transformation
- Running of algorithms against that data
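The structure described above (a source, a chain of steps where each output feeds the next step, and a destination) can be sketched in a few lines. The function names and the sample "amount" field are assumptions made for illustration.

```python
from functools import reduce


def run_pipeline(source, steps, destination):
    """Collect records, pass them through each step in order, then deliver."""
    records = list(source())
    # Each step receives the previous step's output as its input.
    result = reduce(lambda data, step: step(data), steps, records)
    destination(result)
    return result


def drop_incomplete(rows):
    """Transformation step: discard rows with no amount."""
    return [r for r in rows if r.get("amount") is not None]


def to_cents(rows):
    """Transformation step: add an integer amount in cents."""
    return [{**r, "amount_cents": round(r["amount"] * 100)} for r in rows]
```

A domain team could call run_pipeline with its own source, its own list of steps, and its own destination, which is one simple reading of what pipeline customization means in practice.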
A data catalog is the organized inventory of data for an organization. Metadata is used to help businesses organize and manage their data. The data catalog also uses metadata to help with data discovery and data governance. Data catalogs scan metadata automatically, allowing the catalog’s data consumers to seek and find their data. This includes information about the data’s availability, quality, and freshness.
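As a rough illustration of metadata-driven discovery, the sketch below searches catalog entries by keyword and filters out anything that is not fresh enough. The entry fields (name, description, last_updated) and the find_datasets function are hypothetical and do not reflect any particular catalog product.

```python
from datetime import datetime, timedelta, timezone


def find_datasets(entries, keyword, max_age, now=None):
    """Return names of entries matching the keyword and updated within max_age."""
    now = now or datetime.now(timezone.utc)
    keyword = keyword.lower()
    return [
        e["name"]
        for e in entries
        if keyword in e["description"].lower()
        and now - e["last_updated"] <= max_age
    ]
```

Passing the current time in explicitly, as the optional now parameter allows, makes the freshness check easy to test.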
Part of a data catalog’s function is to serve different end users (data analysts, data scientists, business analysts, and so on) who probably have different goals. A good data catalog will be user-friendly and flexible enough to adapt to its end users’ needs.
As with the data pipeline, the data catalog supports data governance, offering a more thorough process. Data catalogs use a bottom-up approach to create an agile data governance program. People can use data catalogs to document legal obligations and track the life cycle of data.
Another benefit is data observability. It is a part of the data mesh architecture and part of its strategy. Data observability provides a pulse check on the data’s health and is also considered a best practice for businesses. Data observability uses various tools designed to manage and track an organization’s data reliability and quality.
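Two of the simplest health checks of this kind are a null-rate measurement and a row-count comparison against recent history. The sketch below shows both, assuming rows are plain dictionaries. The function names and the default threshold of three standard deviations are illustrative choices, not a description of any specific observability tool.

```python
from statistics import mean, pstdev


def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def is_volume_anomaly(history, today, threshold=3.0):
    """Flag today's row count if it sits far from the historical mean."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        # No historical variation: any difference counts as an anomaly.
        return today != mu
    return abs(today - mu) / sigma > threshold
```

Checks like these would typically run after each pipeline load, with the results feeding the monitoring and alerting layer.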
Databand offers a proactive data observability platform that integrates into the data mesh architecture. The platform allows users to identify anomalies and see trends in the pipeline metadata. It can profile column statistics and explain the causes of unreliable data and its impact.