What Is a Modern Data Platform? | IBM

What is a modern data platform?

A modern data platform is a suite of cloud-first, cloud-native software products that enable the collection, cleansing, transformation and analysis of an organization’s data to help improve decision making.

Today’s data pipelines have become increasingly complex and important for data analytics and making data-driven decisions. A modern data platform builds trust in this data by ingesting, storing, processing and transforming it in a way that ensures accurate and timely information, reduces data silos, enables self-service and improves data quality.

A modern data platform, also referred to as a modern data stack, is composed of five foundational layers: data storage and processing; data ingestion; data transformation; business intelligence (BI) and analytics; and data observability.

The two fundamental principles that govern modern data platforms are:

Availability: Data is readily available in a data lake or data warehouse, which separates storage from compute. Splitting these functions makes it possible to store large amounts of data relatively cheaply.
Elasticity: Compute functions are cloud-based, which allows for automatic scaling. For example, if most data and analytics consumption happens on a certain day and at a certain time, processing can be automatically scaled up for a better customer experience and scaled back down as workload needs decrease.
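
The elasticity principle can be illustrated with a minimal sketch of a scaling rule. The function name `workers_needed`, the per-worker capacity and the worker limits are hypothetical values chosen for illustration; real cloud platforms expose their own autoscaling policies.

```python
import math


def workers_needed(queries_per_minute: int,
                   capacity_per_worker: int = 100,
                   min_workers: int = 1,
                   max_workers: int = 20) -> int:
    """Return how many compute workers a given load calls for.

    All parameters are illustrative assumptions, not values from any
    particular cloud provider.
    """
    required = math.ceil(queries_per_minute / capacity_per_worker)
    # Clamp to the allowed range so the cluster never scales to zero
    # or beyond the configured maximum.
    return max(min_workers, min(max_workers, required))


# Hypothetical load over one day: quiet overnight, a peak at 09:00.
hourly_load = {"03:00": 40, "09:00": 1850, "15:00": 600}
plan = {hour: workers_needed(load) for hour, load in hourly_load.items()}
print(plan)
```

Because storage is separate from compute, only the worker count changes here; the stored data stays where it is.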

Modern data platform philosophies

A modern data platform is supported not only by technology, but also by the DevOps, DataOps and agile philosophies. Although DevOps and DataOps have entirely different purposes, each is similar to the agile philosophy, which is designed to accelerate project work cycles.

DevOps is focused on product development, while DataOps focuses on creating and maintaining a distributed data architecture system that delivers business value from data.

Agile is a philosophy for software development that promotes speed and efficiency, but without eliminating the “human” factor. It places an emphasis on face-to-face conversations as a way to maximize communications, while also emphasizing automation as a means of minimizing errors.

Data storage and processing

The first foundational layer of a modern data platform is storage and processing.

Modern data storage systems are focused on using data efficiently, which includes deciding where to store data and how to process it. The two most popular storage architectures are data warehouses and data lakes, although data lakehouses and data mesh are gaining in popularity.

The data warehouse

Data warehouses are designed for managing structured data with clear and defined use cases.

The use of data warehouses can be traced back to the 1990s, when on-premises databases were used for storing data. These early data warehouses had very limited storage capacity.

Around 2013, data warehouses began shifting to the cloud where scalability was suddenly possible. Cloud-based data warehouses have remained the preferred data storage system because they optimize compute power and processing speeds.

For a data warehouse to function properly, the data must be collected, reformatted, cleaned and uploaded to the warehouse. Any data that cannot be reformatted may be lost.

The data lake

In January 2008, Hadoop, which had been developed largely at Yahoo, became a top-level open-source project of the Apache Software Foundation. Data lakes were originally built on Hadoop, were scalable and were designed for on-premises use. However, the Hadoop ecosystem is complex and difficult to operate. Data lakes began shifting to the cloud around 2015, which made them much less expensive and more user-friendly.

Data lakes were originally designed to collect raw, unstructured data without enforcing a schema (format) so that researchers could gain more insights from a broad range of data. When old, inaccurate or useless information accumulates and becomes hard to parse, a data lake can degrade into a less effective “data swamp.”

A typical data lake architecture might store data in an object store such as Amazon S3, coupled with a processing engine such as Apache Spark.

The data lakehouse

Data lakehouses merge the flexibility, cost efficiency and scaling abilities of data lakes with the ACID transactions and data management features of data warehouses. (ACID is an acronym for the four key properties that define a transaction: atomicity, consistency, isolation and durability.)
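
To make the ACID idea concrete, here is a minimal sketch of atomicity, using Python's built-in `sqlite3` module as a stand-in for a transactional engine. SQLite is not a lakehouse, and the `accounts` table and `transfer` function are hypothetical; the point is only that a failed transaction leaves no partial changes behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst as a single atomic unit."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit
        # applied by the first UPDATE has been rolled back as well.
        return False


ok = transfer(conn, "a", "b", 30)        # succeeds
failed = transfer(conn, "a", "b", 500)   # would overdraw "a", so nothing changes
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```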

Data lakehouses support both BI and machine learning. A key strength is their use of metadata layers, and they also rely on newer query engines designed for high-performance SQL queries.

Data mesh

Unlike data warehouses, data lakes and data lakehouses, data mesh decentralizes data ownership. With this architectural model, each domain (for example, a business partner or department) owns its data and shares it freely with other domains rather than handing it to a central team. For that sharing to work, data within the data mesh system should follow a uniform, interoperable format.

Data mesh systems can be useful for businesses supporting multiple data domains. Within the data mesh design, there is a data governance layer and a layer of observability. There is also a universal interoperability layer.

Data mesh can be useful for organizations that are expanding quickly and need scalability for storing data.

Data ingestion

The process of placing data into a storage system for future use is called data ingestion, which is the second layer of a modern data platform.

In simple terms, data ingestion means moving data from various sources to a central location. From there, the data can be used for record-keeping purposes or further processing and analysis, both of which rely on accessible, consistent and accurate data.

Organizations make business decisions using the data from their analytics infrastructure. The value of this data is dependent on how well it is ingested and integrated. If there are problems during the ingestion process, such as missing or outdated data sets, every step of the analytics process will suffer. This is especially true when it comes to big data.

Data processing models

Ingesting data can be done in different ways, and the way a particular data ingestion layer is designed can be based on different processing models. Data can come from a variety of distinct sources, including SaaS platforms, internet of things (IoT) devices and mobile devices. A good data processing model acts as a foundation for an efficient data strategy, so organizations must determine which model is best suited for their circumstances.

Batch processing is the most common form of data ingestion, although it is not designed for processing in real time. Instead, it collects and groups source data into batches, which are sent to the destination. Batch processing may be initiated using a simple schedule or activated when certain predetermined conditions exist. It is typically used when real-time data is not necessary, because it requires less work and is less expensive than real-time processing.
Real-time processing (also called streaming or stream processing) does not group data. Instead, data is obtained, transformed and loaded as soon as it is recognized. Real-time processing is more expensive because it requires constant monitoring of data sources and accepts new information automatically.
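
The difference between the two models can be sketched in a few lines. This is a simplified illustration, not a production ingestion layer: the function names `batch_ingest` and `stream_ingest`, the record shape and the batch size are all assumptions.

```python
from typing import Callable, Iterable, Iterator

Record = dict


def batch_ingest(source: Iterable[Record], batch_size: int) -> Iterator[list[Record]]:
    """Batch model: group records and deliver them a batch at a time."""
    batch: list[Record] = []
    for record in source:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Deliver the final, partially filled batch.
        yield batch


def stream_ingest(source: Iterable[Record], sink: Callable[[Record], None]) -> None:
    """Streaming model: hand each record to the destination as it arrives."""
    for record in source:
        sink(record)


events = [{"id": i} for i in range(5)]

batches = list(batch_ingest(events, batch_size=2))
delivered: list[Record] = []
stream_ingest(events, delivered.append)

print([len(b) for b in batches])  # sizes of each delivered batch
print(len(delivered))             # records delivered one by one
```

In practice a batch job is usually triggered by a schedule or a condition rather than a fixed record count, and a streaming source would be an unbounded feed rather than a list.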

Data transformation

The next layer, data transformation, deals with changing the values, structure and format of data, which is often necessary for data analytics projects. Data can be transformed either before or after arriving at its storage destination when using a data pipeline.

Until recently, most data ingestion models used an ETL (extract, transform, load) procedure to take data from its source, reformat it and transport it to its destination. This made sense when businesses had to use expensive in-house analytics systems. Doing the prep work, including transformations, before delivering the data helped lower costs. Organizations still using on-premises data warehouses will normally use an ETL process.

Many organizations today prefer cloud-based data warehouses (IBM, Snowflake, Google BigQuery, Microsoft Azure and others) because they can scale compute and storage resources as needed. Cloud scalability allows preload transformations to be bypassed, so raw data can be sent to the data warehouse more quickly. The data is then transformed after it arrives, using an ELT (extract, load, transform) model, typically when answering a query.

At this point, transformations can be expressed in SQL and run inside the data warehouse during analysis.
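
A minimal sketch of the ELT pattern follows, with an in-memory SQLite database standing in for a cloud data warehouse. The table names, column names and sample rows are made up for illustration.

```python
import sqlite3

# Extract: raw rows as they might arrive from a source system (made-up data).
raw_orders = [
    ("1001", " Alice ", "19.99", "2024-01-05"),
    ("1002", "BOB", "5.00", "2024-01-05"),
    ("1003", "alice", "7.50", "2024-01-06"),
]

warehouse = sqlite3.connect(":memory:")

# Load: store the data untransformed, with every column kept as text.
warehouse.execute(
    "CREATE TABLE raw_orders "
    "(order_id TEXT, customer TEXT, amount TEXT, order_date TEXT)"
)
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_orders)

# Transform: express the cleanup as SQL that runs inside the warehouse.
# A view applies the transformation each time it is queried.
warehouse.execute("""
    CREATE VIEW orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           LOWER(TRIM(customer))     AS customer,
           CAST(amount AS REAL)      AS amount,
           order_date
    FROM raw_orders
""")

totals = warehouse.execute(
    "SELECT customer, ROUND(SUM(amount), 2) "
    "FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Under ETL, the trimming and type casting would instead happen in the pipeline before the `INSERT`, and the raw values would never reach the warehouse.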

Data transformation has several advantages:

Usability: Standardizing data and giving it the right structure allows your data engineering team to generate business value out of what would otherwise be unusable, unanalyzed data.
Data quality: Transforming raw data helps identify and rectify data errors, inconsistencies and missing values, leading to cleaner and more accurate data.
Better organization: Transformed data is easier for both people and computers to process.
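
As a rough illustration of the usability and data quality points, the sketch below standardizes a few raw records and counts how many still have missing values afterward. The field names (`country`, `signup`) and the accepted date layouts are assumptions made for the example.

```python
from datetime import datetime
from typing import Optional


def clean_record(raw: dict) -> dict:
    """Standardize one raw record; field names are hypothetical."""
    # Normalize whitespace and case; treat empty values as missing.
    country: Optional[str] = (raw.get("country") or "").strip().upper() or None

    # Accept two date layouts and emit ISO 8601; anything else is missing.
    signup: Optional[str] = None
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            signup = datetime.strptime(raw.get("signup") or "", fmt).date().isoformat()
            break
        except ValueError:
            continue

    return {"country": country, "signup": signup}


rows = [
    {"country": " us", "signup": "2024-03-01"},
    {"country": "DE ", "signup": "15/02/2024"},
    {"country": None, "signup": "not a date"},
]

cleaned = [clean_record(r) for r in rows]
incomplete = sum(1 for r in cleaned if None in r.values())
print(cleaned)
print(incomplete)
```

Surfacing unparseable values as explicit missing values, rather than passing them through, is one simple way transformation exposes quality problems.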

Business intelligence and analytics

The fourth modern data platform layer is business intelligence (BI) and analytics tools.

In 1865, Richard Millar Devens used the phrase “business intelligence” in the “Cyclopædia of Commercial and Business Anecdotes.” He used the term to describe how the banker Sir Henry Furnese profited from information by gathering it and acting on it before his competitors did.

Currently, a great deal of business information is gathered from business analytics, as well as data analytics. BI and analytics tools can be used to access, analyze and transform data into visualizations that deliver understandable insights. Providing researchers and data scientists with detailed intelligence can help them make tactical and strategic business decisions.

Data observability

The last of the five foundational layers of a modern data platform is data observability.

Data observability describes the ability to observe the state and health of data. It covers a number of activities and technologies that, when combined, allow the user to identify and resolve data issues in near real time.

Observability allows data engineering teams to answer specific questions about what is taking place behind the scenes in highly distributed systems. It can show where data is moving slowly and what is broken.

Managers, data teams and various other stakeholders can be sent alerts about potential problems so that they can proactively solve them. Although this early warning is helpful, it does not guarantee that every problem will be caught.

To make data observability useful, it needs to include these features:

SLA tracking: Measures pipeline metadata and data quality against pre-defined standards.
Monitoring: A detailed dashboard that shows the operational metrics of a system or pipeline.
Logging: Historical records (tracking, comparisons, analysis) of events are kept for comparison with newly discovered anomalies.
Alerting: Warnings are sent out for both anomalies and expected events.
Analysis: An automated detection process that adapts to your system.
Tracking: Offers the ability to track specific metrics and events.
Comparisons: Provides historical context and anomaly alerts.
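
Several of these features can be sketched together. The example below is a simplified illustration under stated assumptions: the function names, the 24-hour freshness target and the three-standard-deviation threshold are hypothetical choices, not a description of any specific observability product.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def check_freshness(last_loaded: datetime, now: datetime, max_age: timedelta) -> bool:
    """SLA tracking: True when the latest load is within the agreed age."""
    return now - last_loaded <= max_age


def is_volume_anomaly(history: list[int], today: int, threshold: float = 3.0) -> bool:
    """Comparison: flag a row count that is far from the historical baseline."""
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return today != baseline
    return abs(today - baseline) / spread > threshold


# Hypothetical pipeline metadata collected by monitoring and logging.
now = datetime(2024, 1, 10, 12, 0)
last_loaded = datetime(2024, 1, 9, 6, 0)
row_count_history = [1000, 1020, 980, 1010, 990]
row_count_today = 400

# Alerting: collect a message for each failed check.
alerts: list[str] = []
if not check_freshness(last_loaded, now, max_age=timedelta(hours=24)):
    alerts.append("freshness SLA missed")
if is_volume_anomaly(row_count_history, row_count_today):
    alerts.append("row count anomaly")

print(alerts)
```

A fixed threshold is the simplest possible rule; the "Analysis" feature above implies detection that adapts to the system, which this sketch does not attempt.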

For many organizations, observability is siloed, meaning only certain departments can access the data. Philosophically, a data mesh system solves this by requiring the data to be shared, which is generally discouraged in traditional storage and processing systems.

Other modern data platform layers

In addition to the five foundational layers above, other layers that are common in a modern data stack include:

Data discovery

Inaccessible data is essentially useless data. Data discovery helps ensure it doesn’t just sit there. It is about collecting, evaluating and exploring data from different sources to help business leaders gain an understanding of the trends and patterns found in the data. It can clean and prepare data, and is sometimes associated with BI because it can bring together siloed data for analysis.

Data governance

Modern data platforms emphasize data governance and security to protect sensitive information, ensure regulatory compliance and manage data quality. Tools supporting this layer feature data access control, encryption, auditing and data lineage tracking.

Data catalog and metadata management

Data cataloging and metadata management are crucial for discovering and understanding available data assets. This helps users find the right data for their analysis.
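
A toy in-memory catalog shows the basic idea of finding data through its metadata. The `DatasetEntry` fields, the `Catalog` class and the sample entries are all hypothetical; real catalog tools add lineage, access policies and much richer search.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata describing one data asset (all fields are illustrative)."""
    name: str
    owner: str
    description: str
    tags: set[str] = field(default_factory=set)


class Catalog:
    """A tiny in-memory catalog keyed by dataset name."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[str]:
        """Return sorted names of datasets matching the term.

        The term may appear anywhere in the name or description,
        or match a tag exactly (case-insensitive in all cases).
        """
        needle = term.lower()
        return sorted(
            entry.name
            for entry in self._entries.values()
            if needle in entry.name.lower()
            or needle in entry.description.lower()
            or needle in {tag.lower() for tag in entry.tags}
        )


catalog = Catalog()
catalog.register(DatasetEntry("orders_daily", "sales-team", "Daily order totals", {"sales", "finance"}))
catalog.register(DatasetEntry("web_clicks", "marketing", "Raw clickstream events", {"web"}))
catalog.register(DatasetEntry("sales_targets", "sales-team", "Quarterly targets", {"planning"}))

print(catalog.search("sales"))
```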

Machine learning and AI

Some modern data platforms incorporate machine learning and AI capabilities for predictive analytics, anomaly detection and automated decision making.
