What is a Modern Data Platform? Understanding the Key Components
A modern data platform should provide a complete solution for the processing, analyzing, and presentation of data. It is built as a cloud-first, cloud-native platform, and, normally, can be set up within a few hours. A modern data platform is supported not only by technology, but also by the Agile, DevOps, and DataOps philosophies.
Currently, data lakes and data warehouses are popular storage systems, but each comes with some limitations.
Data lakehouses and data mesh storage systems are two new systems attempting to overcome those limitations, and are showing signs of gaining popularity.
The modern data platform typically includes six foundational layers guided by principles of elasticity and availability.
DevOps and DataOps have two entirely different purposes, but both are similar to the Agile philosophy, which is designed to accelerate project work cycles.
Agile is a philosophy for software development that promotes speed and efficiency, but without eliminating the “human” factor. It places an emphasis on face-to-face conversations as a way to maximize communications and emphasizes automation as a way to minimize errors.
The process of placing data into a storage system for future use is called data ingestion. In simple terms, data ingestion means moving data taken from other sources to a central location. From there the data can be used for record-keeping purposes, or for further processing and analysis. Both analytics systems and downstream reporting rely on accessible, consistent, and accurate data.
Organizations make business decisions using the data from their analytics infrastructure. The value of their data is dependent on how well it is ingested and integrated. If there are problems during the ingestion process, such as missing data, every step of the analytics process will suffer.
Batch processing vs stream processing
Ingesting data can be done in different ways, and the way a particular data ingestion layer is designed can be based on different processing models. Data can come from a variety of distinct sources, ranging from SaaS platforms to the internet of things to mobile devices. A good ingestion model acts as a foundation for an efficient data strategy, and organizations normally choose the model best-suited for the circumstances.
Batch processing is the most common form of data ingestion. But it is not designed to deal with customers in real time. Instead it collects and groups source data into batches, which are sent to the destination.
Batch processing may be initiated using a simple schedule, or it may be activated when certain conditions exist. It is often used when the use of real-time data is not needed, as it is usually easier and less expensive than streaming ingestion.
Real-time processing (also referred to as streaming or stream processing) does not group data. Instead, data is obtained, transformed, and loaded as soon as it is recognized. Real-time processing is more expensive because it requires constant monitoring of data sources and accepts new information, automatically.
Modern data ingestion models, until recently, used an ETL (extract, transform, load procedure) to take data from its source, reformatting it, and then transporting it to its destination. This made sense when businesses had to use expensive in-house analytics systems, and doing the prep work before delivering it, including transformations, lowered costs.
That situation has changed, and more updated cloud data warehouses (Snowflake, Google BigQuery, Microsoft Azure, and others) can now cost-effectively scale their computing and storage resources. These improvements allow the preload transformation steps to be dropped, with raw data being delivered to the data warehouse.
At this point, the data can be translated into an SQL format, and then run within the data warehouse during research. This new processing arrangement has changed ETL to ELT (extract, load, transform).
Instead of extracting the data and then transforming it, with ELT data is transformed “after” it is in the cloud’s data warehouse.
Data transformation deals with changing the values, structure, and format of data. This is often necessary for data analytics projects. Data can be transformed during one of two stages when using a data pipeline, before arriving at its storage destination, or after. Organizations still using on-premises data warehouses will normally use an ETL process.
Today, many organizations are using cloud-based data warehouses. These can scale computing and storage resources as needed. The ability of the cloud to scale allows businesses to bypass the preload transformations and send raw data into the data warehouse. The data is transformed after arriving, using an ELT process, typically when answering a query.
There are various advantages to transforming data:
- Usability – Too many organizations sit on a bunch of unusable, unanalyzed data. Standardizing data and putting it under the right structure allows your data team to generate business value out of it.
- Data quality – Transforming raw data can lead to missing values, poorly formatted variables, null rows, etcetera. (It is also possible to use data transformation to “improve” data quality.)
- Better organization – transformed data is easier to process for both people and computers
Data Storage and Processing
Currently, The two most popular storage formats are data warehouses and data lakes. And then there are two storage formats that are gaining in popularity — the data lakehouse and data mesh. Modern data storage systems are focused on using data efficiently.
The Data Warehouse
Cloud-based data warehouses have been the preferred data storage system for a number of years because they can optimize computing power and processing speeds. They were developed much earlier than data lakes and can be traced back to the 1990s when databases were used for storage. The early versions of data warehouses were in-house and had very limited storage capacity. In 2013, many data warehouses shifted to the cloud and gained scalable storage.
The Data Lake
Data lakes were originally built on Hadoop, were scalable, and were designed for on-premises use. In January of 2008, Yahoo released Hadoop (based on NoSQL) as an open-source project to the Apache Software Foundation. Unfortunately, the Hadoop ecosystem is extremely complex and difficult to work with. Data Lakes began shifting to the cloud around 2015, making them much less expensive, and much more user-friendly.
Using a combination of data lakes and data warehouses to minimize their limitations has become a common practice.
The Data Lakehouse
Data lakes have problems with “parsing data.” They were originally designed to collect data in its natural format, without enforcing schema (formats), so that researchers could gain more insights from a broad range of data. Unfortunately, data lakes can become data swamps, with old, inaccurate information and useless information, making them much less effective.
Data warehouses are designed for managing structured data with clear and defined use cases.
For the data warehouse to function properly, the data must be collected, reformatted, cleaned, and uploaded to the warehouse. Some data, which cannot be reformatted, may be lost.
The data lakehouse has been designed to merge the strengths of data warehouses and lakes.
Data lakehouses are a new form of data management architecture. They merge the flexibility, cost-efficiency, and scaling abilities of data lakes with the ACID transactions and data management features of data warehouses.
Data lakehouses support business intelligence and machine learning. One of the data lakehouse’s strengths is its use of metadata layers. It also uses a new query engine, designed for high-performance SQL searches.
Data mesh can be quite useful for organizations that are expanding quickly and need scalability for their data storage.
Data mesh, unlike data warehouses, lakes, and lakehouses, is “decentralized.” Decentralized data ownership is an architectural model where a specific domain (business partners or other departments) does not own their data, but shares data freely with other domains.
Data is not owned in the data mesh model. It is not owned by the people storing it — but they are responsible for it. The data is stored and organized by the business partner or department, with the knowledge the data is to be shared. This means all data within the data mesh system should maintain a uniform format.
Data mesh systems can be useful for businesses supporting multiple data domains. Within the data mesh design, there is a data governance layer and a layer of observability. There is also a universal interoperability layer.
Data observability has recently become a hot topic. Data observability describes the ability to watch and observe the state of data and its health. It covers a number of activities and technologies that, when combined, allow the user to identify and resolve data difficulties in near real-time.
Data observability platforms can be used with data warehouses, data lakes, data lakehouses, and data mesh.
It should be noted Databand has developed what is called a proactive data observability platform capable of catching bad data before it causes damage.
Observability allows teams to answer specific questions about what is taking place behind the scene in extremely distributed systems. Observability can show where data is moving slowly and what is broken.
Managers and/or teams can be sent alerts about potential problems and pro-actively solve them. (While the predictability feature can be helpful, it will not catch all problems, nor should it be expected to. Think of problem predictions as helpful, but not a guarantee.)
To make data observability useful, it needs to include these features:
- SLA Tracking – This feature measures pipeline metadata and data quality against pre-defined standards.
- Monitoring – A dashboard is provided, showing the operations of your system or pipeline.
- Logging – Historical records (tracking, comparisons, analysis) of events are kept for comparison with newly discovered anomalies.
- Alerting – Warnings are sent out for both anomalies and expected events.
- Analysis – An automated detection process that adapts to your system.
- Tracking – Offers the ability to track specific events.
- Comparisons – Provides a historical background, and anomaly alerts.
For many organizations, observability is siloed, meaning only certain departments can access the data. (This “should not” happen in a data mesh system, which philosophically requires the data to be shared, and is generally discouraged in most storage and processing systems.) Teams collect metadata on the pipelines they own.
Business Intelligence & Analytics
In 1865, the phrase ‘Business Intelligence’ was used in the Cyclopædia of Commercial and Business Anecdotes. This described how Sir Henry Furnese (who was a banker) profited from the information he gathered, and how he used it before his competition.
Currently, a great deal of business information is gathered from business analytics, as well as data analytics. Analytics is used to generate business intelligence by transforming data into understandable insights which can help to make tactical and strategic business decisions. Business intelligence tools can be used to access and analyze data, providing researchers with detailed intelligence.
Data discovery involves collecting and evaluating data from different sources. It is often used to gain an understanding of the trends and patterns found in the data. Data discovery is sometimes associated with business intelligence because it can bring together siloed data for analysis.
Data discovery includes connecting a variety of data sources. It can clean and prepare data, and perform analytics. Inaccessible data is essentially useless data, and data discovery makes it useful.
Data discovery is about exploring data with visual tools which can help business leaders detect new patterns and anomalies.
What’s Coming Next?
If you search for “Modern Data Platform Trends” in Google, you’ll see many articles discussing trends on what’s next for the data platform. Topics like metadata management, building a metrics layer, and reverse ETL are getting a lot of focus.
However, the trend of data observability seems universally pervasive in all these articles. Data-driven companies can’t afford to constantly question whether or not the data they consume is reliable and trustworthy.