The healthcare industry is very data-intensive. Multiple actors and organizations are transmitting large amounts of sensitive information. Data engineers in healthcare are tasked with ensuring data quality and reliability. This blog provides insights into how data engineers can proactively ensure data quality and prevent common errors by building the right data infrastructure and monitoring as early as ingestion.
This blog post is based on the podcast episode “Proactive Data Quality for Data-Intensive Organizations” with Johannes Leppae, Sr. Data Engineer at Komodo Health.
The Role of Data in Healthcare
The healthcare industry is made up of multiple institutions, service providers, and professionals. These include suppliers, doctors, hospitals, healthcare insurance companies, biopharma companies, laboratories, pharmacies, caregivers, and more. Each of these players creates, consumes, and relies on data for their operations.
High-quality and accurate data is essential for providing quality healthcare at low costs. For example, when running clinical trials, data is required to analyze patient populations, profile sites of care, alert when intervention is needed, and monitor the patient journey (among other needs).
Quality data will ensure a clinical trial is successful, resulting in better and faster patient treatment. However, erroneous or incomplete data could yield biased or noisy results, which could have severe consequences for patients.
Data Quality Challenges in Healthcare
Data engineers in healthcare need to reliably and seamlessly link together different types of sources and data. Then, they need to verify that the data is complete so that downstream users have full visibility.
However, the complexity of the healthcare system and the sensitivity of its data pose several data quality challenges for data engineers:
- Fragmentation – Data is divided between many data assets, each containing a small piece of information.
- Inconsistency – Data is created differently at each source. This includes variance between interfaces, file types, encryption methods, and more.
- Maintaining privacy – In many cases, like clinical trials, data needs to be de-identified to protect patients and ensure results are not biased.
- Source orchestration – Ingesting data from multiple sources creates a lot of overhead when monitoring data.
- Domain knowledge – Processing and managing healthcare data requires industry-specific knowledge since the data is often subject to medical business logic.
Ensuring Data Quality as Early as Ingestion
To overcome these challenges, data engineers need to find methods for monitoring errors. By validating data at the ingestion point, data engineers can ensure that any issues are captured early. This prevents corrupt data from reaching downstream users, supports regulatory compliance, and helps ensure data arrives on time. Early detection also saves data engineers from having to rerun pipelines when issues are found.
How big is the detection difference? Early detection enables identifying issues within hours. Later in the pipeline, the same issue could take days to detect.
One recommended way to ensure and monitor data quality is through structure and automation. The ingestion pipeline includes the following steps (among others):
- Extracting data files from external sources
- Consolidating any variations
- Orchestrating the pipeline
- Ingesting raw data
- Unifying file formats
To enable automation and scalability, it is recommended to create a unified structure across all pipelines and enforce systematic conventions for each stage.
For example, each pipeline can collect metadata such as source identification, environment, and data stream. These conventions are then checked in a validation step before the data files move downstream.
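A validation step like this can be sketched in a few lines of Python. This is only an illustration, not the setup described in the podcast: the field names (`source_id`, `environment`, `data_stream`) and the allowed environment values are assumptions made up for the example.

```python
# Hypothetical convention: every delivery must carry these metadata fields.
REQUIRED_FIELDS = ("source_id", "environment", "data_stream")
# Assumed set of environments; adjust to your own setup.
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}


def validate_metadata(metadata: dict) -> list:
    """Return a list of convention violations.

    An empty list means the file may move downstream.
    """
    errors = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"missing or empty field: {field}")

    env = metadata.get("environment")
    if isinstance(env, str) and env.strip() and env not in ALLOWED_ENVIRONMENTS:
        errors.append(f"unknown environment: {env}")
    return errors


if __name__ == "__main__":
    delivery = {"source_id": "lab_a", "environment": "qa"}
    for problem in validate_metadata(delivery):
        print(problem)
```

Returning a list of violations instead of raising on the first one lets the pipeline log every problem with a delivery in a single pass.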
How to Deal with Data Quality Issues
The challenges of data-intensive ingestion sometimes require creative solutions. In the podcast this blog post is based on, Johannes describes the following scenario, which his data engineering team deals with constantly.
Late data deliveries are a common issue in healthcare. Komodo Health’s systems had defined logic that matched the file’s date with the execution date. However, since files were often sent late, the dates didn’t match, and the pipeline wouldn’t find the file. This required the team to rerun the pipeline manually. To overcome this issue, the data engineering team changed the logic so that the pipeline picks up all files based on the file’s timestamp instead of requiring an exact match with the execution date. Late deliveries are then captured automatically, without manual intervention.
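The podcast does not spell out the exact logic, so the following is only one possible way to implement this kind of pickup. It assumes that file names embed an ISO date (for example `claims_2024-01-03.csv`), that the pipeline tracks which files it has already processed, and that a lookback window of a few days is acceptable. The function name and parameters are hypothetical.

```python
import re
from datetime import date, timedelta

# Assumed naming convention: the file name contains a YYYY-MM-DD date.
DATE_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2})")


def select_files(filenames, execution_date, already_processed, lookback_days=7):
    """Pick up every unprocessed file dated within the lookback window,
    instead of only the file whose date equals the execution date."""
    window_start = execution_date - timedelta(days=lookback_days)
    selected = []
    for name in sorted(filenames):
        match = DATE_PATTERN.search(name)
        if match is None or name in already_processed:
            continue
        try:
            file_date = date.fromisoformat(match.group(1))
        except ValueError:
            continue  # looks like a date but is not a valid one
        if window_start <= file_date <= execution_date:
            selected.append(name)
    return selected


if __name__ == "__main__":
    files = ["claims_2024-01-01.csv", "claims_2024-01-03.csv"]
    # The 01-01 file arrived late; the 01-03 run still picks it up.
    print(select_files(files, date(2024, 1, 3), already_processed=set()))
```

Tracking processed files keeps the wider window from ingesting the same file twice on later runs.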
In some cases, however, fixing an issue requires going back to the source and asking the team there to fix it. To minimize these cases and the friction they might cause, it’s recommended to set up agreements at the start of the process so that everyone is on the same page. The agreement should include expectations, delivery standards, and SLAs, among others.
You can also make suggestions that will help with deliveries. For example, when deliveries have multiple files, ask the source to add a manifest file that states the number of files, the number of records for each file, and the last file being sent.
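As a rough sketch of how such a manifest could be used, the check below compares a delivery against it. The JSON layout (`file_count`, `files`, `last_file`) is an assumption invented for this example; a real manifest format would be whatever you agree on with the source.

```python
import json


def check_delivery(manifest_text, received_counts):
    """Compare received files (name -> record count) against the manifest.

    Returns a list of problems; an empty list means the delivery is complete.
    """
    manifest = json.loads(manifest_text)
    expected = manifest["files"]  # assumed shape: {"file name": record_count}
    problems = []

    if len(received_counts) != manifest["file_count"]:
        problems.append(
            f"expected {manifest['file_count']} files, received {len(received_counts)}"
        )
    for name, count in sorted(expected.items()):
        if name not in received_counts:
            problems.append(f"missing file: {name}")
        elif received_counts[name] != count:
            problems.append(
                f"{name}: expected {count} records, received {received_counts[name]}"
            )
    if manifest["last_file"] not in received_counts:
        problems.append(f"last file not received: {manifest['last_file']}")
    return problems


if __name__ == "__main__":
    manifest = json.dumps(
        {"file_count": 2, "files": {"a.csv": 3, "b.csv": 2}, "last_file": "b.csv"}
    )
    for problem in check_delivery(manifest, {"a.csv": 5}):
        print(problem)
```

Checking for the last file lets the pipeline tell a delivery that is still in progress apart from one that is finished but incomplete.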
Catching issues and bad batches of data in time is very important, since they could significantly impact downstream users. Caution is especially important in healthcare, where analyses and life-and-death decisions are based on the data.
Choosing the Right Tools for Healthcare Data Engineering
Data engineers in healthcare face multiple challenges and require tools to assist them. While some prefer homegrown tools that support flexibility, buying a tool can relieve some of the effort and free engineers up for dealing with data quality issues.
When choosing a tool, it’s recommended to:
- Determine non-negotiables – features and capabilities the tool has to support.
- Decide on nice-to-haves – abilities that could help and make your life easier.
- Understand the roadmap – to see which features are expected to be added and determine how much influence you have over it.
Whichever tool you choose, make sure to see a demo of it. To see a demo of Databand, which enables data quality monitoring as early as ingestion, click here.
To learn more about data-intensive organizations and hear the entire episode this blog post was based on, visit our podcast, here.