> Episode Details

Proactive Data Quality for Data-Intensive Organizations

How do you ensure data quality when your business relies on data from a wide variety of external data sources? Johannes Leppä, Sr. Data Engineer at Komodo Health, shares his insights on how a data-intensive operation can design its data infrastructure to prevent common errors and high-impact issues. Johannes offers his tips for data teams to get ahead of data errors and proactively manage data quality.

About Our Guests

Johannes Leppä

Senior Data Engineer Komodo Health

Dr. Johannes Leppä is a Sr. Data Engineer and a Tech Lead of a data ingestion team at Komodo Health. For the past several years, he has been building solutions to integrate complex data from vastly varying sources in a systematic and scalable manner, with the duties varying from operational tasks to designing ingestion system architecture and implementing complex data transformation pipelines.

 

Johannes holds a PhD in physics from the University of Helsinki and prior to transitioning to data engineering he was conducting research in the field of aerosol physics at the California Institute of Technology. Even further in the past, he wanted to become a blacksmith, and still dabbles in chainmaille weaving every now and then, but nowadays he gets most of his metal through speakers.

Josh Benamram

Co-founder & CEO Databand.ai

Josh is Co-Founder and CEO at Databand.ai. He started his career in the finance world, working as an analyst at a quant investment firm called SIG. He then worked as an analyst at Bessemer Venture Partners, where he focused on data and ML company investments. Just prior to founding Databand, he was a product manager at Sisense, a big data analytics company. He started Databand with his two co-founders to help engineers deliver reliable, trusted data products.

Episode Transcript

Honor Hey, Josh, how’s it going?

Josh It’s going great. Honor and I are excited to be on the program again today, talking all things data. Excited to have Johannes Leppä on the program today. Johannes is a senior data engineer at Komodo Health, and we have him on to talk about his experiences as a data engineer building the data infrastructure over at Komodo. So great to have you here with us. Thanks for joining us. Why don’t you start by just telling us a bit about your background?

Johannes Thank you, Josh, for the introduction. Thank you, Honor. It’s certainly a pleasure to be here talking with you about data today, very excited about that. But to start with a little bit about myself: I’ve now been working for a handful of years at Komodo Health, in different kinds of roles under data engineering. For the past several years I’ve been mostly working on data ingestion, and for the past couple of years leading the different data ingestion teams that we have. That’s what I’ve been concentrating on lately.

Josh Very cool. Do you want to tell us a bit about Komodo Health and what you do for customers?

Johannes Yeah, certainly. So the mission of Komodo Health is to reduce the global burden of disease. For that, we offer a suite of applications: for example, to analyze patient populations or sites of care, to profile the influence of healthcare providers in specific therapy areas, or for clinical alerting that enables more timely intervention along the patient journey. To give just one very concrete example: let’s say we have a client that’s about to launch a clinical trial. Maybe they want to analyze that patient population to see if there are more potential patients for their trial, because they would need more of those for the trial to be successful. So that’s just one very concrete example, but all of these different solutions that we provide are powered by our very comprehensive data asset called the Healthcare Map. We are a very data-heavy company.

Josh Interesting. So what would be a kind of insight that you provide to your end users? What’s an analytic at the end of the stream that they may be excited about receiving?

Johannes Yeah, so there are a lot of different use cases for that, which I touched on at a very high level. For example, in that concrete example of a clinical trial, it could be understanding where suitable populations would be, and that would in itself be very valuable. Potentially there could even be a chance to choose the site of your clinical trial so that it’s close to those patients. And on the other end of the spectrum, let’s say there is a therapy that is specific to a very rare disease. It’s very hard to find the patients who would benefit from that therapy, but it might need to be a very timely intervention for them to get that benefit. So then alerting information about clinical events that are indicators of a patient who would benefit from that therapy might be something that is useful for that sort of client. So there’s a big variation in these use cases.

Josh Interesting. And what are the kinds of data sources that you’re actually plugging into and ingesting from to be able to pull the data that you need for those types of insights?

Johannes Yeah. So that is where things get complex, because the healthcare system is very complex. As those examples already suggest, there are a lot of different entities. There are healthcare providers and organizations, like doctors and hospitals. There are healthcare insurance companies. There are biopharma companies that develop and provide products, and laboratory testing. And then obviously there’s the patient and their interactions with this system: visits to your doctor, stays in hospital, lab tests. That’s a lot of entities and a lot of actions, and the data tends to be very fragmented. The big problem is that different data assets usually cover a very small piece of this information. So in order to be able to answer very complex questions, you want to have a very complete view. If you only have a very incomplete picture, you can have a lot of bias in your answers: maybe you only know about certain types of populations, or even only certain geographical areas, and that might not be what you’re actually looking for. Going into the sources directly: we don’t really create any of the data ourselves, so all the sources are external to us, and there are several publicly available and proprietary data sources that we use. To give some examples: for clinical trials, there is publicly available information about all the clinical trials happening in the world. And as an example of a proprietary dataset, earlier this year we announced a partnership with Blue Health Intelligence, and the related dataset covers medical and pharmacy claims for more than 300 million de-identified people. That would be an example of a dataset at the very different end of the spectrum.

Josh And is a lot of your secret sauce in how you take data from all these different sources and aggregate it in interesting ways, so that you can see how different associations in the data might lead to new kinds of analytics and insights, like seeing how patient data from a hospital relates to data that’s being picked up from different pharma studies, things like that? Is it a lot about how you aggregate the data together, or is it more about how you extract insights from all these different unique data sources and the different channels of analytics that you’re doing?

Johannes Yeah. So it is certainly a combination of those. One thing is that, because of this fragmentation, being able to reliably link the data sources and different types of information together is certainly a big aspect of the quality of the dataset and of the value that Komodo is bringing. On the other hand, you need a very complete set of inputs: even if you are able to link the datasets you have correctly, if the inputs just don’t have the coverage, you’re still going to have gaps and biases. So it’s both the coverage of the datasets put together and the linkage to actually make them work correctly. And then finally, there is the analysis on top of it: how do you actually use all of that data to answer the questions that you have? How do you analyze it? That’s why we have this multilayer approach, where we get the data in and create a reconciled data product layer that has the comprehensive information, and then on top of that there are the more client-facing applications that analyze that data in different ways depending on the client use case.

Josh Tell us more about the stages of that data flow and how you’re managing different ends of the process: what kind of tools you’re using to get the data and to get it ready, some sort of staging or maybe pre-processing layer, and then how it moves into your analytical system. Curious how you manage that.

Johannes Yeah. So it’s a very layered setup indeed. Starting from the beginning, that’s where I do most of my work. At the very beginning we talk about ingestion; that’s a word that a lot of people use, and they sometimes mean different things. What we have under the ownership of the data ingestion team is actually several phases, which we have divided that way by design. The first phase is just extraction. Most of the deliveries are coming in as files, from whatever source location the data is delivered to; we call it the external location, and it might be owned by us or by the source. The first step is just extracting the files that are delivered into our internal location. In that layer, we have to deal with the variation related to the delivery interfaces, like SFTP or S3. In some cases we would even get hard drives, which usually cause us a lot of trouble; we try to have cloud solutions normally. And even with those, there can be variation in, let’s say, how we are authenticating against S3: is it just credentials that we use, or maybe there is an IAM role that is allowed access? All of that variation needs to be handled just to extract the original file to that location. And there we don’t really touch the file at all; we don’t make any changes yet at that point. We just keep track of some metadata information, like the original file name and the original delivery time. And then the next...
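
To make this extraction step concrete, here is a minimal sketch of the "copy the delivered file untouched, record its metadata" idea Johannes describes. The bucket names, prefixes, and metadata fields are hypothetical, chosen for illustration rather than taken from Komodo's actual tooling, and it assumes an S3-to-S3 delivery.

```python
# Hypothetical sketch of an extraction step: copy delivered files from an
# external S3 location into an internal one without modifying them, while
# recording delivery metadata. Bucket and prefix names are made up.
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")

def extract_delivery(external_bucket: str, external_prefix: str,
                     internal_bucket: str, internal_prefix: str) -> list[dict]:
    """Copy each delivered file as-is and return its tracking metadata."""
    extracted = []
    resp = s3.list_objects_v2(Bucket=external_bucket, Prefix=external_prefix)
    for obj in resp.get("Contents", []):
        original_name = obj["Key"].rsplit("/", 1)[-1]
        s3.copy_object(
            Bucket=internal_bucket,
            Key=f"{internal_prefix}/{original_name}",
            CopySource={"Bucket": external_bucket, "Key": obj["Key"]},
        )
        extracted.append({
            "original_file_name": original_name,
            "original_delivery_time": obj["LastModified"].isoformat(),
            "extracted_at": datetime.now(timezone.utc).isoformat(),
        })
    return extracted  # this metadata would typically be persisted for later stages
```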

Josh Oh yeah, sorry, sorry to interrupt. Just curious: in that initial phase, it sounds like you’re dealing with a lot of different data sources, a lot of different types of data coming in, a lot of different structures of data. How do you manage that layer internally? What kind of tools are you using there to just make sure that the data from the source is coming in as you expect it to?

Johannes Yeah. So that’s a good call, and most of that is actually handled by the next layer. In this layer, as for the tooling: all pipelines are orchestrated using Airflow, and for those interfaces we have actually built tooling in-house. There is so much variation in that extraction that we’ve found it easier to handle these nuances with something we have the full flexibility to build ourselves, as opposed to using an off-the-shelf tool and then trying to figure out whether it can actually handle all of these use cases.
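
As a rough illustration of how such in-house extraction tooling might be wired into Airflow, here is a minimal per-source DAG sketch. The DAG ID, task names, schedule, and the callables are assumptions for illustration, not Komodo's actual pipelines.

```python
# Hypothetical sketch: one Airflow DAG per source, with the interface nuances
# (SFTP vs. S3, credentials vs. IAM role) hidden behind that source's own
# extraction callable. Names and schedule are illustrative.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_source_a(**context):
    # Call the in-house extraction helper configured for this source's
    # delivery interface (e.g. S3 with an IAM role, or SFTP with credentials).
    ...

def ingest_raw_source_a(**context):
    # Hand the extracted files to the raw-ingestion stage described next.
    ...

with DAG(
    dag_id="ingest_source_a",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_source_a)
    ingest_raw = PythonOperator(task_id="raw_ingestion", python_callable=ingest_raw_source_a)
    extract >> ingest_raw
```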

Josh So you didn’t feel that there was a managed tool out there that could have offloaded this work from your team; you needed to build up these processes internally.

Johannes Yeah, that is correct, and it becomes a little more clear when we get to the next stage, where we are doing what we call raw data ingestion. We have actually spent a lot more time working on that one, and I’ll come to the reasons why we are using in-house tooling for it. Once that had been built, utilizing the same structure also for this extraction piece was a much smaller push, so there wasn’t really that much benefit in searching for a different tool that would potentially serve best for that particular use case.

Josh Makes sense. Was there any way that you structured your pipelines, or structured those extraction and ingestion processes, that has allowed you to scale up effectively when you’re managing all those different pipelines and data pools? Because avoiding this is exactly the reason why folks lean on external tools, and you don’t really have that option. Curious if you’ve come up with any best practices for making it feel more manageable.

Johannes Yeah, we’ve certainly felt that pain, and we’ve found some solutions to it ourselves. I can clarify that with the system we have in place for raw data ingestion, because that’s where it really comes down to the approach that we have. Once we have just gotten that original file into our system, we don’t really care about anything else except getting that file in and keeping the metadata. But the next step is to actually get it into the format that we want to deal with. We are using Parquet files on S3 as our data format of choice, but the files are delivered to us in different formats: they can be compressed or not, they can be CSVs, text files, or other formats, and in the worst case, Windows executables, which is painful. So there’s variation in what the original file even is before we can see what the data itself is. That’s why this raw ingestion has a couple of logical stages. First, we take the original file and prepare it into a format we can convert to Parquet. The steps that might happen there: maybe decompression is needed, maybe we need to remove some characters that would cause errors when converting to Parquet. So there are some minor steps to get a file that can be easily converted to Parquet, and then we do that: we convert it to Parquet, and the metadata that we’ve been tracking along the way we add as columns to the file, so that for each record we can keep track of what the original file was, when it was delivered, and so on. The final piece is that after the raw data has been created, we have a validation step where we check that it meets our expectations: are there any issues with this batch of data, or is it good to make available for the next step? What we have had to build in order to manage all of this for a lot of different incoming sources is a set of very systematic conventions for all of these stages. We have a prefix pattern where we encode things like what source it is, which environment it is (dev or prod), what dataset it is (maybe we have multiple streams from a single source), and which stage of the process it is. We keep track of the files in these locations according to this convention, and in order for our pipelines to easily do that, we have built tools like path generators that enforce that these conventions are met: we know where the data is going, and we keep track of all of those locations as we go through. Once that framework has been put in place, it’s reasonably easy to add new pipelines to it in a systematic manner. But we did not have all of that in place when we started to build the first ingestion pipelines to get our first data sources in, and back then it was all over the place: it was very hard to track where the data was, you wouldn’t really know where to find it, and that was really messy.
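
As an illustration of the kind of convention described here, below is a minimal sketch of a prefix generator plus the "add tracking columns, write Parquet" step. The path layout, field names, and the use of pandas/pyarrow are assumptions for illustration, not Komodo's in-house tooling.

```python
# Hypothetical sketch: a path generator that enforces one prefix convention,
# and a raw-ingestion step that appends lineage columns before writing Parquet.
import pandas as pd

def build_prefix(source: str, env: str, dataset: str, stage: str) -> str:
    """Enforce a single prefix layout so every pipeline lands data predictably."""
    allowed_stages = {"extracted", "prepared", "raw", "validated"}
    if stage not in allowed_stages:
        raise ValueError(f"unknown stage: {stage}")
    return f"s3://my-data-lake/{env}/{source}/{dataset}/{stage}/"

def to_raw_parquet(df: pd.DataFrame, original_file: str, delivered_at: str,
                   source: str, env: str, dataset: str) -> str:
    """Attach per-record tracking columns and write the batch as Parquet."""
    df = df.assign(original_file_name=original_file,
                   original_delivery_time=delivered_at)
    out_path = build_prefix(source, env, dataset, "raw") + original_file + ".parquet"
    df.to_parquet(out_path)  # requires pyarrow (and s3fs for s3:// paths)
    return out_path
```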

Honor I was just going to say, that sounds like a really intense stage within your data process. Do you think your ingestion layer is the most data-intensive stage at Komodo Health?

Johannes It certainly is one of the big ones, because after that raw ingestion, what we still do under data ingestion is get from the raw data model the data is delivered in into a unified data model. For example, in the case of claims data, if we have that coming from multiple sources, they all have their own use cases and data models, but in order for us to use it systematically, we need to put it into a unified data model. That data processing part needs a lot of domain knowledge of the data to be sure you are doing it correctly. There’s a lot of business logic in those transformations, and you might need to do quite a lot of data processing in those steps. That certainly adds to the heaviness of the ingestion phase, before it gets to the next layer that makes the reconciled data products, and then the third layer, which is the more client-facing applications. But those layers also have a reasonably large amount of data processing in them, especially creating the reconciled data product, so that is also a very data-intensive layer. And then some of the applications, depending on how much of that is done before the final data is available to the client, may or may not be reasonably data-intensive as well. So it’s really throughout the company that we have a lot of data processing going on.
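
A stripped-down sketch of what mapping one source's raw claims layout onto a unified model might look like follows; the column names and the single business rule are invented to illustrate the kind of transformation described, not the actual mapping logic.

```python
# Hypothetical sketch of a raw-to-unified mapping for one claims source.
# Source column names and the unified schema are illustrative.
import pandas as pd

RAW_TO_UNIFIED = {            # source-specific column -> unified column
    "svc_dt": "claim_date",
    "px_cd": "procedure_code",
    "member_tok": "internal_patient_token",
}

def to_unified_model(raw: pd.DataFrame) -> pd.DataFrame:
    """Rename source columns, keep only the unified schema, normalize types."""
    unified = raw.rename(columns=RAW_TO_UNIFIED)[list(RAW_TO_UNIFIED.values())]
    unified["claim_date"] = pd.to_datetime(unified["claim_date"])
    return unified
```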

Josh What were the biggest challenges that you faced as you built up this infrastructure? You have a lot of tooling in place to manage all these different pipelines; I’m curious what events or issues came up that prompted the buildout of all these different areas of your scaffolding and your infrastructure.

Johannes Yeah, I think a big driver has been realizing the shortcomings of the previous setup. To give a bit more context: data ingestion as a separate unit was started a few years ago. Before that, we were so small that we just had all the data engineers in one team, doing whatever data engineering was needed. Throughout this time the company has grown a lot, and every year it looks very different from what it was before. So we got to the situation where we started reasonably small at data ingestion: hey, we have a bunch of sources, we need to build up some pipelines, get some infrastructure running, and we need to do it, like, yesterday. So it’s: OK, let’s get some pipelines running. We think about what architecture would make sense, come up with the pipelines to get the data in, and things are working, we get the data, everybody is happy. And then we start to realize that more sources are coming. We didn’t have all those conventions in place at the time, and the new pipelines we needed to build required new flexibility; well, there wasn’t even a convention that would have required that flexibility. All of the pipelines are their own special flowers at this point. And then we realize that, oh, OK, now we need to go back and change something on the first pipeline because the delivery changed somehow, but it looks different from what we are building right now. And then we need to onboard a new member, and each pipeline looks different, and it’s like, what’s going on here? And we’re like, OK, this is not sustainable, we need to have better structure here. That has been the driving force: OK, fine, let’s take a step back, let’s go to the whiteboard, let’s think about how we want to do this. Once we come up with better ideas, we start to slowly build the tooling that moves us towards the final architecture where we want to be, build the new pipelines according to that, and then carve out time to get the legacy pipelines to follow the same approach, to get all of those consistent and deprecate the old versions. That’s how we have gradually built the framework that we have now.

Josh I’m curious to dig in more on the validations that you called out as an important part of your process, but before going there: I’m hearing what you’re saying about being able to manage all these different kinds of pipelines in a way that scales across a growing team. We’ve seen different kinds of design patterns and practices to centralize a lot of the functionality that may appear across different pipelines as a way to help manage that scale. For example, centrally defining functions that are shared by different processes, how data gets read and written across the data lake and into the warehouse, or how operators in tools like Airflow are defined in a more central way. So I’m curious to hear more about this framework that you’ve built up internally, and how you minimize the degree of change that may arise when you have all these different datasets and different ways of working with data across a big team like this. Can you talk any more about how you’ve built up this centralized-feeling framework?

Johannes Yeah, it’s certainly been an iterative process; I think that’s important to note. We introduced some unification as we started to have more and more sources, but we had to make a bigger change at some point: OK, we have already had a lot of different sources coming in, we start to see what the patterns are, how we need to handle them, and what logical stages we can come up with. Do all of these stages make sense for all the sources we have now? Do we have reason to believe they are defined in a generic enough way that they can be applied to different situations, even different types of data? Would it make sense for those as well? Is there a logical structure that handles more variation than just a specific case? And once we were happy with that, yes, we see that this does meet the requirements, then we start building it. And if something new still comes with the next source, then we think about what new tool we need to create, and whether our overall architecture is such that these tools can be added one by one, in an incremental fashion. I think the main part is figuring out those logical steps, maintaining that logic very rigorously, and then just adding more tools to handle different parts of that setup. That has been the key to success on that front.

Josh Interesting. So going back to the validation area that you mentioned before: what are the most important validations that you’ve built up? And what kind of challenges did you see that caused you to invest in this particular layer and where you chose to spend attention?

Johannes Yeah. So when it comes to validation, we have different kinds of validation for different parts of data quality. Perhaps one of the most important ones, which we really have to have, is in that raw data validation: when we have a new batch of raw data, we do a bunch of validation to make sure there are no compliance-related issues with it. That’s obviously very important. If we had any unexpected data elements, like we expect a de-identified dataset and identifiable information shows up, that would be in violation of what we expect the dataset to be and of how we can use that data. That is something that needs to be captured very early to make sure there is a small blast radius if something like that happens. On the other hand, when we go through all these steps to get the data to the final unified format, the downstream data users are interested in questions like: is this data according to expectations, does it match our data dictionary and that sort of agreement? Did we get it in a timely fashion would be another factor they look into. We emphasize that final-shape part: we check that the data batch we are delivering downstream matches all of these expectations. We also internally verify that the mapping from the raw data model to the unified data model is as good as possible, but the emphasis is on that final QA before the data is made available downstream, because we feel that’s the most important in that sense. The pro of that approach is that the downstream user who needs the data gets the best possible data. The con is that if we don’t validate at some earlier step, we might catch issues only at the very end, and then we might need to reprocess more. But that is our deliberate design: first do the checks that affect the downstream, and then get to the ones that make our own life easier at earlier stages, so we don’t need to repeat as much work when solving problems.
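
To illustrate the two kinds of checks being contrasted here, a stripped-down sketch might look like the following. The column names, forbidden fields, and data-dictionary entries are invented for illustration.

```python
# Hypothetical sketch of the two validation layers discussed: an early
# compliance check on raw batches and a final "shape" check against a data
# dictionary before handing data downstream. Rules and names are illustrative.
import pandas as pd

FORBIDDEN_COLUMNS = {"patient_name", "ssn", "phone_number"}  # identifiable fields

DATA_DICTIONARY = {            # expected column -> expected dtype (simplified)
    "internal_patient_token": "object",
    "claim_date": "datetime64[ns]",
    "procedure_code": "object",
}

def validate_raw_compliance(batch: pd.DataFrame) -> None:
    """Fail fast if a batch contains columns we must never receive."""
    leaked = FORBIDDEN_COLUMNS & set(batch.columns)
    if leaked:
        raise ValueError(f"compliance violation, identifiable columns found: {leaked}")

def validate_final_shape(batch: pd.DataFrame) -> None:
    """Check the unified batch against the agreed data dictionary."""
    missing = set(DATA_DICTIONARY) - set(batch.columns)
    if missing:
        raise ValueError(f"batch missing expected columns: {missing}")
    for col, dtype in DATA_DICTIONARY.items():
        if str(batch[col].dtype) != dtype:
            raise ValueError(f"{col}: expected {dtype}, got {batch[col].dtype}")
```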

Josh Interesting. As far as today goes, what are some issues that you’re still experiencing in data quality, or in the general quality of the overall system?

Johannes So in terms of data quality, I suppose the most commonly found issue that affects data quality is just deliveries being late. That doesn’t necessarily affect the quality of the batch itself, but getting it late is obviously something our system needs to handle. It’s actually so common that we had to change our system a little bit, the logic of how we handle late deliveries.

Josh When you say late deliveries, you mean deliveries from the data providers. So this might be a hospital system or an insurance company not sending you data that you expect at a certain time, and it not actually arriving at that time.

Johannes Correct, correct. That’s what I mean: from the source, if we expect to get, let’s say, one batch every day, and then one day we actually don’t see the data, and we’re like, wait, what’s going on? Maybe it comes the next day; they just missed it by one day and then deliver it to us the next day. That happens pretty often. We used to have a system where, and this depends on the source, but quite a few sources indicate in the file name which day the data is for. So we had logic where, when we have a pipeline execution, it takes the execution date and matches it to the file name; that’s how we pick up the right files. But we realized that if there is a late delivery, the pipeline runs, it doesn’t find the file, and we would need to manually rerun that pipeline so it could pick up the file again. We changed our logic to solve this problem, so we wouldn’t need that overhead, by picking up all the files delivered with a timestamp in a specific time period. When the pipeline runs, it looks at: OK, yesterday, what were all the files that were delivered? Maybe it’s just the one we expect; maybe it’s two, because we didn’t get any on the previous day. But let’s just pick up all of them. Then a late delivery is automatically captured according to schedule, without the need for manual intervention.
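
Here is a minimal sketch of that late-delivery fix: selecting files by delivery timestamp window rather than by matching a file name to the execution date. The bucket, prefix, and window logic are assumptions for illustration.

```python
# Hypothetical sketch of the late-delivery fix described above: pick up every
# file whose delivery timestamp falls inside the run's window, so a batch that
# arrives a day late is processed automatically on the next run.
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")

def files_delivered_in_window(bucket: str, prefix: str,
                              window_start: datetime,
                              window_end: datetime) -> list[str]:
    """Return keys of all objects delivered inside [window_start, window_end)."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in resp.get("Contents", [])
            if window_start <= obj["LastModified"] < window_end]

# A daily run processes whatever arrived in the previous 24 hours, whether
# that is the single expected batch or a late batch plus today's.
now = datetime.now(timezone.utc)
keys = files_delivered_in_window("internal-bucket", "source-a/extracted/",
                                 now - timedelta(days=1), now)
```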

Josh We see different prioritization or escalation of these different kinds of issues, data delays versus issues in the content of the data. I’m curious why, within Komodo, data delay is still such a severe problem. Is it because all the other problems have been fairly adequately solved? Or is it that this problem in particular is more critical and acute and causes a bigger impact to the business when it occurs? Why does this remain such a big challenge?

Johannes Yeah, so I wouldn’t consider it the most impactful, especially now with the number of sources; it’s just the most common. So from the perspective of the data ingestion team, which has to think about how much time we spend on solving this problem, it’s just very time-consuming. But if we have several sources and one of them is slightly delayed, it doesn’t make a big difference in the big picture of how much data per day is made available downstream, so it’s not necessarily the most impactful in that sense. When it comes to the more impactful ones: obviously if there is something bigger, not just a blip in the data delivery but a longer outage, where it takes several weeks before some major source is back online, or there is an issue with the delivery, that obviously has an impact. But if a bad batch of data is delivered and something is not caught by our system, or there is some logical error in our transformation, and that is made available through these layers, then that can have a big impact on the downstream users. Especially in cases where, let’s say, analysis has been done on that data and already delivered to a client, and then we recognize that, oh wait, that was actually a bad batch, we need to delete something: now all of that analysis will change, and it can be very difficult to go back, change all that analysis, and handle the client experience in those scenarios. That’s why we want to have data validations at a lot of these stages, to catch things as early as possible. If I give one concrete example of what has happened: because we get de-identified information, we use tokenized information for the patients. There is a process during our ingestion where we convert tokens from transit tokens to our internal tokens, which are the ones we use to track patients when we get information from different sources. We had set up a pipeline for a source, we are doing the token transformation in two steps, everything looks good, it’s in production, it runs fine for a long time. Then all of a sudden our next downstream component, the one creating the data product, comes to us asking: hey, what’s going on with this one batch of data, a reasonably large batch, where all of a sudden the number of unique patients is increasing much more than you would expect? And we’re like, OK, it seems that in this batch all the patients are claimed to be new patients, and that can’t be right: we would expect that we’ve seen the vast majority of the patients already. There might be some new ones, of course, but even half of them being new would be absolutely bananas, and everybody being new can’t be the case. We looked into it and realized we had been delivered the wrong type of tokens. It was still tokenized information, so our compliance checks did not flag it; it wasn’t identifiable. But they were raw tokens, so they didn’t match any of our current tokens, and we had to go back to the source to ask about it, and they realized it was a mistake on their side. They had to redeliver that batch, and we had to change that data. Luckily, it was the data product directly downstream from us that noticed the issue, so it didn’t make its way any further. Numbers of patients are often very critical for a lot of analyses, and that sort of spike would raise some questions, especially if we had to dial it back afterwards; it would be a very problematic situation.

Josh Right. How long does it usually take you to catch that kind of problem?

Johannes Yeah, there’s a lot of variation; it really depends on where we catch it. At the moment, because after that incident we introduced an additional check in our pipeline: whenever we get those tokens, we check that there is overlap with our previous tokens, our lookup table, and only then pass them through. With this approach, at the very beginning of the process when we get the batch and start working on it, we’re talking about only hours of difference, and we would immediately get an alert that, hey, this is not going through, so it would be very fast to find. If it went to the next step instead, then depending on the cadence of updates to their data product, it might take several days, or if they update more frequently they might catch it earlier. It really depends on the downstream use case, how much time it would take at that point.
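
A minimal sketch of that kind of token sanity check could look like the following; the overlap threshold and names are assumptions for illustration, not the actual check Komodo runs.

```python
# Hypothetical sketch of a token overlap check: before accepting a batch,
# require that a reasonable share of its patient tokens already appear in the
# existing lookup table. The 50% threshold is purely illustrative.
import pandas as pd

def check_token_overlap(batch_tokens: pd.Series, known_tokens: set,
                        min_overlap: float = 0.5) -> None:
    """Alert if suspiciously few tokens in the new batch are already known."""
    unique = set(batch_tokens.dropna().unique())
    if not unique:
        raise ValueError("batch contains no patient tokens")
    overlap = len(unique & known_tokens) / len(unique)
    if overlap < min_overlap:
        raise ValueError(
            f"only {overlap:.0%} of tokens match the lookup table; "
            "possible wrong token type in this delivery"
        )
```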

Josh And you described the real remediation of this issue: since it’s coming from the source provider, from your data partner, the real remediation is going to that partner and asking them to fix something completely on their side. I’m curious how long it usually takes for that kind of remediation to come back to you, because you’re totally dependent on their team to be able to move that delivery.

Johannes Yeah, absolutely. And that’s again an area where there’s a lot of variation. A lot depends on the source itself, in terms of whether they’re used to this sort of delivery. If this is their bread-and-butter operation, they have sophisticated engineering teams, they recognize the issue, they have their own ways to fix things, and it can be that a few days later we get the new delivery, maybe even the next day; they can be very responsive. If it’s something where they’re more puzzled about what the problem is, it might take a little bit longer, but they tend to be pretty reactive. On the other hand, some of the sources might be less sophisticated on the engineering side, and they might take longer to troubleshoot what the problem actually was, or to come up with a solution for this sort of situation, and then it might take longer for them to deliver. So all of a sudden one or two weeks might pass before they figure out how to actually fix the problem.

Honor Do you feel that Komodo Health has any leverage on the quality of these sources, or is there no direct relationship where you can actually influence the quality?

Johannes Yeah, that depends a lot on the type of partnership. In some situations we have more levers we can pull; in others, not so much. We’ve found that a very important part of it is the very beginning of the partnership, when we are setting up the process and making the agreements: taking the driver’s seat to make sure all the boxes are ticked, that we are on the same page, making no assumptions, always verifying that we are on the same page. Otherwise you get very weird situations that you wouldn’t expect to happen. So that’s the important part, making sure we are on the same page. Even then, there may or may not be a way to influence things: maybe they have their established process and just can’t make any changes to it, or maybe they are setting up this sort of operation more or less for the first time, and anything we can suggest and recommend might actually be very useful information for them as well. As an example of that: a lot of these deliveries can have multiple files, and it’s not always clear how many files you should expect; there might be variation in how many files are in a batch. What would be very convenient is if, at the end of the batch delivery, there is a manifest file. It says: these are the files you should expect, and this is the number of records in each of these files. When you see that manifest file, you know the delivery is complete, so you are not waiting anymore, and you can use it to verify: do I see all the files I should expect, do I see all the records I should expect? That’s something we didn’t think about at the very beginning, when we were working with very few sources. But when we first saw one delivered we were like, oh, this is very useful; why don’t we do this, and why don’t we get all these sources to do it? Most of them are perhaps not keen on changing their ways if they have already set something up, but this is something we have brought into the discussion with some of the partnerships: can you deliver us this sort of manifest? And maybe it’s, oh yeah, they just didn’t tell us about it, or maybe they are able to do it once we bring it into the discussion. So being active never hurts, and neither does asking for these improvements. And sometimes, if we see some real degradation in the quality of the content of the data, we might also need to get into tough discussions: hey, we need to figure something out, this is not what we expect and not what it has been before. So there are situations like that where we may have enough leverage to increase the quality.
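
Here is a minimal sketch of how a manifest could be used to confirm a delivery is complete; the JSON manifest format, field names, and the shape of the delivered-files mapping are assumptions for illustration.

```python
# Hypothetical sketch of using a manifest file to confirm a delivery is
# complete: the manifest lists every expected file and its record count,
# and we compare it to what actually arrived.
import json

def verify_against_manifest(manifest_text: str,
                            delivered: dict[str, int]) -> list[str]:
    """Compare delivered files and record counts to the manifest; return problems."""
    # Assumed manifest shape: {"files": [{"name": "claims_01.csv", "records": 12345}, ...]}
    manifest = json.loads(manifest_text)
    problems = []
    for entry in manifest["files"]:
        name, expected = entry["name"], entry["records"]
        if name not in delivered:
            problems.append(f"missing file: {name}")
        elif delivered[name] != expected:
            problems.append(f"{name}: expected {expected} records, got {delivered[name]}")
    return problems
```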

Josh What kind of SLAs does Komodo, as a data provider, have with your own customers? Because it sounds like there is some sort of agreement you essentially want as a guarantee from your data providers, and you are also a data provider, or insight provider, in some form. Is there something similar that you’re guaranteeing to your end users or customers that helps ensure you’re sending data the proper way down this value chain?

Johannes Yeah, so there obviously are agreements in that area as well, and it really depends on the client-facing application, what type of product it is. If the delivery is, say, a batch of alerts based on rules that we have decided on, then that has to be delivered in a timely way, and there needs to be an agreement on which format it is in and when they should expect it. In other applications, where it’s more about allowing the user to interact with the data, to analyze patient populations or whatnot, the agreements about keeping the data fresh and so on are very different. And as I mentioned, internally we need to have agreements with our direct downstream teams as well: what is the shape of the data, that it follows the data dictionary, and what sort of expectations there are for each batch. For example, that it’s deduplicated, that there are no duplicates within a batch, might be an agreement we have with them. And then, not so much when we can deliver the data, because we are at the mercy of the source to some degree, but we might have an agreement on how quickly we act on the data: when the data drops, maybe we say that for a daily delivery you should expect to see the data within 24 hours. If the data is not delivered, we can’t deliver it, but at least then we need an agreement on at what point we notify them, to give an alert that the source is now delayed: don’t expect the data, we are contacting them and working on it, but just so you know, the data is not being delivered. That communication is an important part of it.
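
As a rough illustration of that "act quickly, alert downstream when a source is late" agreement, here is a minimal freshness-check sketch; the 24-hour cadence, message text, and function shape are assumptions, not Komodo's actual alerting.

```python
# Hypothetical sketch: compare the last delivery time to the expected cadence
# and produce an alert message for downstream consumers when it is exceeded.
from datetime import datetime, timedelta, timezone
from typing import Optional

def check_source_freshness(last_delivery: datetime,
                           expected_every: timedelta = timedelta(hours=24)) -> Optional[str]:
    """Return an alert message if the source has missed its expected cadence."""
    lag = datetime.now(timezone.utc) - last_delivery
    if lag > expected_every:
        return (f"Source is {lag - expected_every} past its expected delivery; "
                "we are contacting the provider, do not expect fresh data yet.")
    return None
```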

Josh So you may have some agreement that says: if data is delayed, it’s your obligation to find that problem within an X-hour period and to communicate that problem to your end client within a Y-hour period, something like that, whether that’s a formal or informal SLA you’ve set for yourselves.

Johannes Yeah, that’s fair.

Honor Johannes, we’re coming up on time, so I wanted to ask: for listeners who are part of data teams experiencing challenges similar to what you face at Komodo, what advice would you give them to ensure data quality?

Johannes Yeah, that’s a good question. I would say that perhaps the thing we found most useful, both for data quality and overall for our architecture, is to come up with these conventions and definitions and to build as much structure as possible. When there is that much complexity, it’s easy to do what we did at the beginning, where everything, every pipeline, is its own special flower doing its own thing. Besides making it very difficult to manage the pipelines themselves and make sure the deliveries are going through, it also raises the question: how do you handle the QA? For different data quality checks and validations, you would need to hand-make them for every pipeline in that sort of scenario. But if you are able to put some structure in place, so that you have clear stages that are being followed and that logic is well defined, then you can think: these pipelines follow this structure, so when I create different validations and QA checks, I know the structure and I know how to repeat them for different pipelines, for different sources, or for different data coming in. That also potentially allows automating some of that testing more easily. So I think that would be the key piece to get in place to get data quality up in such a complex situation.

Honor Interesting. When other teams are looking for help with these kinds of issues, or shopping around for tools, what would you recommend for organizations like yours that are dealing with lots of these external data sources and depend a lot on the reliability of these outside data vendors for good quality? For teams that look similar, what would you recommend?

Johannes Yeah, there are a couple of pointers I could give. It might seem somewhat obvious, but first of all, make sure you have a very thought-through list of the things you know your tool needs to be able to handle, and especially what the non-negotiables are: for sure, this needs to be covered, and there is no reason to move forward if it isn’t. What that is really depends on the use case. Let’s say there is a lot of variation in the different file formats you get; you know you’re going to get fixed-width files, and you need something that can parse them. That might not be something people encounter too often, and I hope they don’t; my encounters with them were not pleasant. So if that’s something you know you need to handle, then you have to make sure you find something that can handle that sort of nuance. And then, on top of the non-negotiables, there are probably some things that are more nice-to-have: hey, we would like to have simplicity in this or that operation. Maybe these tools provide it out of the box, maybe not. Depending on that, it’s good to try to understand the roadmap of the tool: do they already have on the roadmap some of these features that you want, or is there any chance for you to influence that? If, let’s say, you are a client paying for the tool, what is your partnership like? Do you have a chance to influence what the roadmap looks like? Could you potentially change the prioritization so that those nice-to-have features might actually make it into the tool in the future and make it more valuable to you? Understanding those things is very important when you are dealing with a lot of complexity, so that you know the tool can handle as much as you need.

Honor Awesome, thank you. I’m sure our audience in the community is going to really appreciate the advice, and not just that, hearing broadly about the experiences and challenges you’re working on. It sounds like you’ve put together a really impressive and robust environment to be able to deliver good data to your downstream users. So thanks for sharing, Johannes.

Honor Thank you so much for coming on. This was so fun, and there was so much information and such great wisdom, so I’m sure we’ll be replaying it a few times internally as well. So thank you again for coming on.

Honor Thank you so much. Bye.

Johannes Thank you so much.
