Why Data Quality Begins At The Source

Databand.ai Director of Product, Shani Keynan, provides a fresh perspective on how to define data quality and how to control data quality when your data is in motion. Data observability is well understood to be a means to quality data. However, what’s often overlooked is the sheer distance that data must travel from the moment it’s collected all the way to data consumers. This means that data observability must be performed truly end-to-end, starting right at the beginning (the data ingestion layer), in order to be effective at all. Shani offers examples that illustrate how to make data quality an achievable goal and how to apply real-world logic and context to create business impact.

About Our Guests

Shani Keynan

Director of Product Management Databand.ai

Shani Keynan is Director of Product Management at Databand.ai. An entrepreneur at heart, Shani is driven by the pursuit of creating true value. He’s developed products in a variety of verticals, from Bitcoin algotrading at Vidarmo to the automotive data company Otonomo. At Databand.ai, he’s leading the product team to design a proactive data observability platform that buys data engineers valuable time to get ahead of data crises.

Episode Transcript

Honor Welcome, Shani, thank you for coming on to MAD Data to share insights on how to control data quality. Shani, just maybe tell us a little bit about yourself.

Shani Sure. So I’m Shani, I’m Director of Product at Databand. I’ve been dealing with data for a while now, and obviously here it’s data all day. Before this I was a product manager at a startup in the data world as well, working with internal data, and I experienced firsthand how hard it can be to maintain the quality of data and to maintain its usability, especially if you have customers and users that depend on it. Before that I was an entrepreneur in two completely different worlds, fintech and crypto, and I’ve also spent some time on the business side. So, all in.

Honor All right. So you’ve definitely seen quite a bit in data. So we’re going to dive right in on this topic of data quality, which of course has become a hotly discussed subject. And I want to use our time today to really break down the idea of data quality in a way that’s actionable for folks who are listening. So let’s start by getting a clear lay of the land. When we talk about data, sometimes it’s almost as if we’re referring to a static thing, but of course, data transforms, so it’s important to understand the journey. In simple terms, can you paint a picture for us of how data really gets from its collection point all the way to data consumers?

Shani Yeah, that was a nice opening statement, by the way. It’s really a fun thing to say and notice: data, and data quality as a result, is something that changes throughout the data value chain. And I’ll explain what I mean by that. You mentioned the journey, and at the beginning of the journey you’re getting the data. You need to get it from somewhere, and if you’re lucky, you’re getting it from a lot of places. If it’s only about yourself, consuming your own analytics and business information, maybe five, ten, fifteen places. If you’re using data as part of your business operation and making it accessible to your clients, it can go to the hundreds. So you get the data, and now you have to make it fit together. Let’s say you get road data, and you get it from Mississippi and you get it from Kentucky. Well, the data there is pretty different, maybe they describe a one-way street differently, so you need them to talk the same language. Then you put it in your data lake, you ingest it, it gets to the data warehouse, you break it up, you transform it because you have a lot of analysts that want to use it and want their own specific views, and it gets to a dashboard. Obviously this is a simplification of how systems work, and you have a lot of different tools along the way, but at each of these stages, what we mean when we say “data point” is completely different. So I think this is the road we’re talking about.

Honor So it’s a pretty long road, and the data moves quite a few times. And so when we’re looking at data quality, where do we really “control” it, quote-unquote? Is it a matter of finding the right point in the journey? Give us your thoughts on which stage you think makes the most sense for controlling it.

Shani So obviously the ideal case is to watch everything, right? Check your data at each point and you’re going to be great. And if you have only several data sources and you’re not very data-reliant, yeah, it’s OK, you can test everything and check everything and you’re going to be good. But usually when we think about how to look at data quality, it’s actually handled very differently in different areas. You have tools like Great Expectations that basically require you to assert each and every data flow, with logic that reflects your business, throughout your data value chain. And you have tools that basically focus on where the data rests: they look at a specific table, they look at a data warehouse, and they check that everything looks normal, and from that place they trigger and say, yeah, something’s wrong here, you have a freshness problem. And you have tools that aim to do all of the above. And I think this is where we decided to focus strategically: the point of inception. The first place the data gets into your system is ingestion, and being able to look at the entire incoming data there and know if it’s good or bad, especially if you have a lot of sources, is critical, because it affects everything downstream. So I think that’s a good place to start.
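To make the ingestion-layer checks Shani describes concrete, here is a minimal, tool-agnostic sketch in Python. It checks one incoming batch for the two things he highlights, volume and schema, before the data ever lands in the lake or warehouse. The field names, thresholds, and function names are illustrative assumptions, not the API of any specific product.

```python
# Hypothetical sketch of ingestion-layer validation: check a batch's
# volume and schema before it propagates downstream. All names and
# thresholds here are made up for illustration.

EXPECTED_SCHEMA = {"road_id": str, "state": str, "lanes": int}

def check_ingested_batch(rows, expected_min_rows=100):
    """Return a list of problems found in one incoming batch."""
    problems = []
    # Volume check: a source that suddenly sends far less data than
    # usual is the kind of anomaly that breaks everything downstream.
    if len(rows) < expected_min_rows:
        problems.append(
            f"low volume: {len(rows)} rows (expected >= {expected_min_rows})"
        )
    # Schema check: every row should carry the expected fields with the
    # expected types, whichever state the road data came from.
    for i, row in enumerate(rows):
        for field, ftype in EXPECTED_SCHEMA.items():
            if field not in row:
                problems.append(f"row {i}: missing field '{field}'")
            elif not isinstance(row[field], ftype):
                problems.append(
                    f"row {i}: field '{field}' has type "
                    f"{type(row[field]).__name__}, expected {ftype.__name__}"
                )
    return problems

batch = [{"road_id": "KY-9", "state": "KY", "lanes": 2}]
issues = check_ingested_batch(batch, expected_min_rows=1)
```

The point of running this at ingestion rather than on a warehouse table is exactly the one made above: a bad batch is caught before any downstream transformation consumes it.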

Honor Mm-hmm. And what are the benefits of starting that early in the process?

Shani So first of all, I’ll say it really depends. Different companies have different needs, and when we talk to users, we’re very straight about it. We say, OK, it might make sense for you to start at your warehouse: just look at your warehouse, make sure everything’s good, and you’ll still be okay. But some companies are data businesses. They’re using a lot of data sources that they don’t actually control, and we can talk about examples of that later. They have all these data sources coming in, and it can be in the tens or hundreds. Their ability to control it all, to understand what’s coming in and what the volume is for each source, really matters. If I have a data source that is five percent of my data and suddenly it doesn’t arrive, that’s a hard thing to handle. And if you want to see what’s happening downstream, if you want to know when one of your tables downstream breaks, you can do it by looking at the table, or you can do it by looking at the ingestion process. If you know something is wrong when you ingest the data, you’re going to know faster. You’re going to have more time to react, you’re going to know what actually happened once you find out, and you’re going to have the ability to control for it the next time. So it’s not just “something broke, let’s handle it, let’s invest all this data engineering time to fix the data.” It’s more like, OK, we can see that something is about to happen because this ingestion source is looking problematic. One, let’s take care of it. Two, let’s make sure that everyone downstream knows this is coming. And three, let’s make sure it doesn’t happen again. Doing it the other way around is much harder.

Honor But you’re not saying just look at ingestion. You’re saying start there?

Shani Yeah, definitely. It’s really hard to look at only one of these places, because the data doesn’t stay put. We like to talk, and we hear a lot of users talk, about data in motion, and that needs to be addressed. So looking only at the end point of your data, or only at the movement, just doesn’t make a lot of sense, right? Because the data is moving from one place to the other. If you don’t look at the origin, if you don’t look at the source, you haven’t covered the basics: is this a lot of data? Is this a little data? How does the schema of this data, the distribution of this data frame, compare to my database, the place that fuels my business? So you need to look at it all: the inputs, the movement, and where it rests. But ingestion, yes, I see it as the place to start. It’s a very, very good start, because we see a lot of issues there.

Honor So we want to look at basically the entire journey. Start to finish is the ideal.

Shani Yeah. You look at the ingestion, you look at the road, you look at the place where the data rests, and it keeps going from there.

Honor That’s a lot of ground to cover. Would you say that it’s always necessary to monitor the entire journey?

Shani Well, no, it’s not. I think it’s necessary to look at the right places, and again, it depends on the company. A very data-intensive company that has a lot of data sources, not looking at the whole journey will just cause issues, OK? For smaller companies, with smaller and less numerous data sources, just a few, well, it can work. And also, like anything in life, there’s a tradeoff, so prioritize. I’m not saying anything bad about governments, but if you’re using a lot of government APIs, from our experience they change a lot, and they don’t necessarily tell you when it happens. So cover the hardest ground first, and yeah, you can move on from there. It’s not always relevant to cover everything in one shot. Mm-hmm.

Honor So it sounds like it’s pretty use-case specific: when monitoring the whole journey is especially needed, and when you can actually get away with not looking at all of it. But help me visualize this movement of data a little bit. I want to look at some real-world examples. So let’s say we were to compare the difference between an analytics company and a Zillow, which uses external data sources as its business model. What’s the difference in what we’re dealing with in their data?

Shani So let’s start from the last one, from Zillow. Their business is basically the discovery business, right? They’re helping me discover relevant things about the assets I want to find. So in that case, the main data they actually own, because people are listing with them, is where the apartment is, the apartment’s details, and the fact that it’s for sale. But if you look at an apartment on Zillow, you’ll see right there: the schools nearby, how the roads are, who the neighbors are, OK? What’s the average years of education around that address? And when you think about it, that’s not their data. If Zillow gives service in 50 states in the US, they probably have a data source for roads for each one, and for each of the data types I mentioned they have a source. They need to know from somewhere about the schools, from somewhere about the neighbors, and maybe they need to know about the fire department. We’re talking about two, three, four data sources per state, easy. That means someone at Zillow is managing 300, 400 data sources. Some of them are going to be open systems, some of them are going to be government, some of them are going to be state-level, and that is hard to manage. And if someone is trying to rent an apartment in Kentucky using Zillow, the only way they’re going to know that they didn’t receive school data from Kentucky is if that someone complains to them, because they just don’t have enough mass there for it to be significant. So this is a business that heavily depends on external sources; it’s part of its ongoing business operation. By the way, I have a dog and a cat here, and you’re going to hear them fighting in a minute, so I apologize.

Shani I’m getting the looks as we speak, I guess. So, before that happens, let’s talk about the other business. You’re basically a smaller company, and the data you’re consuming is mostly around your own product. You have your Amplitude, Mixpanel, yeah, Salesforce, you have HubSpot. You probably have those pre-made connectors, Segment or any of those, that go directly to your warehouse. So it’s not that data isn’t important, OK? It is fueling the analytics, but you control a lot of it, right? The app data is your own data, you know how the schema is going to behave, and these are trustworthy companies, so it’s going to be good. So these are two cases. In the first one, I would say you should definitely start at ingestion, and if you don’t, something’s not right. And in the other one? Yeah, ingestion is a good place to start, it’s always a good place to start, but you probably have all kinds of other things that can go wrong, and you probably have some other problems as well. Mm-hmm.

Honor Yeah, that does make sense. So it seems like whether you own your data sources or not, and regardless of what space you’re in and whether data directly plays a business role, there is still a huge benefit to starting at ingestion. Now, we hear a lot of talk around causality and lineage, and it comes up every now and then in these conversations. What’s really the value of knowing causality and lineage in a use case like this? Isn’t it all the same when things break? What is the benefit of knowing where it happened?

Shani Well, we live in a world where each pipeline and each data source can affect a lot, a lot of different datasets, a lot of different data tools and data products. So seeing that one source broke upstream matters. If the road data source broke, it’s going to affect my maps, it’s going to affect my application, it’s going to affect a lot of things. The ability to say, OK, this specific thing happened, and here are the three databases, here are the three dashboards, here are the two views that are using it, that is very significant. I’m going to save a lot of trouble for a lot of people. I’m going to be able to fix it because I know the source, I know what’s happening, and I might actually have someone to call and say, something’s wrong here. So it basically orients the entire thing. The flow is: you know that something happened, you know the impact, and you know who the relevant people are, because each of these internal data tables or views hopefully has an owner within the organization. And you can start repairing it, and you prevent the damage.
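The value of lineage described here can be sketched as a simple graph walk: given a map of which downstream assets read from which upstream source, a broken source translates into a concrete list of affected tables and dashboards. The asset names below are made up to echo the road-data example; real lineage graphs are usually extracted by tooling, not hand-written.

```python
# Illustrative lineage graph: source -> assets that read from it.
# Names are hypothetical, mirroring the "road data breaks the maps"
# example from the conversation.
LINEAGE = {
    "roads_feed": ["maps_table", "routing_view"],
    "maps_table": ["traffic_dashboard"],
    "routing_view": ["eta_dashboard"],
}

def downstream_impact(source):
    """Walk the lineage graph and collect everything affected by `source`."""
    impacted, stack = set(), [source]
    while stack:
        node = stack.pop()
        for child in LINEAGE.get(node, []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return sorted(impacted)

# If roads_feed breaks, these are the assets whose owners need a heads-up:
affected = downstream_impact("roads_feed")
```

Knowing this list up front is what turns “something broke somewhere” into “these two dashboards and these two tables are affected, and here’s who to call.”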

Honor So preventing the damage, and that gives insights, basically, on where and how these issues have an effect.

Shani Yeah. Knowing the source, knowing the ingestion, basically gives you the ability, if you have your infrastructure mapped, to know what each source affects.

Honor Let’s think about this in terms of tradeoffs, right? I feel like the idea of starting at ingestion makes a lot of sense, because it’s logical that prevention is the best cure. But I also want to put that into perspective: what is the sacrifice? Is there any tradeoff, to use your language as a PM, for the business in looking at it that early in the process? Is it actually harder to do there, in terms of effort required, than at any other point in the process?

Shani There’s always a tradeoff, always. And I think we at Databand, as a company, work much harder to accommodate different kinds of customers. Because let’s think about it: on the data side, you have all kinds of roads, right? You have the trail you can only drive with a four-by-four, you have the huge highway, you have the walking path, and each of them has its own properties, its own way you need to integrate with it to know what’s happening there. A system at our customers is usually very complicated, it has all these roads, and we need to be able to accommodate that. So we made a decision: we knowingly limited a lot of our potential customer base at the beginning, because we wanted to focus, we wanted to be able to give the most value to the people who have the biggest problem. The tradeoff is that it can be harder for us to accommodate all these different types of customers. If you integrate only at the warehouse level, at the database level, how many of those are there? There are several databases, you have two or three market leaders in the data warehouse space, so it’s easy to just do it. So the tradeoff is ease of integration, on the one hand, on the data warehouse side, versus the ability to accommodate different types of architectures. It becomes much, much harder on the ingestion side.

Honor But that “harder”, is it harder on the business, harder on the engineers setting it up, or on you as the vendor?

Shani and that’s basically OK, it’s hard. Yeah, we need to make our decision. We need to make a decision, say we currently we want to work. One of our one of the reasons for our tight integration with Apple, for example, is that airflow is a great tool to and to get the metaphor of great traffic cop. It looks at all the roads, it knows everything, and that’s that’s a great thing for us. It loves to accommodate and give us a lot of customers at once. And this this was a big decision as a company, as strategy. And so that gives us an advantage and this enables us to give value. So the trade off is that we work harder and

Honor harder on you, right? Right, not on me. So I have one last question, and I want folks to be able to take away some concrete tips. Taking this conversation outside the context of any specific products, what are some action steps that any data engineer, regardless of what tools they decide to use, can immediately take to start taking control of their data quality?

Shani So I think a good way to start is thinking in layers, OK? There are different layers of quality at different stages, so first of all, treat it as such. If you’re writing tests, do the tests at each of these layers: at the warehouse level, the transformation level, the ingestion level, the grouping level. And obviously it’s hard, but that’s definitely a first step: understand and internalize within your organization that quality at each of these stages is different. So this is one tip. A second tip would be to start with the basics, before you even look at your business logic. If data should have arrived and it didn’t, that’s a problem. That’s it, it just is. If X data came in and half of X came out, that’s trouble, or maybe it’s not, but you should check; that’s the kind of anomaly you should notice. If the schema of an internal or external source changed, obviously that’s hard to manage and hard to look at, but this is something that might cause data to break down the road, so it’s also something to watch. So this is a basic, meta level that doesn’t look at the data logic and the business logic itself. The higher level would be to ask: is there any specific knowledge that I have about my data that I need to be very explicit about? Say I have a weighing app. Well, if people start to weigh a thousand, two thousand, three thousand pounds, something’s wrong. And no data magic can help with this; it’s only your business knowledge and your domain knowledge. So find those things that make sense for your business, and cover them both in terms of distribution across your different data sources and in terms of your database itself.
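The tips above can be expressed as plain assertions, from the basic meta-level checks up to the domain rule from the weighing-app example. This is a sketch under assumed thresholds, not a prescription; any real system would make these values configurable per source.

```python
# A sketch of the layered checks described above, as plain predicates.
# Thresholds and the 1000-pound rule are illustrative assumptions.

def check_arrival(rows_received, rows_expected_min):
    """Basic layer: if data should have arrived and it didn't, that's a problem."""
    return rows_received >= rows_expected_min

def check_in_out_ratio(rows_in, rows_out, min_ratio=0.9):
    """Basic layer: if X came in and half of X came out, flag it for review."""
    return rows_in == 0 or (rows_out / rows_in) >= min_ratio

def check_schema(record, expected_fields):
    """Basic layer: a changed schema on an internal or external source
    can break things far downstream."""
    return set(record) == set(expected_fields)

def check_domain_rule(weight_lbs, max_plausible=1000):
    """Higher layer: business knowledge no tool can infer, e.g. nobody
    in a weighing app plausibly weighs over a thousand pounds."""
    return 0 < weight_lbs <= max_plausible
```

The first three checks need no business context at all, which is why they are a good first step; the last one is pure domain knowledge.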

Honor It sounds like it’s almost like taking a common-sense approach to assessing: does everything look right, based on what you know?

Shani You’re ruining the secret! It’s much, much more complicated than that, and common sense alone would never get you through some of it, the government sources especially.

Honor OK, awesome. Well, Shani, this was so helpful. I really appreciate you coming on the show and sharing your insights. I know that Databand also has an open source library that could be of use. Can you tell us a little bit about that, and how folks might be able to use it?

Shani Yeah. First of all, go to our library and look at it. Basically, it allows you to log, especially if you have Airflow, but not only Airflow, into your Airflow or your scheduler, all the information that is relevant about your data and your execution: how long it took to run a specific pipeline and its different tasks, and, if you’re logging specific data, the amounts. It basically lets you play with the different metrics you get out of your data and represent them yourself.
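To illustrate the kind of information such a library captures (task duration, record counts), here is a self-contained plain-Python sketch. This is not the actual API of Databand’s open source library; the class and method names are invented for the example.

```python
# Hypothetical sketch of pipeline-run metric logging: time a task and
# record how much data it handled. Names here are illustrative only,
# not the real library's API.
import time

class RunLogger:
    def __init__(self):
        self.metrics = {}

    def log_metric(self, name, value):
        """Record a named metric for this run."""
        self.metrics[name] = value

    def timed_task(self, name, fn, *args):
        """Run a task function, logging how long it took."""
        start = time.perf_counter()
        result = fn(*args)
        self.log_metric(f"{name}.duration_s", time.perf_counter() - start)
        return result

logger = RunLogger()
rows = logger.timed_task("ingest_roads", lambda: [1, 2, 3])
logger.log_metric("ingest_roads.row_count", len(rows))
```

In a real setup these metrics would be shipped to the scheduler or an observability backend rather than kept in a dict, but the shape of the data, durations and volumes per task, is the same.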

Honor Very cool. All right. Well, thank you again, Shani. I’ll see you. Bye.
