Honor Hey, Harper. How’s it going?
Harper Going great, just rocking that data day. You know, like every day really is a data day. How’s everything with you?
Honor It’s pretty good. Can’t complain. Excited to chat with our guest today?
Harper Yeah, absolutely. We’ve got Vivek here from Netflix, who is going to talk to us a little bit about data quality at scale. So Vivek, why don’t you introduce yourself to our listeners and tell us a little bit about your background and how you got into working with data and data quality?
Vivek Yeah, thanks. I work as a senior software engineer at Netflix, and I’m part of the machine learning platform team, where I work on the recommendation engine. I work on providing data for our recommendations. So there are dozens of petabytes of data, and we are trying to give our members the best recommendation experience. At Netflix, on the recommendation side, the data tends to be simple, but at a much larger scale compared to the companies that I worked at before. Before Netflix, I was at Apple, on the Maps team, working on ingesting data that was coming from vendors like TomTom. This was the geo data, and this data was much smaller compared to what I work on at Netflix, but much more complex. I’m talking about thousands of attributes and tens of different names. It contained information about roads, water bodies, intersections, speed. So that’s what made up thousands of attributes. And before Apple, I was at Amazon on the EBS team, and out there I was working on the server team, where we were managing the customers’ data. Well, not directly: the customer data was totally obscured from us, but we were dealing with large-scale metadata at that point. We were managing thousands of servers, and we were looking for anomalies on those servers, and this was back in 2011, before anomaly detection became popular.
Harper Yes, that sounds like a lot of fun. It sounds like you were kind of on the forefront there in understanding how we can apply anomaly detection to data quality, and it sounds like a lot of really interesting experience, whether at your current role at Netflix or at Apple and Amazon before this. I’m curious about those experiences working with data at that scale, specifically focusing on the quality of that data: what was the same between those experiences, and what differences did you see?
Vivek So the similarities that I find with data quality are that data quality is about understanding the use case. It’s very easy to say that we want perfect data, but much harder to achieve that. It’s easier to define what good quality data means for a particular use case and focus on achieving that. Let’s talk about the Netflix example first. For the data that my team manages, we have an availability SLA of ninety nine point five percent. Well, within the team, we try to achieve ninety nine point nine or ninety nine point nine nine percent. And as of now, we are at ninety nine point nine ninety nine percent. Yet we have not changed the SLA, because we do not need to change it. Our customers don’t need that much. They don’t need it to be above ninety nine point five percent. So we can focus on other things. And the interesting thing is that customers don’t mind having some missing or duplicate data. But errors within the data are a very different story. They really care about not having errors within the data. So it’s fine if we train on a member’s data twice, because we are training on so much data that even if we end up training on a few members’ data twice, it’s not going to impact the output. But if there was an error within the data, that would have a significant impact. Whereas at Apple Maps, missing or duplicate data could lead to no Highway 101 or two Highway 101s. So either I cannot go from Mountain View to San Francisco, or I may have two different routes, and that’s confusing. These are both bad scenarios. We don’t want to be in the scenario where we don’t have a Highway 101. So missing or duplicate data was a strict no. But the freshness of the data was not that big a concern. I mean, it was a concern, but the correctness mattered way more. A pipeline that failed was more acceptable, but shipping bad data was not.
Honor If I can jump in really quickly, just as a Netflix customer: at Netflix, what happens if the data is not hitting ninety nine point ninety nine percent? What is the outcome of a missed data SLA?
Vivek A missed data SLA does not impact the model that much. Let’s take the example, as I said, of what would happen if we don’t train our model on one member’s data. We have more than 200 million members, and we are training on a sample of them. We might be training on ten percent, and that’s like 20 million at that point. Say we have data for 20 members missing. And this is very random. It’s not that it’s the same member. It’s due to any kind of issue in the pipeline. It usually happens that we end up missing a member’s data randomly, and that’s not a big concern unless we are missing the same member’s data every day, every time we are training. It’s not a big concern because we have the other 20 million members to train on for this particular model right now. So the impact is very low.
Harper One thing you mentioned earlier when you were talking about data quality is that it really comes down to understanding the use case, and I think that’s a really salient point for anybody working in data management and data quality. And I think it’s something that anyone outside of that purview may not necessarily consider, because as a data engineer, you have your own perspective on the quantitative quality of data. I can tell you if my values are correct. I can tell you if my data fits my schema and if the model is as I expect it to be, or if it appears that it’s starting to overfit, et cetera. But without understanding the qualitative nature of that and engaging with either the users or the stakeholders in that data management lifecycle, it’s hard to really pinpoint what the source of truth would be, if you will. It’s something we talked about on another episode. So I’m curious, from your experience, how important is it to work with the stakeholders, whether that’s on the business side or the product side, or the users of a product like Apple Maps or Netflix, to understand that use case and get their feedback? And then how do you go about translating that feedback from the users into what you would consider that source of truth? What does that process look like?
Vivek I think getting involved in the requirements gathering is very important. It’s having those conversations with the stakeholders about what impacts the models and how they are consuming the data. At Apple, it was not a machine learning model; the data was consumed by the mapping team, so the application that you see on your phone is the one that’s consuming that data. Having that requirement very clear, what it means when we ship bad data to them, is what provides us clarity on what we need to focus on. At Netflix, the freshness of the data matters so much that our data quality checks run in parallel. We do not run the data quality checks on the critical path. We ship out the data and then we keep running our data quality checks on the side, because in case there is something bad, we can go to our customers and tell them, hey, stop this, we had a bug in our code or something, and the data is bad. But at Apple, that was a big no. You cannot ship bad data, because that might lead to some major issues downstream. And so if we had not had those conversations with our stakeholders, we would not know that they’re fine with stopping the pipelines, or that if we run our data quality checks for two hours on the side, their pipelines are not going to complete in the meantime.
Harper The distinction that you point out there between your experience at Netflix and Apple Maps makes me think about the conversation I’ve had with some of my peers about whether you want to take a proactive approach or a reactive approach to data quality. You talk about Netflix here: it’s fine if you run these pipelines and you ship this data out, and then you run that data quality on the side to understand what the issues may be and adjust from there. That’s that kind of reactive approach. But it sounds like in the Apple Maps experience, you really wanted to take that proactive approach. What do you think guided Netflix towards that reactive approach versus Apple towards that proactive approach? Is it company culture? Is it the use case?
Vivek I think it’s the use case. When we say reactive or proactive, it sometimes does sound like, at Netflix, when the data goes bad, that’s when we’re reacting. It’s not like that. We are just understanding our use case, that freshness is more important. We understand the use case, and that if we push out the data and we provide signals to stop the pipelines before the bad data gets consumed and the model gets updated, it’s fine. So we are proactive in terms of understanding the use case at Netflix: hey, freshness matters more than running data quality checks first. We run them every time, but we can run them on the side. It’s not that we are running them on the critical path.
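A minimal sketch of the "publish first, check on the side" pattern described above. This is not Netflix’s implementation; the function names, the `watch_seconds` field, and the use of a `threading.Event` as the stop signal are all assumptions made for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def publish_then_check(batch, publish, checks, halt_signal):
    """Publish first for freshness, then validate off the critical path."""
    publish(batch)  # consumers can start reading immediately

    def run_checks():
        failed = [check.__name__ for check in checks if not check(batch)]
        if failed:
            halt_signal.set()  # tell downstream consumers to stop
        return failed

    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(run_checks)
    executor.shutdown(wait=False)  # do not block the publisher on the checks
    return future


def no_negative_durations(batch):
    # Hypothetical check: watch time can never be negative.
    return all(row["watch_seconds"] >= 0 for row in batch)


published = []
halt = threading.Event()
bad_batch = [
    {"member_id": 1, "watch_seconds": 120},
    {"member_id": 2, "watch_seconds": -5},
]
result = publish_then_check(bad_batch, published.append, [no_negative_durations], halt)
failed_checks = result.result()  # waiting here only for the demo
```

Consumers would poll `halt` (or whatever signal replaces it) before training on the batch, which is what keeps the checks off the critical path.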
Honor That makes sense, yeah. With every use case, what kind of deliberation goes into deciding on the tradeoffs of where data quality should sit?
Vivek That’s an interesting question. I think it depends on what the company really wants, what the primary motive for the company is. At Netflix, it’s about providing the best experience for our members, and that experience means that we are training on the freshest possible data, that we want to train on the latest data. That’s how we can get the training right, and that’s how we can give our customers the best experience. At Maps, it was about the customer experience too, but it was about accuracy in the data. It was about somebody not having an issue going from Mountain View to San Francisco. As I said, we don’t want to be missing that 101. Even if there was an update that got missed, or it was delayed by a few hours, that’s fine. Whatever picture we have of the world, it needs to be accurate, even if that is 10 hours behind. That’s not the case at Netflix. And at Amazon, it was about being invisible to the customer. Like when you have an iPhone, you don’t think about the storage. You’re using an iOS app and you’re taking pictures, and you don’t think about where that data is getting stored. So at Amazon, on the EBS team, it was all about being invisible to the customer, that customers should not even realize that there is a need for them. It just works. I mean, you just
Honor keep upgrading, right? That’s what I get. Like storage? You just take 10 videos a day of your food.
Harper Like, Yeah, yeah, buy a
Vivek new get get good gigs. Yes, exactly.
Harper Well, one thing I’d add here, too, is that it comes down to the context that the data is describing, like what entities it’s describing, and then the context that it’s being used in as well. Because as I touched on a little bit earlier with that proactive or reactive approach, I think it’s important to remember that the context you’re talking about from Netflix is the recommendation engine, right? With the recommendation engine, it’s OK to evaluate how that model is coming back and then improve it iteratively over time, because the customer isn’t going to be mad if you give the worst recommendation. They’re going to go, actually, no, I don’t want that, and then that teaches your model something more. Whereas I assume if you’re working on the livestreaming team, the video platform itself, you’re going to have a more proactive approach. And that’s where it sounds like Apple Maps aligns more with that context. You can almost think about it as: the closer to near real-time information delivery your use case is, the more proactive you need to be with the way that your data quality works, because the correctness becomes more important whenever you’re talking about near real-time use cases. I won’t definitively draw a line between batch and streaming processing, because I think those don’t perfectly align with proactive and reactive. But context really is important, and that’s what you’ve been hitting on here with the use case scenario. It’s really interesting when you talk about all these companies and how they see it a little bit differently, right?
Honor I mean, I do want to push back on that, though, Harper, because what I’m hearing is that it really isn’t a distinction between proactive versus reactive, because that’s too black and white. I think what I’m hearing is, yes, there is this idea of a proactive approach that makes sense in certain use cases. But then there’s also this idea that there might be one specific dimension of data quality that is identified as the most important for a specific use case. And it so happens that in prioritizing that, it doesn’t require a proactive positioning. Does that sound right, Vivek? My understanding of that?
Vivek That’s absolutely right. There are so many dimensions from which we can look at data quality, and it really matters which of those we actually end up prioritizing. So it’s absolutely the right thing, and we find it
Honor hard to disagree.
Vivek But actually, Netflix encourages it. Within the team, we encourage disagreements. We encourage dissent. So I’m very used to that. We are looking for the counter opinion to get the results
Honor That’s so healthy. I love it. All right. Let’s go, Harper.
Harper We’ll be selling tickets to our verbal sparring boxing match later in the night. But I love hearing that about the Netflix culture, right? I know that’s something the company is kind of known for, being very open and honest about the culture. And I totally agree: the only way that you really grow as an individual or as a company, or even as a philosophy, is having someone ask questions and counter the truths that you are putting out there. That is the only way that you refine those thoughts and make them better. And that’s all part of the data science process, right?
Vivek Absolutely. Unfortunately, I cannot disagree on that comment, but
Harper I quite agree. So I’m curious, focusing on the Netflix experience. We’ve talked about whether it’s proactive or reactive, and data quality is important, we recognize that. But what are the kinds of data errors that you think customers really notice at Netflix? What’s the big red flag that’s going to keep you up at night if it were to show up in your processing?
Vivek The big red flags are mostly around drastic changes in the values. As I said, we’re not concerned about missing or duplicate records. It’s the errors within each record. So if there’s a big change within a particular value, let’s say default values suddenly start going from null to empty string, or from a null value to a zero, or there are drastic changes in the categorical values, these are the bugs that our customers would notice. Let’s take an example: for some reason we stop ingesting data from the new members that are coming into the service. That would impact several different components, because viewing history, which stores what the members viewed, is a very important data set for us, and that data set would stop getting any data for the new members. And now our model is getting trained only on members who joined at least a day ago or two days ago, and that would mean that for any new member that joins, we are not going to be providing them the best experience, because we never trained on that data. Now say it’s been five days and we have not trained on any new members. So now the best data for the new members is from a five-day-old experience. And that can have a significant impact on the experience of the members and, yes, of our customers, because our customers are going to notice it, and any new member that’s joining would notice it.
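The default-value shift described here (null turning into empty string or zero) could be sketched roughly as follows. The bucket names, the threshold, and the sample values are illustrative assumptions, not anything stated in the conversation, and the sketch assumes non-empty inputs.

```python
from collections import Counter

KINDS = ("null", "empty_string", "zero", "other")


def value_kind(value):
    """Bucket a raw value so default-value shifts show up as rate changes."""
    if value is None:
        return "null"
    if value == "":
        return "empty_string"
    if value == 0:
        return "zero"
    return "other"


def kind_rates(values):
    """Share of values falling in each bucket (assumes a non-empty list)."""
    counts = Counter(value_kind(v) for v in values)
    return {kind: counts[kind] / len(values) for kind in KINDS}


def shifted_kinds(baseline, current, max_abs_change=0.05):
    """Return the buckets whose share moved by more than max_abs_change."""
    before, after = kind_rates(baseline), kind_rates(current)
    return sorted(k for k in KINDS if abs(after[k] - before[k]) > max_abs_change)


# Made-up column snapshots: yesterday's missing values were null,
# today's arrive as empty strings.
yesterday = [None] * 10 + ["US"] * 90
today = [""] * 10 + ["US"] * 90
shifts = shifted_kinds(yesterday, today)
```

Here `shifts` names both the bucket that shrank and the one that grew, which is the kind of within-record change the row counts alone would not reveal.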
Harper When you’re considering these errors that are going to keep you up at night, and how you avoid having these major shifts in data values or parameters occur, how do you balance that with trying to innovate and try new features or new models or new scenarios to improve the engine at the end of the day? How do you strike that balance?
Vivek Oh, yeah. So I think that’s where the monitoring and the data quality checks come in. My team is responsible for providing the data, and we are not into training the model or how that impacts the member directly. So my job is to make sure that the data that goes into training these models is of good quality and there are no issues with the data. For that, we have three different kinds of checks that we do, and we are able to detect any kind of data corruption by slicing and dicing the data in multiple ways. The first one is aggregations, the second is consistent sampling, and the third is random sampling. Let’s go over all three of them. We are dealing with very large-scale data, and we cannot directly work with that. We cannot validate each and every value in it. So to check for changes in a column, we aggregate the data. We are trying to find out, at the column level, what’s happening, and we aggregate those values and then we look at the historical patterns. Let’s take the example of the thumbs-up ratings that our members are giving. There’s a pattern there, that this much percentage of members are giving thumbs up on a daily basis, so we expect that ratio to remain nearly the same or to increase over time, but we don’t expect a sudden increase of 10x or a sudden decrease of 10x. So we can aggregate on the thumbs up and find out how many thumbs ups and how many thumbs downs there are. We are OK with small changes, but any significant change, going 2x or going to half, is something that we would start noticing, and we would get alerted, we would get a page: hey, this value does not look good, why is it happening? Because of these aggregations, we have been able to find issues with our upstream services, and that has helped keep the model quality good.
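A minimal sketch of the aggregation check just described, assuming the daily thumbs-up ratio is compared against the median of recent days and flagged when it doubles or halves. The history values and the choice of a median baseline are illustrative assumptions.

```python
from statistics import median


def thumbs_up_ratio(ratings):
    """Aggregate a day's ratings into a single number: share of thumbs up."""
    ups = sum(1 for r in ratings if r == "up")
    return ups / len(ratings)


def is_anomalous(today_ratio, history, max_factor=2.0):
    """Flag the day if the ratio at least doubles or halves versus the baseline."""
    baseline = median(history)
    return today_ratio > baseline * max_factor or today_ratio < baseline / max_factor


# Made-up ratios for the previous five days.
history = [0.20, 0.21, 0.19, 0.22, 0.20]

normal_day = ["up"] * 21 + ["down"] * 79
spike_day = ["up"] * 45 + ["down"] * 55
drop_day = ["up"] * 8 + ["down"] * 92

normal_flag = is_anomalous(thumbs_up_ratio(normal_day), history)
spike_flag = is_anomalous(thumbs_up_ratio(spike_day), history)
drop_flag = is_anomalous(thumbs_up_ratio(drop_day), history)
```

Small day-to-day movement passes, while the spike and the drop would both trigger a page. Because only one aggregate per column is compared, the check stays cheap even when the underlying data is too large to validate value by value.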
The second one is consistent sampling, and this one is more for detecting the bugs within our pipelines and within the pipelines of our customers, whereas the aggregations were to detect the issues that happened in the upstream.
Vivek Consistent sampling is for us and our customers. We have a shared sampling strategy, where we sample a very small percentage of the data, and our pipelines and our customers’ pipelines can run their candidate code against this data and compare the output against production. So now we can deploy code changes against this shared sampled data and compare it to production to detect any kind of changes or code bugs that can happen. Any bad code changes can be detected, because production is going to have a different output compared to the candidate in case of any bugs. And there’s a third one, which is the random sampling.
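One way the shared consistent sample and the production-versus-candidate comparison could be sketched is with a hash of the member ID, so that every pipeline picks the same members on every run. The hashing scheme, the sample rate, and the `seconds` field are assumptions for illustration; the conversation does not say how the sample is actually chosen.

```python
import hashlib


def in_consistent_sample(member_id, sample_rate=0.001):
    """Deterministically keep the same members on every run and in every pipeline."""
    digest = hashlib.sha256(str(member_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


def diff_outputs(rows, production_fn, candidate_fn):
    """Run both code versions on the same rows and return the rows that differ."""
    return [row for row in rows if production_fn(row) != candidate_fn(row)]


# The same member IDs are selected no matter who runs this or when.
first_run = [m for m in range(100_000) if in_consistent_sample(m, 0.01)]
second_run = [m for m in range(100_000) if in_consistent_sample(m, 0.01)]


# Hypothetical transformation: production rounds down to whole minutes,
# the candidate accidentally switched to true division.
def production(row):
    return row["seconds"] // 60


def candidate(row):
    return row["seconds"] / 60


rows = [{"seconds": 120}, {"seconds": 90}]
mismatches = diff_outputs(rows, production, candidate)
```

Because the sample is a pure function of the ID, producers and consumers can compare outputs on identical inputs without coordinating a snapshot.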
Vivek That’s a more happy-go-lucky strategy, because a significant percentage of the data that passes through the pipelines is never consumed. That’s a pet peeve of mine, that we are storing way more data than we need to. So in that case, if the customers are never reading it and we are storing that data, we don’t know if there’s a bug in it. If the data is bad, then when the customers do start asking for that data, the pipelines might start failing. To keep that data in good shape, we randomly sample our data at a very low percentage, and believe it or not, we have found issues with this random sampling. Just by reading the data randomly, we find issues where we cannot read the data, and we are able to capture those before our customers, before the downstream pipelines start consuming.
Harper Are those issues unique to your team, or is that something that you see kind of across the entire data organization?
Vivek I think it’s mostly in our code or in the upstream. If there’s an issue in the upstream that’s happening at a very low percentage, we do catch it using the random sampling strategy, because the consistent sampling might be missing that. Let’s say consistent sampling is at 0.1 percent or 0.01 percent, but there is an issue in the upstream that is happening at 0.0001 percent. We are able to catch that with the random sampling, because we just do random sampling every so often, and we are still able to catch it.
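A rough sketch of the random read check, plus the arithmetic behind why repeated random samples eventually catch a very rare issue. It assumes records are stored as JSON strings and that corrupted records are independent of each other; both are assumptions for illustration only.

```python
import json
import random


def unreadable_records(raw_records, sample_rate, rng):
    """Try to parse a random subset of records and return the ones that fail."""
    failures = []
    for raw in raw_records:
        if rng.random() >= sample_rate:
            continue  # not sampled this time
        try:
            json.loads(raw)
        except json.JSONDecodeError:
            failures.append(raw)
    return failures


def detection_probability(issue_rate, records_sampled):
    """Chance that at least one corrupted record lands in the random sample."""
    return 1 - (1 - issue_rate) ** records_sampled


records = ['{"a": 1}', '{"a": ', '{"b": 2}']
# sample_rate=1.0 reads everything, which keeps the demo deterministic.
failures = unreadable_records(records, sample_rate=1.0, rng=random.Random(0))

# An issue affecting 0.0001 percent of records (1e-6), after one million
# randomly sampled records accumulated over many runs.
chance = detection_probability(1e-6, 1_000_000)
```

A fixed consistent sample either contains an affected record or it never does, whereas each fresh random sample is another independent chance, so the probability of detection keeps climbing with every run.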
Honor That’s really cool, and I really appreciate the walkthrough on how the ratings play into this. Given that kind of framework, precision isn’t as important. And so you can, I don’t even want to say get away with random sampling; it’s more like it’s a perfect setup where this is a very easy way to actually check if things are correct. Do you find that in mature pipelines, there have been any scenarios where you’ve been surprised by how they behave? Because it seems like you are able to predict behavior pretty well. I’m curious if anything has actually caught you off guard.
Vivek So in terms of the behavior of the mature pipelines, it’s mostly how wide-reaching the implications are: once we have a mature pipeline, how many data engineers and researchers want to use it. I don’t think there are any surprises; we are trying to avoid the surprises. That’s the job that we have. The biggest surprise is that once we have a good system, a system that’s easy to use and provides good quality data, how many different ways researchers can use it to improve the member experience. In terms of just the data, I don’t think I have had any kind of surprises as of now, and I’m hoping that it stays that way.
Honor I don’t just say right, right?
Vivek Yeah, that’s the
Harper that’s the prayer to the patron saint of data for every data engineer when they start their work every day. Yeah, it’s really interesting. I love hearing how you think about data quality and the different use cases, and how you are trying to address the different issues and errors that exist within your knowledge and experience of data quality, and ensuring that they don’t come up as a surprise in your current work at Netflix. I also can totally relate to your comment about having a ton of data that comes in and then gets stored, but then isn’t used or isn’t utilized as much as I may want it to be. One common experience I find across the companies that I’ve worked at is that everyone wants to capture as much data as possible, even if there isn’t a particular use case for it in the moment. And in theory, it makes perfect sense, right? Because as you grow and as you have more capabilities, a new data science team or your machine learning engineers come in, and then you have this historical information and this wealth of knowledge that you can potentially extract more information from. But as a data engineer, you’re just looking at these petabytes of data and you’re like, OK, how do I make sure this stays maintained? How do I make sure it stays fresh? How do I make sure that access is always available? And it’s not a bad thing, but it just adds to the complexity of the general curation and stewardship that exists as part of the data engineer role.
Vivek I completely agree with that. We have made it a norm that more data is good, but we have forgotten about the secondary costs associated with storing more data. There are a lot of secondary costs: data access becomes slow, it becomes harder to iterate on it, and it becomes harder to quality check that large an amount of data. So absolutely, we need to start thinking about maybe not storing as much data, so that we don’t need more storage every day and we don’t have to upgrade it every year.
Honor Apple would disagree.
Harper Yeah, it’s like the unintended consequences of storage becoming cheaper with the advent of various cloud platforms being able to make data storage and access democratized.
Harper I like what you mentioned, though. You sounded very forward thinking in your last comment there, thinking about how we should consider storing data and capturing data. And I’m curious what other thoughts you may have about what’s the next frontier when it comes to data quality. Where are you interested in poking around and seeing how data quality can improve, or how these processes can improve, or how we can change the way that the industry thinks about data quality?
Vivek Yeah, I think so. Well, one of the problems that I see with data quality is how fragmented the solutions are. I know it goes against what I said, that we need to focus on data quality by the use case. But at the end of the day, I think those use cases can be bucketized, that these are the 20 different use cases. That may not be that hard. At the same time, over the last 10 years that I’ve worked on data at various companies, I have never seen two solutions that were similar. I don’t remember seeing two solutions that were even remotely similar. It’s so fragmented that every time, you have to start thinking from scratch: how do I build a solution? Think about other areas of software engineering. Say, continuous integration: yeah, we can use Jenkins. Think about the build solutions: there’s Gradle and there’s Maven. There are a few other options, and these are the few things that we can talk about. But when I talk about data quality, what comes to your mind? There is nothing there that says, hey, why don’t you start with this? You can have 20 different tools for dashboards. You can write custom checks for how you validate that data is good. You can have custom jobs to compare two datasets. Even for comparing two datasets, I’ve seen probably around 50 different jobs over the last 10 years. So there’s nothing standard. There are no industry standards on how we can do a better data quality solution. That’s the first thing. And the second thing is that data quality tends to be an afterthought. Even now, version one and version two of the pipelines won’t be thinking about data quality. I’d be happy if it is part of version three or four, if at all. So data quality is an afterthought in the industry. That’s what I have seen.
So I think these are the two frontiers: we need to start thinking about data quality early on, because the investment in data quality pays off significantly, and hopefully we can have some kind of industry-standard solutions for data quality.
Honor And this is something we hear a lot, too, across different conversations with our guests about how we actually achieve data quality, this thing that is universally agreed upon as important. But when it really comes to how we do it, it’s tricky. It involves a lot of moving parts. And this is how we always wrap every episode: we ask our guests for a call to action, a tip on what would be the one thing you would recommend folks start implementing on their teams in order to achieve data quality. Maybe I won’t limit you to one. If you’ve got more than that, we’d be happy to hear all of them. So what would you recommend for teams that want to implement data quality
Vivek For the teams that want to implement data quality, I think the one thing is to understand the use case and find out what dimensions matter to you the most. If you’re starting from nothing, then don’t try to get perfect data first. Identify what good data means to you, because that’s going to really help you achieve it faster. With 20 percent of the effort, you can reach 80 percent. So identify that 20 percent and focus on that, and then you can focus on perfect data as a second step.
Harper If someone were to go out and start working with their team to better understand their use case to improve their data quality, who do you think would be the important people to collaborate with on that? Should it be cross-functional? Should it be solely within the data engineering team? Who does it have to be sourced from
Vivek from the consumers of the data, because they are the people who are actually going to be consuming this data. Whether it’s the research team, whether it’s other engineers on the services side, whoever is actually going to be working with this data, they are the first people to start having conversations with: what are the issues that they see, what are the issues that they care about, and why do they care about them? I think asking the why is the most important part: why do they care about this data being good along a particular dimension?
Harper I think that’s great advice. That fits into the whole democratizing that I talked about: everyone should have access to this data, and then once everyone has access to it, talk to the people that are using it, the consumers that you talk about. That way, you can understand the why that you just mentioned, right? Help me understand why it’s important to you, and not only why this data is important, but why this particular attribute or this particular dimension of the data is important to you. That way, I can ensure that it reaches the trustworthy level that you’re expecting. So I think that’s great advice. I love it, Vivek. This is an awesome episode. I really enjoyed your thoughts on this. It’s fun to talk to people about data quality who have been doing it for a long time, because I agree with you a hundred percent: the fragmentation that exists in the space can be frustrating. But also, it kind of is a new frontier, right? Because there’s so much area to improve and standardize and find ways to talk about data quality and how it works across industries. So hopefully everyone listening is enjoying it as well. But thank you so much
Vivek for your time.
Honor Yeah, thank you so much, Vivek. Loved learning about the use case of Netflix. Super cool. Thank you so much for coming on.
Vivek Yeah, thanks. Thanks a lot. Thanks, Harper. It was great having a conversation with you today. Thank you.
Honor Thanks so much.
Vivek Bye bye.