Defining Data Quality: Data SLA Nightmares & Lessons Learned

Databricks Sr. Staff Developer Advocate Denny Lee, Citadel Head of Business Engineering Vinoo Ganesh, and Databand.ai Co-Founder & CEO Josh Benamram discuss the complexities and business necessity of setting clear data service-level agreements (SLAs). They share their experiences around the importance of contractual expectations and why data delivery success criteria are prone to disguising failures as successes in spite of our best intentions. Denny, Vinoo, and Josh challenge businesses of all industries to see themselves as data companies by driving home a costly reality: what do businesses have to lose when their data is wrong? A lot more than they'd like to believe.

About Our Guests

Denny Lee

Sr. Staff Developer Advocate Databricks

Denny Lee is a Developer Advocate at Databricks. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has a Master of Biomedical Informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise healthcare customers. His current technical focuses include Distributed Systems, Apache Spark, Deep Learning, Machine Learning, and Genomics.

Vinoo Ganesh

Advisor Databand.ai

Vinoo is an experienced software engineer, architect, and startup advisor. He has extensive experience building data pipelines and leading data engineering teams. Most recently, he worked as CTO of Veraset, a data-as-a-service startup focused on understanding the world from a geospatial perspective. Prior to that, he spent a number of years at Palantir Technologies, leading both software engineering and technical implementation teams.

Josh Benamram

Co-founder & CEO Databand.ai

Josh is Co-Founder and CEO at Databand.ai. He started his career in the finance world, working as an analyst at a quant investment firm called SIG. He then worked as an analyst at Bessemer Venture Partners, where he focused on data and ML company investments. Just prior to founding Databand, he was a product manager at Sisense, a big data analytics company. He started Databand with his two co-founders to help engineers deliver reliable, trusted data products.

Episode Transcript

Honor Welcome, everybody. Thank you so much for joining us today for our talk on data SLA nightmares and lessons learned. I'm Honor, and I'm your host and Product Evangelist here at Databand. We're so thrilled to have a gathering of data experts with us today, so I'm just going to introduce everybody here. Denny Lee is a Senior Staff Developer Advocate at Databricks. He's also a distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for on-prem and cloud environments. Welcome, Denny. And then we also have Vinoo Ganesh; we're very lucky here at Databand to have Vinoo as an advisor. Vinoo is an experienced software engineer, architect, and startup advisor. He has extensive experience building data pipelines and leading data engineering teams. He was most recently CTO of Veraset, and before that he spent a number of years at Palantir Technologies, and he's also starting a really exciting opportunity soon, so I can't wait to hear more about that later. And of course, a lot of you already know Josh Benamram, our very own Databand co-founder and CEO. What some of you might not know is that prior to founding Databand, Josh was a product manager at Sisense, and he began his career in the finance world as an analyst at SIG and later at Bessemer Venture Partners. So we're gathered here today; we mentioned nightmares, but mostly the emphasis will be on lessons learned. Data-driven businesses around the world recognize that the cost of missed data SLAs escalates, and yet there is still almost a mystery around how to even make this achievable. So before diving into this discussion and getting all of our experts' individual experience on this topic, maybe we should start with some foundations and come to a common understanding. So let's start with what a data SLA is.

Josh I like the way that Vinoo lays this out. I'd love to hear him set things up with a definition.

Vinoo Absolutely. So if you look at historical SLA development, SLAs were measured in uptime, and usually five nines or six nines were the metrics for understanding that a system was meeting its SLA. Now, if you actually dig into that a little bit, these are software SLAs; the system isn't actually required to be even remotely functional. I've actually seen it in a contract where you can hit an endpoint and it'll 404, but that still counts as uptime, because there's nothing in there about the system actually being functional. It sounds really funny, and you may be laughing, but it's actually really scary in production. So the evolution of SLAs in the data space really came from the software space. Data has one additional component that software doesn't really have: functionality is opinionated, meaning functional, usable datasets can mean different things to different sets of people. So Josh and I actually framed this as a Maslow's hierarchy of needs for data, a spin on it where we focus on the fundamental baseline layer, meaning the bottom of the pyramid: did I even get data? Is the data in a non-corrupted, or uncorrupted, format where I can open it? Is it in the right format? And most importantly, is it even there? So it's easy to miss an SLA by not even getting a delivery, and have that be the kind of be-all and end-all. One level above that, we modeled what we call data longevity; specifically, a dataset is no longer a static, concrete entity. It will be amended. It will change over time. It will have updates, removes, deletes, and upserts, as they're calling it. So the challenge really is how this dataset evolves over time and whether it is usable over time. Does the cardinality of an individual column change or stay the same over time? Does the distribution of data in an individually bucketed column remain consistent? And are these actually predictable? So the predictability of the data is the driving principle here. The top part of the pyramid (there are only three layers to this pyramid thus far) is really how usable this data is from a business perspective: is it making the impact that we need? This is a hotly debated layer, largely because if the data is there, it's kind of up to the people, the analysts, to do what they want with the data. However, in reality, you as a data producer or a data manager need to actually ensure that your data is usable for the use case and for the outcomes the business needs. So the SLA in my world is some combination of the top-down elements of this pyramid, which largely differs from organization to organization.
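
To make the bottom layer of that pyramid concrete, here is a minimal sketch of the baseline checks Vinoo describes: did the data arrive, can it be opened, and is it in the expected shape? It assumes a Parquet file delivery read with pandas; the function name, path, and column set are illustrative, not from any particular product.

```python
# A minimal sketch of the baseline "did I even get data?" layer of the pyramid.
# Assumes a file-based Parquet delivery; names and thresholds are illustrative.
import os
import pandas as pd

def check_baseline_delivery(path: str, expected_columns: set) -> list:
    """Return a list of baseline SLA violations for a delivered file."""
    if not os.path.exists(path):                  # is it even there?
        return [f"missing delivery: {path}"]
    try:
        df = pd.read_parquet(path)                # can we open it (not corrupted)?
    except Exception as exc:
        return [f"unreadable or corrupted file: {exc}"]
    violations = []
    missing = expected_columns - set(df.columns)  # is it in the right format?
    if missing:
        violations.append(f"missing columns: {sorted(missing)}")
    if len(df) == 0:                              # delivered, but empty
        violations.append("delivery contains zero rows")
    return violations
```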

Josh I mean, it’s an interesting point. The how different SLAs are in different teams. And I think that that makes it harder to pin down what an SLA means for aid organizations. There’s definitely patterns there. Things that we consistently see as teams want to guarantee that the data arrives at a certain point in time that people aren’t waiting around for it. That the structure of data is what you expected today, and it’s not going to outright fail or break any pipelines or disrupt a dashboard or report that’s being worked on downstream the completeness of the data and how much is coming through. There’s got to be like patterns I think we see and organizations that we work with or that are using data banks. I think what’s what’s interesting is when we see the differences between organizations too, we go into one team that that doesn’t care whatsoever if data is laid, but they need to know to a really finite level of detail that the distributions within a table, the skew is really well-managed. And that might be that might be because it’s not so much a report that and customers are going in and looking at every day at the end of the funnel. It’s a data science organization that will get started and build their experiments based on whenever the data is available. There isn’t such a tight constraint on the timing, but the quality of the data is foremost. And so I think what’s interesting is like when the miscues start to happen and these definitions, but there’s definitely some common patterns that I agree with you on.

Honor So with data SLAs, it really seems like even though there are general patterns and agreement, it's still very use-case specific. So that is really what we need to look at in order to understand it. I'm sorry, Denny, go ahead.

Denny No, actually, you hit the nail right on the head. I mean, for example, if I was to pull from the past, when we had centralized database systems, the data SLA was more easily defined, because it was: is the database up? And that was it. That was literally the definition, and then we were done for the day, right? But in this day and age, the problem you have is that, first of all, we're talking about big data, microservices, cloud, and all these other things. It's not just a single database. The data can be from multiple sources going to multiple targets, which themselves are sources for multiple targets. So when you put all these things together, you end up having a very microservices-style design, or, you know, all the big buzzwords of DevOps and everything else kick in. But that's why the SLA becomes more important, right? Because you can only control so much of what's before you or what's after you. Otherwise, you'd have to control the whole system, and the reality is you can't control the whole system. So you actually have to limit that scope, at least to each individual team. Now, the teams could all build processes together to try to minimize the likelihood of a failure, but that's the whole point of the SLA in the first place, right? It still allows a team to say: OK, I'm expecting this data to come in with this latency, with this schema, and you tell me when you're going to change the version of the schema, or at least provide me v2 beside v1, so my subsequent systems don't break down and the subsequent teams can go ahead and actually provide their own SLAs as well. And so, exactly to your point, this type of thinking is very software engineering centric. So the good news is that the data SLA is a lot closer to the software engineering SLA now. The bad news is that data SLAs still have their own subset of concerns that are separate from what a software engineering SLA is, because now there's the data itself. People talk about treating data like your code; yes, from the standpoint that you care about it, you need to version it, and things of that nature. But the reality is that it also has its own state, its own set of properties, that come into play in a way that isn't there when you build software. Software, by definition, is a stateless system; data, by definition, maintains state, right? And so because of that, now you've got a whole different set of problems that kick in.
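
Denny's framing of "tell me the latency, tell me the schema version" maps naturally onto an explicit, checkable contract between teams. The sketch below is one hypothetical way to encode it; the class, fields, and violation messages are illustrative assumptions, not a reference to any specific tool.

```python
# Hypothetical inter-team data contract: the producer declares latency and a
# schema version, and the consumer validates each delivery against it.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class DataContract:
    dataset: str
    schema_version: str        # producer announces bumps (v2 beside v1) in advance
    columns: dict              # column name -> expected dtype string
    max_latency: timedelta     # how stale a delivery is allowed to be

def validate_delivery(contract: DataContract, delivered_at: datetime,
                      schema_version: str, columns: dict) -> list:
    """Check one delivery (with a timezone-aware timestamp) against the contract."""
    violations = []
    age = datetime.now(timezone.utc) - delivered_at
    if age > contract.max_latency:
        violations.append(f"latency SLA missed: delivery is {age} old")
    if schema_version != contract.schema_version:
        violations.append(f"unannounced schema version: {schema_version}")
    if columns != contract.columns:
        violations.append("schema drift: columns or types changed")
    return violations
```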

Vinoo So I think one interesting thing here, and something I would ask on my team as well, is: who gets paged? Are you actually paging a data engineer when, let's say, the cardinality of a column has changed? Are you paging a data engineer when the system is completely up? Are you paging a data scientist or an analyst whose analysis has changed? Are you paging the relationship manager of the upstream vendor who's giving you this data? They don't like being paged, but we actually put them on our pager rotation. So it's really a question of who gets paged. And this is why the definition of an SLA violation is so interesting, because I can have a completely functional system and still have the SLA be violated. And that's actually the pretty big step here, where, you know, in the past the Spark job would fail, or I wouldn't be able to call an endpoint anymore. Now everything's working; there is no job failure; the DAG has run perfectly; but hey, it's not doing what I want it to do.
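
One way to picture the "who gets paged" question is a routing table from violation type to on-call role. The mapping below is purely illustrative; the violation names and roles are assumptions drawn from the examples in the conversation, not anyone's actual paging policy.

```python
# Illustrative routing of SLA violation types to the people who get paged.
PAGE_ROUTING = {
    "job_failure":            "data_engineer_on_call",
    "cardinality_change":     "data_engineer_on_call",
    "distribution_drift":     "data_scientist_on_call",
    "upstream_late_delivery": "vendor_relationship_manager",
}

def who_gets_paged(violation_type: str) -> str:
    # Fall back to the data engineer for anything unmapped.
    return PAGE_ROUTING.get(violation_type, "data_engineer_on_call")

print(who_gets_paged("upstream_late_delivery"))  # vendor_relationship_manager
```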

Denny Piggybacking off of Vinoo's statement about the 404, right: it's great, like, no, my data system is working perfectly fine. You know, it's all nulls, but it's perfectly fine, right? And so, yeah, you hit your SLA from the standpoint that the system is up. But from the standpoint of the data, you've corrupted your data. I'm pretty sure that should break some form of data SLA, last time I checked.

Honor So let's ground this. I'm curious, with each of your experiences in your current and prior data teams, where has this happened, right? Like, everything seems to be working, but it's not. Tell me more about that. What does that actually look like on the ground?

Josh So I can offer a perspective on cases where things weren't working and we very well knew they weren't working. You know, I can only imagine the number of times that we thought things were working and they weren't. But I think of a few really nightmare scenarios for us at my previous company. I will not name the actual company, but you know, it's on a short list. In one of my previous experiences, I was working with a product organization. We had a team of data engineers working on pipelines that were mostly in charge of pulling data from our product. That was really the most important information source that we had, and this would give us insights into things like who was using our system and how many users were engaging with different features. And we would use that information to decide whether or not to invest in more capabilities of our product in a certain area, like whether we should build out a feature and, in some cases, whether we should deprecate things. Now, these were not the most complicated pipelines, but we did have a lot of systems that we needed to interface with. We were, for example, mashing together data, joining data from Salesforce, so customer data, with data from our product tracking system. So things got a little bit complicated in there. But the problems and the friction in managing these pipelines just cascaded to a point where it was pretty routine that data would go offline for days at a time. We would be waiting for a full week to understand how many customers were using a given product feature; that was fairly routine for us. There was one scenario where our organization needed to make a quick call about a feature deprecation, and in order to do that responsibly, we of course went in and checked how many clients were using the feature. But the data was down, and we didn't have any available information for it. We tried to build some extrapolation. We called our CS, our customer success people and support engineers, and we were asking around, you know, how many people are using this integration point? And we got the answer: well, I don't think anyone's really using that; we never see it coming up. And this feature was really, really hard for us to manage; the source provider was always changing their API on us. So we decided, OK, you know, it seems safe, let's turn it off. It seems like there's no one actually leveraging this capability. A few days later, the messages started streaming into Zendesk. What the hell happened to our data? Where is it? This is why we purchased your solution; we're beyond the test drive. This actually led to a number of our clients churning off of the system because of that data just being turned off without us really having any good explanation for it. Had the data been online, had that data been delivered for us, we would have seen it wasn't a crazy number of users; it was probably half a dozen organizations that were using the service, but they were very invested in that source of information. Had we known about that, we would have been able to make the right call: keep this feature online, or at least run some gradual deprecation process that managed expectations with the clients. So that was one nightmare, or one battle wound, that I have from my track record.

That showed me how important it is to have these kinds of SLAs really well determined with our clients: here is the data that we expect, here is the data that you should expect, it's written into the contract, and we know it needs to be delivered in a certain amount of time. And having an informal contract even internally within our organization, between engineers and product people, would have also helped to make sure this was avoided in the future. But that's something I can think of from my own experience.

Denny Your story definitely brings up this concept of the importance of instrumentation, to put it rather lightly, right? Because you talk about the SLA, you talk about the data, and I completely agree a thousand percent. But what ultimately happens is that it also requires your software engineers, your data engineers, to actually instrument correctly and actually care that the metrics from their instrumentation are right. So I'm not going to mention names, because you could probably guess, but in my past life, and I'm going way back here, we had built an analytics service, right? And it was really interesting: the folks that originally built it had come from a very different environment. So they were building the instrumentation old-school, like .NET logging; it would just log everything. And when I say everything, the log that was generated was larger than the actual data being tracked; it was just that level of insanity. And because of that, we could never process the logs fast enough to figure out what was going on. I'm going to make an example up, but it's close enough to the real scenario, and it's exactly to your point. They were like, oh, nobody's using this thing; count queries for your web analytics reports, to use that as the example, right? And me, having set up the product, I'm going: no, I'm pretty sure people use that. In fact, in web analytics, that's the number one thing they need to know. I mean, yes, it's where we get the most confusion, because people sometimes can't tell the difference between a count and a distinct count, but still, it is easily the number one thing you have to do. And they're like: well, from our instrumentation, nobody's using it. Long story short, they shut it off. Same idea: complaints came streaming in. And it's not like we didn't have the data. In this case, we actually did; we just had so much of it that they didn't know how to query it. The data said very clearly: oh, by the way, the number one feature used by every single customer was this count query, like, number one, right? But they hadn't instrumented it correctly, so they couldn't figure that out. Then, flipping over to the other side, and this is actually a Databricks story, so I can actually talk about it: we did instrument this correctly. So, for example, one cool story from a product management perspective, something I originally thought would be a nightmare until it turned into a super positive thing. This is a while ago now: we had finally released SparkR to work in our notebooks in Databricks. And we're going: a lot of people asked for this, a lot of people start it up, but they're not really using it; what's going on? But because we were instrumented properly, we recognized: oh, they're all getting stuck on the installation of these five modules. We found that out within the first, I want to say, 12 hours; literally, it was just telling us. We anonymize, obviously, but the idea was that we could just tell right away.

It's like: OK, people are trying to install this, and they're failing the installation because those installation instructions are a little tricky. OK, no problem. So we just automatically pre-installed those five modules, and then a whole bunch of people went ahead and used it, right? So it's exactly to your point: we talk about data SLAs, but what people often forget is that, in order to achieve that, you have to instrument properly. And that means you yourself, the data engineer, the data scientist, actually have to care about these metrics, so you can interpret them correctly and do something positive with them, as opposed to making a horrible choice and then pissing off your customers or churning them off.

Josh So you would have recommended for our engineers that they were actually tracking the same metrics that we were tracking from the consumption side of that? Is that...

Denny what? You’re Absolutely. Yeah, yeah, absolutely. I mean, my attitude about like giving

Josh ...them access to our analytics reports? So how do you imagine they're doing that?

Denny All of the above, actually. I think each team is going to do it slightly differently. There is always the dream of a centralized system for centralized reporting, and I think larger organizations can get away with something like that, because they actually will have teams dedicated specifically to building it. I think most teams are going to try to leverage an existing environment, an existing system, and that's fine. I mean, my attitude is that they should all try to standardize on something, so that whether it's the product manager, whether it's the engineer, whether it's customer success, whether it's frontline support, it doesn't matter: they're all looking at the same thing. That's the key thing. They're all looking at the same information, so they can all agree: OK, that's actually what's happening with our customer, versus, that's not right.

Vinoo I think the interesting thing here is that "metrics" is such an ill-defined term. And actually, I have to tell the story, Josh, because this actually happened. So I was giving Josh feedback about his product back in May, I guess a few weeks ago, and I was running a data-as-a-service company. That company was actually pretty interesting because we were both data consumers, in that we had data coming in from providers, as well as data providers ourselves. So we saw this duality of data coming in, us cleansing it, doing magic, and data going out. And we were talking about these metrics because Databand was unveiling a new feature, and I was actually on Zoom with Josh, and he was like: well, show me how you would actually figure out whether something went wrong. On a screen share, with no proprietary or sensitive information shared, of course, I opened up our Databricks notebook that was running, and I was like, well, usually that would be a problem here. As I was looking at it, it was like: job runtime, five hours, five hours, five hours for every preceding day, then like 11 hours, then like 12 hours. I was like, oh man, I had no idea something was going wrong. And I mean, how many people are logging in to check? If the Airflow job succeeds, you just assume that everything works. Which really goes back to the point of, as a data practitioner, what does that mean, and what level of that hierarchy of needs I defined do you want to be operating at? And in some cases, you know, Honor, you asked about what happens when things go sideways. Well, I bring up cardinality because we had an incident where the cardinality of a column all of a sudden vastly changed, which affected the partitioning of the dataset that was written out and resulted in, effectively, an out-of-memory on open. We delivered data that went to customers who had jobs configured with certain memory, and all of a sudden they couldn't open the files. And then the question really comes: well, we were fast, we were able to fix it and remediate it quickly, but then there's a bunch of open problems. Who tells our data provider? Who actually either rewrites or pays for the cost of rewriting the data? How are we supposed to know the heap sizes or the memory sizes of our customers' jobs? So all of this comes back to: when things go wrong without an explicit definition, they go very, very wrong. And that definition is exactly what should be in these metrics, everything from: OK, the distribution of this column has now changed, and all of a sudden it's one standard deviation away. Well, someone has to be recording the historical context of what the values in that column were. It's just as easy to say that some customer somewhere will be annoyed when we inadvertently add more data to the dataset and the distribution then changes even more, but positively. How do you actually build those types of changes, those expected changes, into your historical tracking system? This could be data versioning; people version code, and versioning data is the next logical step here, and there are projects doing exactly that. But yeah, the side where this goes sideways really comes from a lack of definition and the lack of clear, not just contractual, expectations between the providers and the consumers.
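
The runtime pattern in that story (steady five-hour runs, then 11 and 12 hours) is exactly the kind of thing a simple historical check can catch. Below is a minimal sketch, assuming you already record one number per run (a runtime, a column cardinality, a distribution statistic); the one-standard-deviation threshold echoes Vinoo's example and is tunable.

```python
# Minimal sketch of historical tracking: flag the latest value of any recorded
# metric (runtime, cardinality, distribution stat) that drifts from its history.
import statistics

def is_anomalous(history, latest, n_sigma=1.0):
    """True if `latest` is more than n_sigma standard deviations from the mean."""
    if len(history) < 3:       # not enough history to judge yet
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return stdev > 0 and abs(latest - mean) > n_sigma * stdev

# The runtimes from the story: steady five-hour runs, then an 11-hour run.
runtimes_hours = [5.0, 5.1, 4.9, 5.0, 5.0]
print(is_anomalous(runtimes_hours, 11.0))  # True: someone should get paged
```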

Honor It’s also the communication, right? I feel like the piece that slides almost force people to actually have a conversation about what is important to you and what’s important to me. And seeing it in a context that is up to this at the end of the day will have this impact on the business. And this is the total cost of not doing this right. And just the cascading failures that I’m hearing from all of your experiences, like can we put a price tag on this? Like, what does it actually cost when this is not handled properly?

Vinoo I have to jump in here, because I think this is the story I was talking about before. The hardest interview question I've gotten asked, for my next opportunity, is: how do you quantify the value of data in your organization? This was for a CTO role at a pretty big company. And it was incredibly hard. There are so many factors in there, and it can come from anything: the cost of compute, the cost of acquisition, the cost of the impact or perceived impact on the business. And depending on who you ask, it's a very, very difficult metric to actually capture. However, it's so pivotal for you, as a data practitioner or data engineer, to be able to quantify the value of data, such that individuals in the organization are not only able to understand why it's important to set up these monitoring systems, but are also able to conceptualize the cost of an SLA being missed. You know, I can talk about examples from Palantir and everywhere else, but I don't think anyone needs an example of what happens when data that should be there is not there, or even worse, when data that should be there is there, but it's incorrect or incomplete. These situations are incredibly challenging, and you, as the individual arguing for not just these resources but the response aspect of them, need to be able to quantify, at some fundamental level, the impact of data for your business. That said a bunch of buzzwords; the next logical question is, how do you do it? And I think right now the reality is that it's going to differ from place to place. The quantification of a missed SLA in software is still open for debate and open for questions. The same applies to data, except it's multidimensional. It's no longer just: is the data there? It is: is the data there, is it usable and evolving over time in a way that makes sense, and is the business impact still there? So, actually figuring out the metrics to quantify the value of that data: the only way you can get there is by figuring out the metrics of the data itself, which is exactly what Denny was mentioning before. The key metrics that quantify the value of a dataset are incredibly interrelated with the metrics of that dataset.

Josh I think there’s two different kinds of companies that will work with, and that one is a lot easier to measure. One’s one’s a lot harder to measure the value of the data. One kind of company that we work by, the data band data is their product. Right? So like if you turn off their data, that’s customers that aren’t getting what they’re paying for. That’s kind of a very clear tie to value because you will start losing customers. So like in my previous experience where I was mentioning we had some clients churn, I can quantify that pretty quickly. Let’s say these clients were all paying about thirty thousand dollars to our solution. Part of that churn, we’re talking about a six figure loss to the business right annualized, you know, every year. So these kinds of discrete events accumulate to like a clear value of a data set because there’s someone else on the other end paying for that data say you turn it off, that’s revenue out or into the company. The other kind of client is a lot trickier to measure, but data might be just as important to them. And that’s a company who sells any kind of widget. But data is fundamental to how they drive their business, how they make decisions. So we’re a shampoo company. We create cosmetic products, but we decide what products to invest and what geographies to go into based on data that we get from our market. A little harder to quantify there, but there is the kind of dotted line, you know, we saw from our data that people in New York are a really good market for us to sell our shampoo. We decided to invest there. Now we saw that relate to 20 million more dollars a year in business. That’s tied to a data decision so you can start to back into those kinds of value propositions that way. I think another really fun way of doing it is turn off your data for a day and see how many analysts and data scientists are raising flags to figure out how much you’re paying those folks, how long they’re going to be with the company. When you keep turning off your data like that and you’ll have another way of kind of measuring the impact there, don’t actually recommend anyone does that. But if it happens, it happens and you can you can sort of back into your your your values.

Honor That’s a riskier style of measurement.

Denny Yeah, and look, don't forget there's also the context of data breaches, right? At least in the financial and healthcare markets, but this is starting to happen even outside of those markets, there is a fiduciary responsibility: every breach that happens costs you X number of dollars, right? So whether it's the breach itself, whether it's the app, or, just like you're saying, Josh, whether it's the data service that you're providing, the reality is, exactly to the point about being multidimensional, it's not just about whether the data is there. Is the data even correct? Is the schema correct? Is the data corrupt? Is the data actually representative of what reality is? Say a retailer is trying to do predictive analytics to decide: I'm going to go into this market in Central Europe, and I'm going to introduce this flavor of candy bar, whatever, right, and remove another one. Well, they need all that data to actually figure out how much money they're losing and how much money they're gaining by doing such a thing. So all of these things add up real fast. And the reality is, at every single major company right now, it's sort of understood that whether they need it or not, their data is now part of their actual company; it's actually part of their service. That's the whole reason why there are privacy advocates, rightly so; I'm not knocking them, quite the opposite, I can go on for hours about the need for privacy in its different forms. But the context is that if you're going to survive as a company, you're going to have to know what to do with that data and how to protect that data, right? And the reality is, the only way most people are going to grok that or understand it, exactly to your point, is if there's some financial measure. Right? That's the only way. So that brings us right back to the point: metrics. How do you define this in the first place? Are you actually spending the time to figure that out? Because if you do spend the time to figure it out, then even if it's not quite correct, you can at least estimate: OK, I'm losing X millions of dollars because of this, so let's not do this; now we know what we should be investing in, right?

Honor Yeah, and I recall you mentioning before we started that sometimes there are SLAs that are like a twist on SLAs, right? They're not your traditional SLAs, but they are a set of success criteria. Using Josh's example, there are businesses where there is a very explicit business value for data, and then there are other organizations where the data is embedded in the business, but it's still important to meet those criteria. How can we implement a degree of metrics or measurement for the SLAs within the SLAs, for the more complex use cases?

Denny Well, absolutely. So, for example, the migration scenario is actually where the SLA ends up twisting on itself. At one of my past companies, they were offering a service that was dependent on the data, so they obviously had an SLA to provide that data. But then we also needed to do analytics on that data, and we were doing a migration from basically thousands of SQL Servers to doing the analysis inside a Hadoop cluster. OK, so that's cool, because that way we could actually do the analysis. The problem was that if we requested the data too quickly, we would impact the application itself, in other words, the ability for SQL Server to give the customer their data. So we actually had two sets of SLAs. One was that we had to make sure that, as soon as a request was made, within a few seconds all the information was available to the customer. From the analytics perspective, we had an SLA, no joke, that we would only use one percent of the CPU on the said SQL Server. That was our SLA. That's an odd SLA, but we had to make sure we were not beating up the system. So we literally used Spark and then ran it in a way to slow it down. We went out of our way, it was the weirdest thing ever, but we basically designed our Spark job to be excessively slow, so we would use little to no CPU extracting the data out of the SQL Server. That way we could still meet, interestingly enough, our four-hour, six-hour SLA in terms of analysis, but we also hit the SLA of: yeah, you're only using one percent of the CPU. So that was a funny one.
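
For flavor, here is a minimal sketch of what deliberately throttling a Spark extraction can look like: one JDBC connection and small fetch batches, so the source database does very little work at any moment. The connection string, table, and output path are placeholders, it assumes the SQL Server JDBC driver is on the classpath, and a real CPU budget like Denny's would also involve scheduling and monitoring on the database side.

```python
# Sketch of a deliberately gentle Spark JDBC extraction, in the spirit of the
# "one percent CPU" SLA: a single connection and small fetch batches.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gentle-extract").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://example-host:1433;databaseName=appdb")  # placeholder
    .option("dbtable", "dbo.events")                                         # placeholder
    .option("numPartitions", 1)    # one connection: no parallel hammering
    .option("fetchsize", 500)      # small row batches per round trip
    .load()
)

# Land the data for downstream analysis without touching the source again.
df.write.mode("append").parquet("/data/landing/events")
```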

Honor Very interesting. I think it all comes down to that definition, coming up with what that set of criteria to determine success is, bringing it back to the conversation about...

Denny What I would...

Josh That's interesting to me. Would you consider that to be an SLA based on, like, a business metric? Does that CPU one...?

Denny Absolutely. And that’s not right. Yeah, yeah. Yeah, yeah, because basically

Josh ...from the data side, it doesn't really have anything to do with the data itself. It's kind of about the project, about other systems that the organization runs. Does that fold into the metrics discussion that Honor was raising before, in your mind?

Denny Yeah, absolutely. Because if you think of it from the application perspective, they had designed their application to make sure they would max out at 80 percent CPU utilization for the applications themselves. Because of that, we knew we wanted X percentage reserved for the OS itself. So that's why they were able to come up with the one to two percent of CPU cycles we were technically allowed, though we were trying to make sure we hit one percent anyway. That way we were all within our SLA in terms of not bogging down the server. Then there are the metrics, of course, for the users: whenever we were loading the data, the users' requests had to stay within the standard deviation of request time that they already had, so we had basically no impact on the standard deviation whatsoever. And we achieved that. At the same time, we still had to pull the data fast enough, or schedule it in a particular way, so that we would hit our four-hour SLA. I mean, that was a more aggressive one; honestly, we probably had 12 hours, really, but we were being super aggressive, so we would at least mentally say, no, we're going to try to hit four hours. That way, from a global dashboard perspective of seeing what's going on with our customers, we were, in the worst-case scenario, four hours delayed in knowing what was going on.

Vinoo One of the challenges here really comes down to whether we're looking at a business SLA or an engineering SLA; generally a business SLA is tied largely to an engineering SLA, especially in resource-constrained environments. We've all run Spark jobs; in the old PySpark days, you'd do a collect to the driver, boom, you've got an out-of-memory, and the thing would crash. And you were on a multi-tenant system like YARN, which back in the day was putting different drivers on the same box and bin-packing them in that magical way. What that really comes down to is that every engineer is on a shared resource in these systems, especially in the cloud. We may not see the internals of how EC2 is distributing resources unless you get dedicated hardware, but we are all sharing systems; we're all sharing resources. And so a constraint like, we can't hit this Oracle Server or SQL Server, I forget which one it was, at more than five percent, is actually really reasonable if there's another application hammering that thing with, you know, five hundred thousand persistent connections to the database. And so the challenge really comes from the fact that you, as a data engineer, as someone who's trying to extract value from data, are also working on antiquated systems, and you need to understand not just the business SLA (I mean, if the database is down, there's no business anymore) but the engineering SLA, and why they've put those in place. I'll give a quick story here. I was working at a customer site, on-prem, and every time we would run our Spark job, it would just magically be killed. We tuned it, we performance-optimized everything we possibly could, and it would just exit with code 1. We were not admins on the boxes; we didn't know what was going on, and it took us a few hours to figure out there was an OOM killer on the box. If memory ticks up and a process takes too much of it, they don't want anything to happen to their core business applications, so it just kills the process. And if you think about that, it's such a contrived way of actually solving an issue. But hey, it solves the issue, and there's a reason it's actually there: you can have kernel panics if you overuse resources and bring down entire systems. So those business and engineering SLAs are actually very intricately tied. And that's why I always think of the data practitioner kind of as an ER doctor, in that they need to know and understand everything that's going on, but solve the immediate problem first. If a patient comes in and they're bleeding out, the first thing you want to do is stop the bleeding, not figure out, oh, they're predisposed to diabetes, they should run more or do something else. Just figure out the immediate problems at hand. And that's where, especially with these SLAs that are defined and kind of just handed over, it's really up to the individual operating on those systems, in this case someone like Denny, or anyone at Databricks, to figure out: hey, what do we actually have to do, and why is it this way? Which is challenging, because now there's a relationship aspect of it, building support from the internal IT organization; there's a business aspect of it, which is, well, if I piss off my business users, they're not going to come back and buy from me; and there's also this hybrid, which is: when things go wrong, what does the incident response look like?

And the last thing I'll say here is that understanding where you're operating as a data practitioner is everything. When I say where: are you on a customer's cloud? Are you in their enclave, or are you on-prem? Are you in some multi-tenant system? All of these require fundamentally different sets of SLAs and fundamentally different systems. So when you mention the business versus engineering SLA, I think it's such a challenging problem because, depending on where you're deployed, so many different things can happen.

Denny I want to add to that, Vinoo; that's a great callout. For example, I still remember when I was building some large systems back in my Microsoft days, we were always trying to achieve five nines, right? And I think you were the one who corrected me: five nines means about five minutes of downtime a year, something like 30 seconds a month, right? But anyway, the context is that you've got five minutes, basically, to work with for the entire year. So when you have an on-prem system, yeah, you could theoretically do it internally by doing active-active clustering and load balancing and all these other fun things, because you have full control of everything. But then it's like: OK, but do you control the internet connection to your data center? No. Well, then guess what you're going to do: that's why you'd go active-active globally in order to achieve that goal. And then it's like, oh yeah, we solved that with the cloud. OK, so you need five nines in the cloud; are you sure you need that? Yeah, we're so sure. Well, then you're multi-cloud, right? Because depending on what you're telling me, a single cloud isn't going to guarantee it won't go down for five minutes a year. I can go on with hours upon hours of articles about times AWS and Azure went offline, and that's not a knock on them; it's a hard problem. So when you define things, don't just go with the buzzwords. OK, maybe you do need five nines; well, then that means you're multi-cloud right from the get-go. Or do you really need five nines? Maybe that's fine, and maybe you only need regional availability in a particular cloud. But the point is, people actually have to go through the exercise of understanding what they're defining, as opposed to just arbitrarily saying, oh, we have global availability. What does that mean?
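
As a quick sanity check on the availability arithmetic in that exchange: the allowed downtime is simply the total time multiplied by (1 - availability). The snippet below is just that back-of-the-envelope calculation, nothing product-specific.

```python
# Back-of-the-envelope downtime budgets for N nines of availability.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in range(3, 7):
    unavailability = 10 ** -nines            # e.g. five nines -> 0.00001
    down_min_per_year = MINUTES_PER_YEAR * unavailability
    down_sec_per_month = down_min_per_year * 60 / 12
    print(f"{nines} nines: {down_min_per_year:8.2f} min/year, "
          f"{down_sec_per_month:7.1f} sec/month")

# Five nines comes out to about 5.3 minutes per year, roughly 26 seconds a
# month, in line with the "five minutes a year, ~30 seconds a month" figure.
```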

Vinoo I really want to highlight that, especially the challenge of: if you don't understand the SLA you're being given, there's probably something behind it. Usually, after only a few years, you'll know what makes sense; five nines is common. But if you're being asked, hey, you can't use more than thirty-two gigs of heap on this one-terabyte-memory box, there's something in there, and understanding exactly what's going on is incredibly important. So actually figuring that out, and having those relationships with your counterparts, is incredibly important and is actually the differentiator. So many people come in and, you know, it's easy for data engineers to just focus on optimizing a Spark job over and over and over. But in reality, if you're optimizing a Spark job for something that could have been solved in a five-minute conversation, like, hey, can you turn off the OOM killer, our job can't run, then you're spending a lot of cycles on something that may not necessarily be worth it.

Josh Well, I’m still really fixated on Dennis as a way of reducing the load on the production database. I think what’s interesting about it and I think what it really points out because that feels so different than my use case for my previous company and what we would have looked for with our SLA of make sure the data is available and on time and make sure that you can engineer is like watch our our KPIs on the business side. How many people are using that feature of the products and know that that KPIs being delivered, it feels so different than that performance angle. I think what it points out is like the SLA, it comes from dialog with your customers, with your stakeholders, the people that are outside the engineering organization and deciding first who your customers are or who’s going to be impacted, like who’s at risk of making your life hell if something goes wrong. Those are the kind of conversations that need to happen first to even determine what the SLA criteria are going to be and what your starting points are. And that’s going to feel different and different projects. If it’s a cloud migration, it’s going to be way different than a production pipeline, which is going to feel different than a training process and a machine learning organization. If things are going to be very, very unique, according to queer customers, are, I think, deciding that first is the right way of kicking off the whole SLA discussion in the first place.

Honor So we have a few minutes left, and I wanted to make sure that we end on a note of: what are some action steps that can be taken? What would be our call to action with everything that everyone has shared? What do we need in order to implement all of these changes, so that you are able to deliver on all of these different use cases? Is it executive support? What does that look like?

Denny I can probably chime in first on this one, at least from where I'm sitting, and especially from my past: definitely leadership support, right from the top. The way I phrased it in one of our chats was that, if I was forced to choose, I would rather have leadership care about this and actually bother to invest than have a few uber-experts able to solve my problems. Now, obviously, I would prefer both: a couple of people who are experts, and leadership that cares. But if I had to choose, I would choose leadership, because the problem when it comes to these types of issues with data is that too many times we are dependent on heroics, dependent on somebody to come in and save the day. And that's not how you're supposed to build a scalable system, right? I mean, let's go backwards and talk about how you build for scale. Building for scale means everybody's involved. It doesn't matter if you're customer success, it doesn't matter if you're customer support: we're all thinking the same thing, we're all in the same place, right? And that only comes from the top. For example, I still remember one story from my past life where the application servers went down. I was dealing with the data side, and we were fine, but the application side wasn't. Because I happened to understand how that system worked end to end, I was there doing the triage, even though I was technically leading another team; I figured I'd help out. And there's our COO, the chief operating officer, sitting with us, and he's a cool guy. He was just like: hey, do you want coffee, do you want pizzas? Because his attitude was: I know that I can't fix it, you folks are going to fix it, but he wanted to be there and suffer with us and help out, right? He cared enough to know that we were all staying up late at night to get the system working. And because he saw that, he subsequently turned around and said: got it. Now what do I need to do going forward so you will never need to do that again? And so for me, sitting through that experience, that was easily the best experience, because it meant: yeah, maybe we suffered through all of that, but our COO was going, no, I get it, my bad, you were trying to tell me this before; now I get it, and I'm going to make sure there's enough money allocated for this. And then we had enough time for training, and it resulted in a wholesale difference in how we trained all of our engineers and built all of our systems. That, to me, was crucial for success. So if I can leave y'all with any one key thing, it's that it has to come from the top. They all have to understand it and believe it, because if they don't, what ends up happening is, again, heroics. And yeah, you can probably do that for a while, but invariably it's not scalable and it will fail.

Vinoo I really want to highlight the piece about heroics here. Denny and I were kind of tongue-in-cheek discussing this; the model I've heard used here is the mercenary versus the general. You can have a general actually come in with their army and do what needs to be done. Obviously, there are very important, very legitimate mercenaries as well, but there's one unique thing about mercenaries: they tend to work alone, and they're very, very expensive. So I was kind of joking with Denny that for maybe 80 percent of our time at our various companies, we're not strictly needed; our skill set is not strictly needed. Obviously it's needed in some form, but it's that last 20 percent, when things go really wrong, that they need someone who's seen this before to come in, be hands-on-keyboard, and fix whatever is going on. And that's really interesting from a business perspective, because you're almost hoarding these resources in such a way that when you need them, you can deploy them. But as Denny said, it's not scalable at all, and it's actually very expensive. So the question becomes: how do you actually build the general, then? And for me, there's only one real answer for what we do moving forward: metrics. A decidedly unsexy answer, but understand the actual data that is coming in; understand what your expectations for the data are. Denny and Josh both said instrument, instrument, instrument. Make sure you have, before the fact, actually understood what the expectations of your pipeline are, and make sure that you have a historical perspective, such that if or when things go wrong, you can rewind time. So if there's an action item, I would say: talk to whoever needs to be involved and actually ask the question. Hey, data scientists, hey, analysts, what do you care about in this data? Hey, data engineers, what do you care about with this data? Software engineers, what do you care about? DevOps, IT, customer success? Have those conversations, come to a common understanding of, in this case, at this time, and in this configuration, this is what your team collectively cares about, and instrument for it.

Josh I would say two things. First: don't try to take on everything yourself. Get help from companies and vendors and other organizations that have been able to build up the pattern recognition. I think that's really important as we create standards in this area for the first time. So whether it's Databand, which, of course, I would love it to be, or a different standard, or any other company that you think you can bring in to partner with you and help bring in better standards, I strongly advise teams to do that. That, of course, will come from executive buy-in and from making sure that there's a real perception of the value in your organization and what you contribute to your business. The second piece I would add is: know your customers, know your stakeholders. Poke your head up outside of the core engineering organization, know who it is in the business that you affect, and understand what their main pains are, because that, again, is going to be where your SLAs are derived from.

Honor Awesome. Thank you all for your time today. This was such a fantastic conversation, and it's so helpful to have action steps and tips to follow up on. I want to thank everyone who joined us today for the session. We are going to be releasing more of these discussions in the future, so definitely stay tuned. Thank you, Denny. Thank you, Vinoo, for being a part of this conversation. Thank you, Josh. And we will see you again very soon. Thanks, everybody. Bye.

Vinoo Thank you so much for watching.

 
