Honor Hey, Harper, how’s it going?
Harper Hey, Honor. Things are going pretty well. Excited to have Gary Cheung here from Eventbrite with us. Gary, you want to take a quick second and introduce yourself, a little bit about your background and how you got involved with data management and data quality?
Gary Yeah, yeah, sure. So my name’s Gary Cheung. I’m a Staff Analytics Engineer at Eventbrite. I’ve been in the data space for about seven or eight years now. I’ve worked across a lot of different industries. I was the first data engineer hire at Capsule Pharmacy. I was the first data engineer at [company name unclear]. I’ve worked for large health insurance companies like Cigna and investment banks like Barclays.
Honor Very cool. Well, we wanted you to come on today to talk about a very specific topic. In fact, something that Harper and I discuss frequently is this idea of the single source of truth. And Gary, you had a really interesting perspective on it, coming from the angle of what happens when that source of truth is either unreliable or you can’t really gauge whether it is true. So we really want to pick your brain around what this means for your day to day.
Gary Yeah. So typically, when we look at data engineering, we think about pipelines and we think about the tech side of it: orchestration with Airflow or different types of orchestration tools, and managing big data. But one of the biggest frustrations as a data engineer is how reliable your source data is. The source of truth, depending on the organization you’re in, could be third-party data from a vendor, or it could be data coming directly from your production application. When you can trust that data, when you know it’s reliable and an analysis on it comes back with valid results, that’s a great situation. But sometimes there are problems with that source data. One example is systems where the data is manually entered into the production application, so you’re getting data that has typos in it, or the end user misses an entry for specific information, and you have to catch all those anomalies. Other times, you have vendors who send datasets that are missing a chunk of the data. When that happens, it can be difficult to catch, especially when your production engineers or third-party vendors don’t tell you that these issues are in the data. Usually, in the worst cases, you find out about these problems later down your pipeline, when your analysts are running queries on the data and they’re like, Wait, we’re missing a chunk of data from February, or, Wait, this entire column here is completely unusable, it’s missing or there are issues with it. So it’s a pretty common problem, and it’s actually what I think most data engineers spend a decent amount of time trying to solve.
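The two failure modes Gary mentions here (a missing month of data and a column that is effectively unusable) can be caught before analysts run into them. The following is a minimal sketch in plain Python, not something described in the conversation; the function names, the `order_date`, `amount`, and `region` fields, and the 50 percent threshold are all hypothetical choices for illustration.

```python
from datetime import date


def missing_months(dates, start, end):
    """Return (year, month) pairs between start and end that have no rows at all."""
    seen = {(d.year, d.month) for d in dates}
    gaps = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        if (year, month) not in seen:
            gaps.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return gaps


def unusable_columns(rows, max_null_fraction=0.5):
    """Return column names whose share of empty values exceeds the threshold."""
    if not rows:
        return []
    flagged = []
    for col in rows[0].keys():
        nulls = sum(1 for r in rows if r.get(col) in (None, ""))
        if nulls / len(rows) > max_null_fraction:
            flagged.append(col)
    return flagged


# Hypothetical sample: nothing arrived for February, and "region" is never filled in.
rows = [
    {"order_date": date(2022, 1, 15), "amount": 10.0, "region": None},
    {"order_date": date(2022, 1, 20), "amount": 12.5, "region": ""},
    {"order_date": date(2022, 3, 2), "amount": 8.0, "region": None},
]
gaps = missing_months(
    [r["order_date"] for r in rows], date(2022, 1, 1), date(2022, 3, 31)
)
bad = unusable_columns(rows)
print(gaps, bad)
```

Running checks like these at ingestion time, rather than at query time, is one way to move the discovery of such gaps upstream of the analysts.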
Harper Yeah, you literally described my nightmare scenario as a data engineer, where I spend so much time trying to find how human error can be injected into my process, and then how I prevent that from occurring, right? You talk about how an analyst can enter something into an application web page, or vendors are providing data and who knows what their process looks like to actually populate that dataset at the end of the day. So when it comes into your system, not only are there processing and logical flaws that potentially exist in your code, but how do I find every edge case where a human has interacted with this data, so I can ensure that it’s sterilized to some extent, right? But taking one step back, just for a little bit of level setting: when we say the source and we say the source of truth, those things are a little bit different. Would you agree? And how would you talk about the source being different from the source of truth in this scenario?
Gary Yeah, I definitely think the terms source and source of truth are not exactly the same. There are multiple sources of data that you’re going to ingest. The source of truth, to me, is your one gold standard: there should be at least one dataset that you trust, that you compare all your other sources against, where you say, these are the numbers that represent my business. That dataset is different from a source, or sources in general. So that’s what I think of as a source of truth: it’s your most important dataset. If you’re in insurance, it’s your premiums data. If you’re in ad tech, it’s your clicks, conversions, and impressions data. That dataset should be your source of truth because it gives you all the transactional information for how your business is doing.
Harper Right, and what we’re talking about here is taking that gold standard of data and using it as the basis of your data quality framework, to ensure that the sources coming in are actually meeting the expectations that your business has, at least from my perspective. So I’m curious: in your experience, when you build out these golden datasets, what are the attributes that you’re looking for? How do you identify what the source of truth is going to be, and focus in on something like the vendor ingestion process? If you have a source system providing you with data and you don’t control what it’s going to look like, they could change their API at any point in time. How do you identify what your source of truth should look like, and how do you ensure that it stays up to date, given a vendor could change it at any point in time?
Gary So a vendor and a production application team that’s internal to your organization are basically, at the end of the day, the same thing from a data org perspective. They’re still customers. They’re like a third party. The only difference is you might have a little bit more influence with your internal application team if you both report up to the same CTO, versus the vendor. But essentially it’s the same process, whether it’s internal or external. Whoever your data provider is, the most important thing is to set out SLAs and expectations from the beginning: when do you expect your data to be delivered? If you can, keep metadata on that data: how many rows did you expect to be delivered, how many columns, and so forth. In a very ideal state, which very rarely happens, you want to be working with your vendor, either internal or external, from the start, when they begin creating that data, to have your say on what you expect that data to look like even before it is produced, and basically have them invested in the process. That’s the most ideal situation, and it almost never happens. What ends up happening is the data team gets the data after they’ve built the application, and you’re basically a third party where it’s like, Hey, here you go. Good luck with this dataset.
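The delivery expectations Gary lists (when the data should arrive, how many rows, which columns) can be written down as a small, checkable agreement. This is a minimal sketch under stated assumptions, not a method from the conversation: `DeliveryExpectation`, `check_delivery`, the deadline, the row floor, and the column names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DeliveryExpectation:
    """What the data team and the provider agreed on up front (hypothetical shape)."""
    deadline: datetime            # latest acceptable delivery time
    min_rows: int                 # fewest rows a complete delivery should contain
    expected_columns: frozenset   # columns that must be present


def check_delivery(expectation, delivered_at, row_count, columns):
    """Return a list of human-readable problems; an empty list means the delivery passes."""
    problems = []
    if delivered_at > expectation.deadline:
        problems.append("late delivery")
    if row_count < expectation.min_rows:
        problems.append(f"too few rows: {row_count} < {expectation.min_rows}")
    missing = expectation.expected_columns - set(columns)
    if missing:
        problems.append("missing columns: " + ", ".join(sorted(missing)))
    return problems


# Hypothetical agreement: file due by 06:00, at least 1,000 rows, three required columns.
expectation = DeliveryExpectation(
    deadline=datetime(2022, 3, 1, 6, 0),
    min_rows=1000,
    expected_columns=frozenset({"policy_id", "premium", "effective_date"}),
)
problems = check_delivery(
    expectation,
    delivered_at=datetime(2022, 3, 1, 9, 30),
    row_count=640,
    columns=["policy_id", "premium"],
)
print(problems)
```

Keeping the agreement as data rather than burying it in pipeline code makes it something both the provider and the data team can read and sign off on.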
Harper It’s finding ways to avoid that waterfall mentality, right, when it comes to data teams. I think that’s the core basis of what I would describe as data governance: being able to engage everybody in the conversation, from the vendor to the actual team working on it.
Gary Yeah, I totally agree. And something else you brought up that’s really important is that it’s the edge cases that really get you. Data quality is like an 80/20 problem: 80 percent of your problems can be solved in 20 percent of the time. It’s those edge cases that affect one to two percent of the data that take days and days to investigate and find workarounds for when it comes to your source of truth problems, which I think is really interesting. Sometimes you ask yourself, Do I just ignore them? Do I just throw away these edge cases? Because it’s such a time sink when you have those one or two percent edge cases that are so difficult to solve.
Honor I would be very curious to get your thoughts on this, if we were to really simplify it, and I know it’s hard to simplify. What are the dimensions when we are talking about a source of truth? How many dimensions can we really consider to be relevant to that?
Gary You mean dimensions as in what attributes of a dataset make it a source of truth? OK. Honestly, I think it comes down to this: it’s not so much a data engineering problem as a business problem. I think your source of truth dataset is the most valuable dataset to your business. That’s how I look at it. It’s what your business defines as the golden dataset. For a health company, it’s your records. If you’re an ad tech company, it’s your application transaction logs telling you how many clicks you sold and how many conversions you’ve sold. So depending on the industry, it’s a different source of truth. And dimensionality-wise, that dataset is going to look extremely different based on the industry you’re in. It could be really wide and have many columns, or it could be narrow, with a few columns and a lot of rows, and so forth. It’s going to be industry dependent. But ultimately, the source of truth to me is less a pure data engineering problem and more a data engineering problem mixed with a business problem. It’s about working with your business stakeholders to first define it: this is your most important dataset. And then working with your business stakeholders to say, these are the assumptions about this data that should always hold true, and these are the rules that we should be applying to this dataset to make sure it is meeting the needs of the business and holds value for the business.
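One way to capture "assumptions about this data that should always hold true" is as a list of named rules that business stakeholders can read and that code can apply. The sketch below is illustrative only and assumes a hypothetical insurance premiums dataset; the rule names, the field names, and the `violations` helper are not from the conversation.

```python
# Each rule pairs a plain-language name (agreed with the business) with a check.
# All field names and rules here are hypothetical examples.
RULES = {
    "policy_id is present": lambda r: bool(r.get("policy_id")),
    "premium is positive": lambda r: r.get("premium") is not None and r["premium"] > 0,
    "state is a two-letter code": lambda r: isinstance(r.get("state"), str)
    and len(r["state"]) == 2,
}


def violations(rows, rules):
    """Return (row_index, rule_name) for every row that breaks a rule."""
    found = []
    for index, row in enumerate(rows):
        for name, check in rules.items():
            if not check(row):
                found.append((index, name))
    return found


rows = [
    {"policy_id": "P-001", "premium": 120.0, "state": "NY"},
    {"policy_id": "", "premium": 95.0, "state": "CA"},           # missing id
    {"policy_id": "P-003", "premium": -10.0, "state": "Texas"},  # bad premium and state
]
broken = violations(rows, RULES)
for index, name in broken:
    print(f"row {index}: failed rule '{name}'")
```

Because each rule carries a plain-language name, the report that comes out of a failed run is something a business stakeholder can act on without reading the code.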
Honor So just to build off of that: considering the degree of reliance on business context, my assumption is that clearly this is important, and businesses are talking about this before they build anything, right? Is it a normal process to first set those SLAs, to set what our datasets should look like, before anything is done in the data organization?
Gary It’s what you should be doing, but it’s not always what happens. Sometimes it’s a lot sloppier of a process than that. Ideally, you want to have all of the SLAs, your expectations, and these rules set ahead of time. What usually happens is the business wants to solve a problem. You end up pulling in that dataset because it’s your important dataset, and you find out maybe a month into development, usually in a waterfall method, because everyone thinks they’re agile but it’s not really agile, that, hey, there are actually a lot of problems. Then you end up going back and trying to fix those problems. In reality, that’s what usually happens. But ideally, you want to be having these conversations well in advance.
Harper Yeah, in my experience, the way that I first get a flag about the source of truth being off, or that the business has a different expectation or perception of what that source of truth should be, comes back to that meme I’m sure we’ve all seen. A business analyst comes to you and says, Hey, can I get this information about Q1 sales for this region for the last ten years? And you’re like, Sure, let me just write this query that does select star from perfectly aligned table where date equals whatever, right? When those business analysts or decision makers are coming to you with those questions, that really is them saying, Hey, I looked at the data that I know exists, but it’s not giving me the information that I need. So what I’ve started doing with that kind of request, instead of diving into the metadata and the schema and finding the right joins that are going to answer their question, is just asking a question like: How can I help you access this yourself? And how can you help me understand the value you’ll receive from getting that information? Because as you mentioned earlier, Gary, the source of truth isn’t an engineering problem. It’s a data organization problem. That means you actually need to communicate beyond the scope of your engineering peers, hopefully with a product manager on your data team, and that product manager is communicating with the data consumers who are going to be looking at the dashboards and things that you’re maintaining as well.
The most beneficial outcome of me coming back to them and asking how we can both understand the value that’s going to be received from this request is that we start creating a regular cadence around the conversation that needs to occur, to make sure that we are aligned not only on what the source of truth means for the business, but also on how the data engineering team can use that source of truth to enact a data quality framework that actually makes sure the data coming in matches the gold standard they’re setting forth for us. When you’ve worked at the other companies, and even at Eventbrite now, do you all make it a priority to have these conversations on a regular basis? And in your experience, what has worked well in terms of the regularity with which that occurs? We can talk about ideal scenarios if we want to be gracious here.
Gary No, it’s exactly what you said, Harper. I’ll take it one step further. It’s not just, where is that data? Once they get access to the data and start querying, they’re like, Wait, these numbers are off. This doesn’t reflect the business at all. And the problem comes down to this: most organizations are not proactive with the source of truth, data quality, whatever you want to term it. It’s usually reactive. To your point, you don’t find out about these problems until your analyst or business analyst starts working with the data and looking through the schemas, and then they’re like, Wait, there are a lot of problems here. Then you have to go back and fix some of them, and then they’ll query again and you react and you fix. At most organizations I’ve been at, unfortunately, I have to say it’s always a reactive problem. For startup companies, they don’t see the value, or they don’t have the resources, to invest upfront in data quality, because it’s a lot of time and effort that basically solves for a problem that doesn’t exist yet. In hindsight that’s really kind of naive, right? Because it’s going to be a problem whether they solve for it now or solve for it six months from now. For large organizations, it’s more of a data governance issue: the data is all over the place, the people who understand the data work in silos, and it’s a communication issue between the data org and the people who own the data, and so forth. So even if they are investing the time in building a data governance or data quality organization, the effectiveness of that organization, in my opinion, can be kind of questionable. So it’s a tricky problem for sure.
I definitely think, though, from my experience, it’s worth being proactive on the problem versus being reactive and waiting for the issue to come up before solving it, because that’s going to waste a lot more time for everybody.
Harper Yeah, that’s a really good point, and I like what you said there, that sometimes the business doesn’t seem to care about whether the data is in the right place or exactly where it is. That exact perspective from an engineer is what really helped drive forward the idea of creating a data ops culture at some of my previous companies: having the engineering team talk to the business team and say, why do you need us to deliver this particular query by end of week, as opposed to helping us understand how we can build out a process that ensures you don’t have to come to us for this query again next month? That conversation really helped us start setting forth the mentality of a data ops culture. And that’s hopefully what all data orgs are working towards at this point, right? When I say data ops, I’ve recently been told that’s not a familiar thing for everybody. So I’ll just say that the idea of data ops, for me, is taking DevOps best practices and applying them to data management and data engineering, so that we continue to see a better practice for managing data, the same way that DevOps has created a better practice for creating software over the last 10 or 20 years or so. In my previous experience at these companies, what we ended up finding is that it’s not that the business doesn’t care about where the data is or why the data exists the way that it does. It’s simply that the business is worried it would take too long to answer the bigger question, and they need to keep moving forward so they can keep meeting sales numbers and call numbers and recruiting numbers, whatever the metric is that’s driving their bonus or their team’s success at the end of the quarter.
Because of the speed they’re working at, and their experience of data projects tending to take a lot of planning and intentionality to get right, there’s a disconnect between the cadence and the agility that the two sides expect from one another. At the end of the day, the reason I’m really excited about the data ops movement taking those best practices from DevOps is that data ops can really be the way that data teams start to move at the speed of business. And I think that’s going to be the best outcome for people adopting these data governance strategies, like creating a source of truth that the business is informing the engineering team on, so that engineering can provide golden data inside of production systems as well.
Honor Really love that, Harper. And I like to ask the really simple questions, because that’s usually how I think. I think we talked about this in another conversation we had recently, about looking at this proactive approach as being like health insurance or car insurance. There is this understanding that reality is running on its own track, and, going to your analogy about data not operating at the speed of business, you’ve got reality and then you’ve got a completely removed setup of how we think data should be flowing. What is it going to take for us as an industry to start seeing that? Let’s just get the insurance. Let’s not even debate this all the time. Everyone knows this is important. Why don’t we just do it? What’s stopping us from this becoming a new standard?
Harper I think that what’s stopping us is lacking the magic wand to wave across our data, conform it to the same dimensions, and say that every domain has the same expectation of its data that the adjacent domain has. I say this probably too often, but data engineering is just such a high-context field, and the complexity involved in not only testing code but also testing data quality is something that data teams are still working through, to make sure they have a standard that can be applied across domains, so that engineering teams can have a common language to talk about what it means to have those data quality frameworks. One thing that I’ve come across recently, talking to various peers in the field, is that when I say the word pipeline, it means something different to somebody working in a completely different infrastructure. I see you nodding your head there, Gary, and I think it relates back to this conversation about the source of truth. How do you have a conversation if you can’t agree on a common language? How do you set that common language between engineering and the business, and make sure that you’re enforcing that source of truth at the same time?
Gary Yeah, and to add on top of that, sometimes data doesn’t have a seat at the executive table, or even if there’s a data organization, it’s not represented at the same level. To your point, Harper, you’re moving at the speed of business, but data quality is seen as tech debt, so organizations end up deferring it. The way you move at that speed is you do workarounds, you do lookup tables, you put in little Band-Aids here and there to patch things. Then you do this for two years, and your entire data system is completely unusable, because you have all these nuances of, Oh, I actually have to join this or do this to get this query to work, because we never fixed this problem. And then, at the end of the day, two years later, you’re going to have to scrap everything and redo it all again, which can cost you a lot more than if you had just done it correctly the first time. Everyone says that they are data-centric and that data is very important for making decisions in the business. But I don’t think the business always understands the level of investment required to have a stable and usable data system, one that in an ideal state is self-service, or even just one that you don’t constantly have to put patches and Band-Aids on to get it to work properly.
Harper My favorite story, because it wasn’t my pain and was somebody else’s pain, is hearing a data engineer tell me about an organization he was working at. They suddenly realized that the dashboard they were populating just wasn’t being used as often as they expected it to be. They were checking the metrics on it, and they reached out to the sales team that was using the dashboard. Their manager came back and said, Oh, actually, we kind of created our own. We took the data that we saw available, that you all were feeding the other dashboard with, tweaked it our way, and created a dashboard that met our needs. And lo and behold, the data engineering team digs further in and finds out there’s this whole separate data mart that exists inside of the sales team’s EC2 instance, which they had spun up on their own because they were super savvy and wanted to get into it. And I’m all for that, right? I’ve talked about democratizing data and everybody having access to it. But like you mentioned here, Gary, whenever you have people creating these ad hoc queries and ad hoc changes, and that starts fitting their needs, and they’re managing the data and fixing it the way that they need to, if that isn’t communicated back upstream, it just creates headaches down the road. Then you have a more complicated data management situation at that point in time.
Gary Yeah, it basically becomes tech debt that never gets addressed, right? It’s like, OK, this is part of doing business. I’m going to have to do these random changes to go around it. But no one ever takes the time to take a step back and ask, Wait, why are we doing all these workarounds? Because if you really think about it, if you took a helicopter view of it, you’d realize that if you just invested one month into improving your ecosystem, having better data quality controls, maybe remodeling some of the datasets, you could shorten the amount of time it takes to create those insights and dashboards. But that’s the thing. It requires the executive team to put in the investment of time to allow the data organization to do that, instead of saying, OK, the data organization is here to serve us. It needs to answer questions at our pace, and that’s it. That’s how I feel some organizations in certain industries, and I’m not going to name who and where, do view their data organizations.
Harper Do you think that mentality from the business is kind of a hangover from the first or even the second wave of data management, which came about with the warehouse and on-prem? When you’re in an on-prem world, you’ve got bare metal servers and you’re managing all your databases yourself, and you’ve got a data team that’s coming in, and as a business owner you say, Hey, I’d love to do some analysis on this. Can we create a data warehouse? There are a lot of horror stories from the 90s and early 2000s where people have a data project that sets out and takes two years to finally get the data warehouse understood and working and running. And by the time it’s there, it actually doesn’t answer the question. So you talk about the business thinking about the data org as something that serves them, instead of being part of the decision-making process at that table. I’m curious what you think about how that is affected by those previous experiences, and how we might be able to get past that mentality moving forward.
Gary Yeah, I think with the new technology out today, especially moving to dbt and to the cloud, the time to ramp up and the cost to build a data organization are a lot shorter and a lot cheaper than they used to be. I think there are definitely people out there who were burned by that experience in the 90s of having to build these massive snowflake data marts, put them into Oracle, and then pay for very expensive Oracle licenses. Especially if you’re starting a brand new company, I think we’re at a place where the investment to create a data warehouse, or at least a bare-bones data warehouse, is much lower today than it used to be. But ultimately, I think the problem is the general tech problem, which is that your non-tech business executives don’t understand the amount of effort required to do certain tech things, like building an application, building a data warehouse, or getting insights. They just assume that everything should take a week: why should this take a month to do? Translating the difficulties that come with building tech tools, applications, or a data warehouse to those executives is the most challenging part. But that’s always been a problem for any organization.
Honor So, magic wand. If we were to say you both are empowered to have a magic wand and do whatever you want to fix this, what are your first steps to addressing this situation, to moving us to a place where you don’t have to have an entity that is made up of 90 percent Band-Aids?
Harper I’m assuming I can’t use the magic wand to transport me to my own tropical island and no longer have to worry about data org problems. That’s not one of the solutions?
Honor That is after you’ve fixed the data.
Harper I don’t know. That’s a really good question. I’ve thought about it in different ways at past organizations, but the magic wand scenario isn’t one that I’ve thought about recently. Gary, what’s your take on this?
Gary Yeah. I mean, it really is dependent on the business and dependent on the industry. I think the problem is that every industry, and even every size of business, is so unique that they have different business challenges. Some organizations have problems with manually input user data, but in some industries that type of data doesn’t exist, and all their data comes directly from web applications. Those two types of companies are solving completely different types of data challenges. So there isn’t a blanket solution that fixes the problem for everybody, because of how varied the problems are in this industry. I think the things that are important, though, are, first, having the right investment upfront from the executive board, instead of having data seen as a second priority. And second, having the business work very closely with engineering on data problems: not having silos where you have business users who use data and engineers who build the data, but actually having them work together side by side as a team to build data quality rules and set up data quality and the data ecosystem from day one. Instead of having this become a passing of the buck, where engineers react to the business analyst, fix the problem, and then say, Here you go, is this correct? and go back and forth and back and forth, which a lot of organizations seem to do, you basically have a system in place that has people work together from the start. Those two things would go a long way. And then also not over-engineering your system: not trying to build a massive data warehouse with multiple data marts from day one when you don’t need that. That’s another problem that I see a lot, data organizations really over-engineering their solution for data modeling and so forth.
Harper Yeah. One of the best analogies that I learned early on, working in different agile frameworks, is: Look, let’s just build the skateboard, guys. Let’s build the skateboard first. Let’s then make that a bicycle, then let’s make that a car, then let’s make that the Ferrari, right? Like you said, you don’t need the data warehouse that has every single data mart addressing every single vertical within your business. Answer the question that’s most important first, right? And then use that to inform the way that you want to build up the rest of the structure around the warehouse. We had an episode previously where we had Sarah Krasny and Sam Bell on. The one thing that came out of there that I found really insightful was the idea that whenever you find the data problems that you have, or items that need to be addressed, or the data tech debt that you referred to, you communicate outward at that point. And I think that really resonates with what you just said in terms of how you would take that magic wand approach. It’s not the idea that there’s going to be a way for us to abstract the data problem across all domains. It really comes back to the fact that we need to find better ways to communicate between business and data, ensuring that everyone is aligned and working together at the same time, right? And at that point, it’s the same thing that I mentioned earlier. It’s the idea of that data culture. It’s the idea of building the data organization, building that data mindset. That way, instead of working in the silos that you mentioned, Gary, you’re working as cross-functional teams that are aligned on the same goal, with different value propositions coming out of that goal being achieved.
Gary It’s funny, because there are so many organizations now that pretend to be agile or say that they’re agile. There are very few organizations I’ve actually seen operating with a true agile mindset, mainly because, I think, people are perfectionists and they don’t want to put out work that they don’t think is perfect, or it’s a visibility problem. And fundamentally, it’s partially a company culture issue. So they pretend to be agile. They do sprints. But truthfully, a lot of these organizations still operate today with a waterfall mindset and, worse, in a waterfall manner. And these teams are siloed. I think that’s a big point to call out: organizations need to start moving towards a more agile mindset, where you are building, you know, just a deck and not the entire skateboard when you’re trying to build a product.
Harper Yeah. Say it louder for the people in the back, please.
Honor Yeah. I want to double-click on that. So if we were to boil this down: the single source of truth is going to be dependent on your business. You basically have to come together as a team to figure out what that means for your individual context, and have a communication framework so that your data team is part of the business conversation and not considered an external team, or one dealing with something that is not as important. Because, as Harper and I talk about all the time, all businesses in this day and age are essentially data businesses. You need those insights in order to operate. Is there a testing framework we would be able to suggest that is universally applicable?
Harper I think we’re seeing more tools enter the data space every day, and we’re seeing some tools address those specific problems: a standard library that could be used for applying business rules and data quality frameworks, and for ensuring that the outcomes of your data pipelines and your data sets end up matching that source of truth, or at least are being compared to that source of truth, or for identifying where your source of truth has holes or gaps in it. But we’re still at a place where we haven’t been able, as an industry as a whole, to abstract the core problem so that it works across all domains. You still sit in that space of, how do I make sure that the data quality rule for my health care company is going to match the data quality rule that exists for, say, the sports industry? And frankly, those companies don’t care if those data quality rules match, right? If I’m running a health care company, I don’t care if my quality framework is going to be applicable to a different industry. So you have the data industry growing at this point, and you have all these tools coming out to think about all of these items at that sort of meta level. But it’s just very nascent at this point in time, and we’re going to see more and more companies come in and try to address those problems. Over time, we’ll see consolidation. But if I were a betting person, I would say that there are at least five or ten more years of exploring different ways to tackle these problems before we finally hit upon something that’s like, Oh, that’s really obvious, why haven’t we done that? And then we find something that works for everybody across the board. But I don’t know. That’s a really good non-answer, but that’s the best I’ve got.
Gary Yeah, there have been tools that have tried to solve for data quality for a long time. Informatica had built-in features, and even dbt has these features too. But the thing is, the tooling we use changes so frequently. One day you’re on Informatica or IBM DataStage, and the next thing you know you’re writing Python with PySpark on an EMR cluster, and then you move on to Google Cloud. So every time you shift to a new data ecosystem, you have to integrate a new tool, which is the tricky part of the problem. The other thing is, to Harper’s point about DataOps, there are a lot of core concepts in software, like test-driven development or code coverage, which, and this blows my mind, don’t really exist in data. We don’t have a kind of data coverage. Or why don’t we apply the same test-driven development approach when we create data warehouses and data ecosystems? These concepts kind of fall off when we’re talking about data. And I think it comes down to a mindset change: going from viewing data purely as, Oh, we’re just writing SQL, we’re just writing code, to having a much more ops mindset about it, that there are also tests we need to write, there are also assumptions we have to check against the data, and this should be standard whenever creating anything within the data ecosystem.
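One way to picture the test-driven approach Gary describes is to write the assumptions about a table down as checks before trusting anything built on top of it. What follows is only a minimal sketch in plain Python, not something shown in the episode: the `ORDERS` rows, the column names, and the helper functions are all hypothetical, and a real team would more likely express the same checks in whatever tool its pipeline already uses.

```python
from datetime import date

# Hypothetical sample rows standing in for a freshly loaded table.
ORDERS = [
    {"order_id": 1, "amount": 25.0, "order_date": date(2022, 1, 14)},
    {"order_id": 2, "amount": 40.0, "order_date": date(2022, 2, 3)},
    {"order_id": 3, "amount": 12.5, "order_date": date(2022, 3, 21)},
]


def null_rows(rows, column):
    """Return the rows where `column` is absent or None."""
    return [r for r in rows if r.get(column) is None]


def duplicate_values(rows, column):
    """Return the non-null values of `column` that appear more than once."""
    seen, dupes = set(), []
    for r in rows:
        value = r.get(column)
        if value is None:
            continue
        if value in seen and value not in dupes:
            dupes.append(value)
        seen.add(value)
    return dupes


def missing_months(rows, column, year, expected_months):
    """Return the expected months of `year` that have no rows at all."""
    present = {
        r[column].month
        for r in rows
        if r.get(column) is not None and r[column].year == year
    }
    return [m for m in expected_months if m not in present]


def run_checks(rows):
    """Run every assumption and return a dict of failures (empty means pass)."""
    failures = {}
    if null_rows(rows, "amount"):
        failures["amount_not_null"] = null_rows(rows, "amount")
    if duplicate_values(rows, "order_id"):
        failures["order_id_unique"] = duplicate_values(rows, "order_id")
    gaps = missing_months(rows, "order_date", 2022, [1, 2, 3])
    if gaps:
        failures["no_missing_months"] = gaps
    return failures
```

Calling `run_checks(ORDERS)` on the sample rows returns an empty dict. Dropping the February row would surface a `no_missing_months` failure at load time, rather than when an analyst notices the gap later down the pipeline.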
Harper So before we sign off, let me ask this last question. If you’re setting out today to start your new organization, and you want to ensure that the data org has a seat at the decision table, and that you’re starting from a place where you not only have a DevOps culture but are also building that DataOps culture as well, what do you think is the most important aspect to instill first? Is it the communication aspect, or is it the data literacy aspect? What do you think is lacking that’s preventing companies from moving forward with that DataOps mentality?
Gary I think organizations need to be more explicit about setting the expectations for what they expect from their data org and from their data from day one, instead of just being like, Well, we just want insights. I think the data organization should set out that we expect our data to be X percent reliable, and what reliable means, and so forth. I’m not going to quantify that right now, but there are definitely ways to quantify it, like we expect there to be X amount of downtime, et cetera, and to have that be the baseline for what the data org needs to meet. And then from there, have the data org implement practices, whether that’s DataOps and so forth, to meet those requirements. So setting the expectation for what the data org should be hitting, and working backwards to how we can meet those expectations, might be the best way to approach that problem.
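Gary deliberately leaves "X percent reliable" unquantified, so the following is only one hedged illustration of what such a baseline could look like once a team agrees on numbers. It is a small Python sketch: the function name `reliability_report`, the 99 percent target, and the 24-hour freshness window are all assumptions made up for the example, not figures from the conversation.

```python
from datetime import datetime, timedelta


def reliability_report(rows, checks, loaded_at, now,
                       target_pct=99.0,
                       max_staleness=timedelta(hours=24)):
    """Compare a data set against an agreed (hypothetical) reliability baseline.

    rows          -- list of dicts, one per record
    checks        -- list of functions taking a row and returning True if valid
    loaded_at     -- when the data set was last refreshed
    now           -- the time the report is evaluated
    target_pct    -- assumed target: share of rows that must pass every check
    max_staleness -- assumed target: oldest acceptable refresh age
    """
    passed = sum(1 for row in rows if all(check(row) for check in checks))
    valid_pct = 100.0 * passed / len(rows) if rows else 0.0
    return {
        "valid_pct": valid_pct,
        "meets_validity_target": valid_pct >= target_pct,
        "is_fresh": (now - loaded_at) <= max_staleness,
    }


# Usage with hypothetical rows and two simple row-level rules.
rows = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": "b@example.com"},
    {"user_id": 3, "email": None},
    {"user_id": 4, "email": "d@example.com"},
]
checks = [
    lambda r: r["user_id"] is not None,
    lambda r: r["email"] is not None,
]
report = reliability_report(
    rows,
    checks,
    loaded_at=datetime(2022, 3, 1, 6, 0),
    now=datetime(2022, 3, 1, 12, 0),
)
```

The point of the sketch is the working-backwards step: once the targets are written down as parameters, the data org has something concrete to measure against, and the business has something concrete to negotiate.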
Harper Yeah, I agree with the matter of having expectations set. It’s the same reason that SLAs came about back in the day, right? It’s not that you need an SLA for how your communication should work, but you do need to set up a communication framework that allows the data org to understand how the business needs them to deliver everything, and then ensure that data is at that table and is part of that conversation as well, so that they can help guide that strategy forward. And when you have that communication set up between business and data, you’re going to end up having a DataOps culture without realizing it, right? At the end of the day, I love the conversation about DataOps becoming a facet of a mature organization. But I really am starting to question whether the focus on creating DataOps is putting too much pressure on having the right process in place. Whereas if we all just took one step back and found a way to talk and communicate, DataOps could come more naturally. As long as you set those expectations that you’re talking about, that conversation can occur naturally, and now everyone’s communicating at the same table. You don’t have to worry whether you’re on the same page.
Gary I totally agree. I think a lot of organizations view ops as another checkbox in their bureaucratic process, instead of something that helps them operate in a more efficient manner. And I think at the end of the day it comes down to communication. A lot of organizations have teams that are too siloed. Your ops team is too siloed from your engineering team, which is too siloed from your data team. Having a strong communication line between the business, data, operations, and so forth, and having that culture, is extremely important for, to your point, having DataOps be built into the process instead of being something that you’re forcing your engineers to adopt.
Harper Yeah, I love it, I love it, Gary. This is awesome. I could keep talking about this for another hour, probably. And I think that Honor would keep asking us the great questions that make us actually explain the things we’re talking about, so that everyone’s engaged in the same way. But I just want to thank you for coming on and taking the time to chat with us. We’d love to have you back sometime; we can figure out another time to do that, for sure. Other than that, thanks to everyone. I hope you all enjoyed the episode. Honor, I’ll catch you again soon.
Honor Yeah, thanks. See you. Bye.
Gary Thanks for having me. It’s great. Thank you. Thanks, Gary.