Episode Transcript
Ryan Yackel: Hey, everyone, welcome to the Mad Data podcast. My name is Ryan Yackel, I’m one of the hosts, and we also have Josh on the line as well, one of our other hosts; he’s the co-founder and CEO over at Databand. We also have Chad Sanderson, who’s the head of data platform over at Convoy. Chad, how are you doing, man?
Chad Sanderson: Doing well. Great to be here.
Ryan Yackel: Good. I know that right before this discussion we were talking about your title of head of data platform; Josh, I know you wanted to ask some questions about that.
Josh Benamram: Yeah, yeah, it’s interesting because we’re starting to see more specialization and nuance in the titles forming around the data leaders that we talk to. So I’m curious, what does head of data platform mean for you at Convoy? What are your areas of ownership and responsibility, and how did you land on that title to describe what you do?
Chad Sanderson: Yeah. So data platform at Convoy refers to Convoy’s data infrastructure. So it’s all the tools and platforms, everything from ingestion to ETL to the data warehouse to the way that our internal customers leverage data to make business decisions. So we own an experimentation platform, a machine learning pipeline, all of our BI and reporting tools. And the reason that the role emerged is because teams realized that it wasn’t enough to just attribute all of our data infrastructure to IT spend and go through a checkbox exercise deciding what products we should buy and what we shouldn’t; it really needed some product thinking to come up with a plan and a strategy around how to evolve our data infrastructure to suit Convoy’s growing needs for machine learning and experimentation. So there are a lot of things that we’ve bought, but we also build a lot of things where we’re a bit ahead of the market, and we use a lot of open-source projects as well. So my role is to help with the prioritization of those projects, to determine when we should build versus when we should buy and where we need to do something innovative.
Josh Benamram: Really interesting, thanks. Sometimes when we talk to folks in platform, it’s really common to hear the area of responsibility that you just described: platform being the team that is responsible for the infrastructure within the data organization, all the different tools, services, and solutions that people are using to ingest, ETL, ELT, and consume data. Sometimes we’ll also see platform own some level of the actual pipelines themselves, often on the ingestion side of things. So the second area of responsibility is often that platform will also make sure that there’s some level of raw or prepared data that’s available to the analysts, data scientists, and data engineers downstream to begin doing their work. And this helps that area of the business not focus as much on the data prep, but more on the data product development. So I’m curious, are you exclusively on the infrastructure layer? Is there some level of the pipelines themselves that you own?
Chad Sanderson: We do. We take a similar approach. Our goal is not to own any business logic; we want to stay abstracted away from that. But when it comes to the processing of raw data, we do want to ensure that as new events land in the data warehouse, we have a set of automated systems which, you know, aggregate information as needed, parse it, clean the JSON up, and make it accessible to our downstream consumers so that they can plug it into a model or do analytics on it without having to go through the headache of tons of transformations themselves.
Josh Benamram: Interesting. And is that line of ownership something that just constantly shifts and moves across time?
Chad Sanderson: It doesn’t, actually. We have a pretty clear delineation of what the data platform team owns and what the downstream product teams own. Historically it was pretty blurry, but the model that we’ve moved to quite recently is that at Convoy we have a few different schemas. We have a schema that we call landing zone, where basically raw JSON lands. We have another schema called source, where that raw JSON is transformed into usable tables: things are renamed, the JSON is parsed. We own those two steps of the pipeline. Once it goes past that, it enters what we call the data mart layer. So it enters the Kimball world, and it’s the responsibility of the product teams to organize that data into fact and dimension tables. And it’s really based on what data is coming from the services. So if a service is producing data for an entity, let’s say an auction, then that product engineering team owns the auction data. They own the creation of the dimensional models for auctions and all of the corresponding facts.
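To make that layering concrete, here is a minimal sketch of what the landing-zone-to-source step might look like; the schema, table, and column names are hypothetical, not Convoy’s actual ones.

```sql
-- Hypothetical illustration of the landing-zone -> source step described above.
-- landing_zone.auction_events holds raw JSON as it arrives from the service;
-- source.auction_events exposes parsed, renamed columns for downstream teams.
CREATE OR REPLACE VIEW source.auction_events AS
SELECT
    payload:auction_id::STRING     AS auction_id,
    payload:shipment_id::STRING    AS shipment_id,
    payload:event_type::STRING     AS event_type,
    payload:occurred_at::TIMESTAMP AS occurred_at
FROM landing_zone.auction_events;   -- raw JSON lands here untouched

-- Everything past this point (fact/dimension modeling in the data marts)
-- is owned by the product teams, per the split described above.
```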
Josh Benamram: Really interesting, thank you. Last question before we get to the topic today: at what point in Convoy’s growth did you start to feel the need for that separation of responsibility, where data platform emerged?
Chad Sanderson: Yeah, so originally, back in 2018 and early 2019, the data engineers were embedded on the product teams. And that didn’t go very well for a few reasons. One of the reasons was that there wasn’t a clear management hierarchy. We didn’t have enough data engineers at the company for them to report up to a data engineering manager who might report to the CTO. We didn’t really have that; the data engineers reported through the data science function. And those roles are just so different that oftentimes the data scientists and the analysts were being prioritized over what the data engineers needed, and that wasn’t good. The other problem was that the data engineers were often being treated kind of like SQL monkeys. Consumers were writing a bunch of SQL, they were building training data, they were doing experiments and things like that. And the data engineers were essentially told, “Hey, can you guys go in behind these folks and clean the pipeline up and rebuild it and make sure that the SQL is scalable?” And nobody really likes doing that work, especially if you’re an engineer and you like building things and you have a very, you know, rare skill set. So then we moved to a centralized model where all the data engineers were centralized under one team, which was within the infrastructure organization, and we started supporting more tooling. But there were additional problems, right? Like, the reason that the data engineers were embedded was because the data teams weren’t actually able to write scalable SQL on their own. And that created a whole bunch of different problems. So we kind of went back and forth on this model for a bit until we decided on the one we have today.
Josh Benamram: Okay. It follows a lot of trends that we see. I’m trying to get to a rule of thumb, like: if you’re a data team of 10 people, get some data platform folks and start this operation. Where does the boundary tend to emerge? It’s going to feel different in different organizations, of course. But do you have a sort of rough benchmark like that that you would suggest to teams that are starting to scale quickly?
Chad Sanderson: Yeah, like you said, I think it is going to be a bit sensitive to how much data you have and how complex the data is. Convoy’s data is reasonably complex. We have tons of different entities and lots of different real-world events happening, and it’s very non-linear. So the modeling becomes incredibly, incredibly important. The architecture becomes incredibly important. And so for that, we probably lean a bit more heavily into the data engineering function than other teams might at a similar stage.
Josh Benamram: OK, interesting. All right. Well, I think I’ll pause that thread now, we’ll save it for our next podcast, and we can get to our core topic for today.
Ryan Yackel: All right, so let’s talk about the main topic today. So Chad, when you and I talked, we came up with this topic of how to avoid a vortex of data debt. And you talked about how most companies face this technical data debt problem, and they’re not built for scale. And by the way, everyone listening should keep in mind that Chad posts all the time on his LinkedIn, and he posts some really good content, so be sure to follow him there. And one of the things that I think we should start off with, maybe, is: Chad, what’s your definition of data debt? I guess all debt is bad, but what’s your definition of it, and what creates this data debt?
Chad Sanderson: Yeah, yeah. So there are two parts to that question: my definition of what data debt is, and then what creates it. I would say that data debt is really the data warehouse equivalent of technical debt. When you have a build-up of queries and SQL statements that are fundamentally unscalable and that are propagating through the data warehouse at a rapid pace, it results in a lack of trust, and also a lack of usability and ownership. I would define that as the business having a data debt problem, meaning that in order for you to trust the data or leverage that data effectively, you need to pay down the debt. And paying down the debt would require rebuilding those queries, potentially refactoring upstream sources, and going through a modeling and data architecture effort. A lot of businesses find themselves here; Convoy certainly found itself there to different degrees. Some folks have it really bad, where the data debt is spiraling out of control. Other folks don’t have it as bad because they may be a bit earlier in their journey. Maybe they just got dbt, they just started hiring analysts, so the problem hasn’t really begun to rear its head yet. In terms of what causes it, I think there are really two problems that end up emerging: upstream quality issues and downstream quality issues. The upstream quality issues often emerge in an ELT-based environment, where we are extracting data from first-party sources, like a production database or S3, and piping that data into a data warehouse like Snowflake. The software engineer who owns that production database is really treating it like an implementation detail of their service. The data is not fundamentally intended for analytics or machine learning, and because it’s an implementation detail of the service, the engineer has the right to change it at any time. If they have that right, then they can and will change it at any time. Data scientists and analysts have dependencies on it, and those dependencies break. So that’s data quality problem number one, and it happens very, very frequently. The second data quality problem builds on the first. If you live in this world where the software engineering team isn’t really emitting the data that you need from your first-party sources, it means the analysts and the data scientists and any other consumer are going to have to reverse engineer important business concepts using SQL. For example, at Convoy we have a very important business concept called shipment categorization. You don’t need to know exactly what that is, but just know that our sales team thinks it’s really important, our machine learning team thinks it’s really important for things like pricing models, and it’s a pretty important component of our business model. But the upstream services where this data could be collected don’t care about shipment categorization. They don’t need that information in order to operate, and so it doesn’t get recorded. And that means that someone downstream is going to have to build a SQL table, a SQL file, in order to capture this concept of shipment categorization. And that is really hard, and it’s very, very complex. So what a data scientist basically has to do is look upstream. They have to grok all this information that’s happening in production. They have to understand the code paths. They have to understand a lot of different state machines.
Then they have to write the SQL, which is usually some combination of many, many joins and a bunch of if/else statements. And when they do that, if they do it successfully (and oftentimes they fail; they’re not software engineers), the code is not written in a very scalable way. There is no unit testing. There’s no documentation. And that very important business concept is then taken as a dependency by many other teams, because it’s critical, right? It’s something you want in your table, or in your model, or somewhere else. And these two issues then combine. The upstream is changing all the time; the software engineers technically own the source data, but there’s no contract between the producer and the consumer, so any time they want to change it, they can. And at the same time, you have all these downstream models being built with many, many lines of SQL, tons of joins, low quality, and they’re breaking all the time. And that creates debt.
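As an illustration of the kind of downstream SQL being described here, this is a minimal, hypothetical sketch of a reverse-engineered "shipment categorization" query; none of the table or column names are Convoy’s, and a real version would typically involve far more joins and branches.

```sql
-- Hypothetical sketch of a business concept reverse-engineered downstream,
-- stitched together from service tables that were never designed for analytics.
SELECT
    s.shipment_id,
    CASE
        WHEN a.auction_id IS NOT NULL AND t.tender_status = 'accepted' THEN 'marketplace'
        WHEN c.contract_id IS NOT NULL                                 THEN 'contracted'
        ELSE 'uncategorized'
    END AS shipment_category            -- business logic inferred, not owned upstream
FROM prod_replica.shipments s
LEFT JOIN prod_replica.auctions  a ON a.shipment_id = s.shipment_id
LEFT JOIN prod_replica.tenders   t ON t.shipment_id = s.shipment_id
LEFT JOIN prod_replica.contracts c ON c.shipment_id = s.shipment_id;
-- If any upstream service renames a column or changes its semantics,
-- this query (and everything built on top of it) silently breaks.
```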
Josh Benamram: Interesting. So it sounds like a lot of the root causes that you’re attributing to this creation of debt come down to the non-optimal, bad, or missing structure of data coming from upstream locations. You’re pointing a lot at the software engineering side of the house: they’re not giving enough thought to how to properly structure data that’s going to flow into analytics and ML. And therefore that leaves analytics and ML with this snowball of complicated logic that they need to build up, which is generally not optimal to run within the warehouse layer, where it can be really expensive to do so. Am I playing that back right?
Chad Sanderson: That’s exactly correct. Exactly correct. And the last problem that you pointed out is sort of the end result of all this: if it’s hard to understand, hard to grok, what any particular SQL query is doing because it’s so complex, then it can oftentimes be easier for the data consumer to just build their own query. Or maybe it’s a very, very expensive query to run and everybody starts hitting it. And when that starts happening, the cost of transformations in the data warehouse starts going exponential, which is exactly the problem that the move to the cloud data warehouse was meant to solve. Right?
Josh Benamram: Yeah, I was going to ask you: if I’m a Snowflake salesperson, I’m telling you, don’t worry about it. You know, we’re infinitely scalable, we’ll perform as your queries become more and more complicated, this is what we were designed for. So what’s your answer to that?
Ryan Yackel: I have a quick question for Chad, though. This is a dumb question from a dumb marketer, but is there a similar thing going on with the way IBM operates its MIPS mainframe pricing model? Where basically you’re hitting the mainframe with every transaction, whether it’s in a testing environment or a production environment, and you get billed every single time. Is that, let’s say, a legacy model from the on-prem world, and is that same problem now coming to the cloud like we just talked about, or am I off?
Chad Sanderson: I think in some ways, yes, it’s similar to what’s happening. Like, we have queries at Convoy that take 20 or 30 minutes to run just due to the complexity of those queries. And if your query takes that long and you’re hitting so many tables and there’s so much compute happening, and those transformations are being run thousands or tens of thousands of times a day, you know the bill is going to start adding up. But I think there’s another cost problem outside the pure volume of computation, which is the cost of the team that’s required to support that model. It becomes enormous. You need to start transitioning analysts who previously only really needed to know SQL. Now they need to know dbt, right? Now they need to become more like analytics engineers, which is the title that’s becoming more popular. They need to know how to use a command line. They need to become software engineers, and you need more and more and more of these people, because as a data warehouse becomes more complex, a single analyst is not going to be able to handle on-call for all of the models that their team owns. So there’s a computation problem, a transformation issue, and then there’s also a people issue, which is one of the reasons why it is very challenging for an enterprise company to adopt tools like dbt. There is so much data. They already have a massive analyst population, and now they need to start retraining all of those analysts to have a totally different job. It’s just not going to fly for a lot of those big companies, especially where this sort of semantic layer of metrics and features and dimensions and facts is incredibly critical to business operations. It can’t be wrong, and it can’t be late.
Josh Benamram: If I can play devil’s advocate for a second: I’m the software engineer, right? Data is coming to me and saying, you know, you really need to make this data more usable by the downstream consumers, there’s too much work getting pushed down to them, and all these very legit reasons as to why this is accumulating debt in our warehouse and leading to giant bills from our warehouse providers. We really need this data to be better structured from you as soon as you send it. The software engineer says, you know, we went through this a year ago. Data came to me and said we need more control over how this data is derived. We have all these new personnel now, we have this new analytics engineer who can go in and write more complex SQL, who can use better pipelining tools to own more of the logic. And, you know, a year ago data told me that I was constraining too much. So where is the balance, right? How much does software own versus how much does data own? And granted, it’s going to take a while for the bigger teams to adopt tools like dbt, but these services do help democratize some level of the logic creation. So how would you answer that kind of expected pushback from software about going through another cycle of “all right, we’ll own more of the logic”?
Chad Sanderson: I think what you need is both. I think that you do need to know what the services are actually doing; that’s important. But you also, where possible, need to be emitting semantic information, real-world events that are happening, that are very, very simple to join. And in my experience, it actually is the latter where data scientists are going to be spending the vast majority of their time. And that is pretty reflective of how data architecture is done in what I would call legacy businesses. So if you look at the Boeings of the world, the companies that probably haven’t made the move to Snowflake yet, with very traditional ETL models, they usually have a very heavy layer of data governance and data architecture that sits at the top of the entire system. They do all the transformations in advance, and they make sure that the data warehouse is basically a one-to-one mapping of how the business actually works. What we realized in the move to the cloud is that there does need to be a level of flexibility so that data science teams can still answer interesting questions outside that model. But that model still needs to exist. Essentially, what we’ve done is cast that aside as we’ve moved to the cloud and said we don’t really need this map of the business anymore, we can do all the transformations in the data warehouse. And what we’ve experienced at Convoy, anyway, is that this creates innumerable permutations of the exact same concepts with very, very slight differentiation, where the SQL is not written very well and it doesn’t scale very well. So I think that will be the argument: there is room for both. But that view of the world coming from services, which maps out what the business looks like, the entities that we care about, the real-world events that are happening, and how entities are tied together, should exist in some form.
Josh Benamram: OK, so I’ll call that the semantic layer. Tell me if I shouldn’t.
Chad Sanderson: I don’t know. I would personally call it the semantic layer, but I think in the modern data stack, the semantic layer means something a little bit different. I think there are probably two layers at play here. The layer that I’ve described just now, maybe I would call that the descriptive layer. That’s basically, you know, let’s model out the entities and events that power our business. And then the semantic layer is: now let’s transform that entity and event data into logical constructs that we can use to make business decisions, like margin as a metric. That’s an example, a simple concept.
Josh Benamram: OK, so using that example, let’s say we’re an accounting software company. You would imagine the software engineering team owning something like the definition of the metric of revenue, right? Software engineering sends down revenue and they send down cost of goods sold. Those are defined metrics; they’re kind of instantiated on the software side. Data analytics, analytics engineering, and data science are not going to change those too much. That’s the layer of description that software owns. Then as that comes down to the business, depending on what kind of financial analysis we want to do, definitions are maybe less immutable; it depends on what the priorities are. Are we optimizing toward increasing revenue, decreasing churn, increasing profit? We may want to build a dashboard that does a lot of gross margin analysis, and the definition of gross margin should be consistent more or less across the data organization. So in the semantic layer owned by the data team, we’re going to define gross margin as revenue minus cost of goods sold, and that’s the definition that everybody in data should use. Am I getting that?
Chad Sanderson: Exactly. Yup.
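To make that worked example concrete, here is a minimal sketch of how such a semantic-layer definition might be written down in SQL; the table and column names are hypothetical, not from any actual accounting product.

```sql
-- Hypothetical semantic-layer definition of gross margin, built only on the
-- descriptive-layer facts owned by software (revenue, cost of goods sold).
SELECT
    order_month,
    SUM(revenue)                            AS revenue,
    SUM(cost_of_goods_sold)                 AS cogs,
    SUM(revenue) - SUM(cost_of_goods_sold)  AS gross_margin   -- the one shared definition
FROM source.order_financials    -- descriptive layer: emitted by the product's services
GROUP BY order_month;
```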
Josh Benamram: OK. Interesting. So going back to our earlier discussion around ownership, who in the data team owns that semantic layer? Who owns what gross margin means: is that platform, engineering, analytics, or somewhere in between?
Chad Sanderson: Yeah, and this sort of goes back a little to our organization. Our organizational model is that it’s either a business internal customer or it’s product. So if you’re talking about a financial concept, I think the ownership should lie with the finance team. If you’re defining a concept in one of our core domains, like shipments in Convoy’s case, it’s different. You know, we really care about shipments and how they move across the country, shipment volume, shipment margin, things like that. So the team that owns the shipment service, which has all the source data for shipments, is probably going to define what those shipment metrics are. And there’s always going to be some level of collaboration as these things change. Maybe your accounts team wants to slightly change what an active shipment means or what an active shipper means, and that’s then a collaboration with that team. But the core concept of what a shipment is should be owned by software engineering.
Josh Benamram: OK. Interesting. On this layer that’s owned by data: I really agree with the observation you had that as the cloud-native warehouse came about, people cared less about the structuring of data coming into it. We see more ELT happening, we see free or very cheap storage, compute a little more democratized, tooling getting easier for folks to work with downstream. This did kind of bring about, I think, more chaos in the semantic layer. I’m curious, for the definitions that go there, what actually makes up the semantic layer? Are there any interesting tools for defining that, and how does Convoy manage it today? Another thing I’m seeing is that layer being owned by something like LookML on the Looker side of things, and people wanting to push that more down to the warehouse and looking for some way to do that. So I’m curious what you’ve seen at Convoy, or what you’ve felt.
Chad Sanderson: So before there were too many products in this space, we built a metrics repository internally at Convoy to serve that function. And the way that repository works is it’s like CI/CD for metrics. Basically, you define a SQL file, which is a select statement, and a YAML file, which has some metadata about the metric; it goes through a PR review process and people sign off on it. That started off as part of our experimentation tool, which is very, very common, because if you’re running experiments, then by definition you’re going to need some standardized set of metrics that teams are going to reuse over and over and over again. So that’s where it started. But what we realized after we built it was that we should actually just have a single surface for metric definition. And then we realized we should probably have a single surface for the definition of all of these semantic concepts. So we’re working with a third-party tool right now called Trace, and Trace is providing that metrics layer as an API. You have a nice interface, you log into it, you can create a new metric definition or dimension. They pipe all of that information into a cube, and after it is sliced and diced every which way, they provide an API which you can then plug into, you know, a Tableau or an experimentation tool or any other tool that you want to use. The reason I like that model a lot is that it creates a very clear separation of concerns. If you’re a business consumer and you want a new metric, and that’s an aggregation of an event, you can request it, a data person can go and implement it very, very easily, and if that definition ever changes, you have a very nice history of how the metric changed over time. You create an API and it gets used. One of the only issues with Looker is that you’re consolidating these metric definitions within the Looker platform, whereas there are many use cases for metrics across a business, I think.
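As a rough illustration of what one entry in a metrics repository like that might look like (purely a sketch; the file layout, metric name, and metadata fields are assumptions, not Convoy’s actual format):

```sql
-- Hypothetical metrics-repository entry: one SQL file per metric, reviewed via PR.
-- A sibling YAML file (sketched here as a comment) would carry metadata such as:
--   name: completed_shipments
--   owner: shipment-service-team
--   description: Count of shipments that reached the 'completed' state
--   grain: daily
SELECT
    DATE_TRUNC('day', completed_at) AS metric_date,
    COUNT(*)                        AS completed_shipments
FROM source.shipment_events
WHERE event_type = 'shipment_completed'
GROUP BY 1;
```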
Josh Benamram: Yeah. So do you see that also as a really core ingredient to drawing down data debt?
Chad Sanderson: Yes. Yes. I think the descriptive layer and the semantic layer are both very, very key in reducing that debt. The descriptive layer is basically solving the first problem that I mentioned, where engineers are changing things, things are breaking, and people aren’t getting the right data they need to actually create those queries. And then the semantic layer is solving the problem of “well, I don’t really trust this data because I don’t understand the SQL that’s written here; there’s not a lot of governance or ownership, there’s no quality, and so I can’t trust it.” And I think if you pair both of those together, and there’s a nice relationship between the two of them, then the data debt is going to be radically lower.
Josh Benamram: Interesting. So Trace would be the solution that you’re using for the semantic layer. Is there an analogy for software teams at the descriptive layer? Is there a Trace for that software-to-data handoff?
Chad Sanderson: There’s not. That is a gap, and so that is something we’ve been working on internally at Convoy; we’ve been rolling it out over the past year and a half or so. Essentially, all the pieces to do this exist already: you’ve got protobuf for schema management and validation, you’ve got Kafka. So we have a library, it’s protobuf-ish, and engineers can use it to define schemas in their service. We stream the events via Kafka directly into Snowflake. Our data engineering team owns the parsing mechanism that I mentioned before, which essentially builds out an event table, and there’s a mechanism for data scientists and data consumers to define the data they need, as and when they need it, in an iterative way. So if you were to say, “I would like to record a shipment canceled event, that’s really important to me as a data person, and we’re not capturing that directly right now,” I can ask for a shipment canceled event: it has a schema with these properties, and here’s the team that should own it. The engineering team goes and implements it, and then very, very quickly I see that appear in its own table and I can use it as a metric.
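As a rough illustration of that flow (not Convoy’s actual tooling), the sketch below shows what the warehouse side of such a request might end up looking like once the agreed event is implemented and streamed in; all table, column, and event names are hypothetical.

```sql
-- Hypothetical result of an agreed event definition: once engineering implements
-- the 'shipment_canceled' event and it streams through Kafka into Snowflake,
-- it lands in its own event table with the schema the data consumer requested.
CREATE TABLE IF NOT EXISTS events.shipment_canceled (
    shipment_id   STRING,
    canceled_at   TIMESTAMP,
    canceled_by   STRING,      -- e.g. 'shipper' or 'carrier'
    reason_code   STRING
);

-- The consumer can then use it directly as a metric, with no reverse engineering:
SELECT DATE_TRUNC('day', canceled_at) AS metric_date,
       COUNT(*)                       AS shipments_canceled
FROM events.shipment_canceled
GROUP BY 1;
```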
Josh Benamram: Do you have a name for that? So I don’t keep calling it “Trace for the descriptive layer.”
Chad Sanderson: You know, I don’t actually.
Ryan Yackel: Let’s call it Chad’s Platform. There you go.
Chad Sanderson: Maybe. I’ve been calling this whole concept the immutable data warehouse; the internal name for it at Convoy is Chassis. So maybe we can call it Chassis.
Josh Benamram: OK, OK, cool. So I’ll call it Chassis for now, just to save my word salad. So with Chassis, what exactly does this look like? Because this is helping to move more of the structuring of data entities into software engineering. Say I’m a software engineer, I’m building an application, I’m rolling out a new feature. Let’s say I’m this accounting company and it’s a new feature to calculate, I don’t know, it’s going to sound horrible, gross profit within our product. How do I engage with Chassis as a software engineer to make sure that this data flows down the right way to the data org?
Chad Sanderson: Yes. So the core philosophy that we’re proposing is that the software engineer should treat their service-level data as a product, and they should consider the downstream consumers of that data as customers. And that necessitates an API. So it’s really a platform for designing data APIs, and we call those data contracts. There are a couple of things Chassis does. The first thing is that it’s a definition layer for the data team. So if you know there’s some new feature being built, like a payment processing tool or something like what you mentioned, you might say: I want to know every time a customer completes a transaction. I also want to know every time a transaction fails. I want to know every time a customer quits the application.
Josh Benamram: Sorry, this would be an analyst or a scientist coming to the software engineer, maybe through Chassis, like opening a ticket, and saying these are the entities I’m looking for, right?
Chad Sanderson: Right. These are the things I’m looking for, this is what I need to answer the business questions I’ve been asked, and here’s the schema, you know, the properties. So that might be, for every payment canceled event: I want to know what was bought, I want to know the item ID, I want to know how much money it would have cost. So you’re emitting things directly from the event itself instead of having to join across five or six or seven tables for that information. You’re specifying that in one place, and you’re providing a surface for the engineers to review it with you and have conversations, because very frequently they’re just not able to produce the things that the data scientists want; the service isn’t set up that way, there’s some crazy logic happening or some really weird state machine going on. So it’s providing a place to have that conversation back and forth and decide on what the actual event and schema should look like. And then we also provide a library for essentially creating those endpoints, using the protobuf approach that I mentioned and Kafka. So the engineer goes and implements it, and then we can validate: we can essentially read what was defined in Chassis and say, OK, this thing that you emitted from your library matches the event as it was described in Chassis, so we’re going to allow you to make that change. And if it doesn’t, we can block you. So there are more pieces to it, but I would say that’s the core. The piece that we’re working on now, which we’re still doing manually and where the data engineers are getting involved, is the mapping between what the data science team and analysts have defined and the schema that’s actually been implemented in production by the software engineering team. Once that mapping happens, you can automatically build the data warehouse. And in the ideal world, you can even define how you want those events to be aggregated. So you might say: I have this new event called “customer buys a thing,” it’s a transactional event, and I want to add it as a column to a table that’s just called transactions, and it aggregates in this way. You can specify all of that in one place. And because we have all that information, once the joins happen, we can use something like dbt to do all the transforms automatically, build out data marts, and drop them into Snowflake.
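A minimal sketch of the kind of automated aggregation that last step describes, assuming a hypothetical "customer buys a thing" event table; this is the sort of model a tool like dbt might run, and none of these names are Convoy’s.

```sql
-- Hypothetical dbt-style model: once the contract mapping exists, a transform
-- like this could be generated to roll the new event into a 'transactions' mart.
SELECT
    customer_id,
    DATE_TRUNC('day', occurred_at)  AS transaction_date,
    COUNT(*)                        AS purchases,
    SUM(amount)                     AS total_spend
FROM events.customer_buys_a_thing   -- event table produced from the data contract
GROUP BY customer_id, DATE_TRUNC('day', occurred_at);
```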
Josh Benamram: Yeah, what’s interesting is, well, first of all, it’s just fascinating to hear about this new interface between software and data that you’re putting a lot of time into. We think a lot about the interface between data platform and data analytics and science; this is pushing it a little bit left, which is really cool. The other thing that comes to mind hearing you talk about it is how data is really in the driving seat. They’re determining what is important for the data organization in terms of where these definitions lie. So it sounds like software is the provider, right? Data is coming to them and saying: this level of definitions we’re going to keep in our semantic layer, this level people can free-flow in their own analytics and business logic, but here’s what we need from you in order to make sure we keep everything efficient downstream. But the question that comes to mind is how this flows into the normal project management, the sprint management, of a software team, because you’re effectively adding a new customer to that. Right?
Chad Sanderson: Yeah, you’re 100 percent correct. You are adding a new customer, and it is more work for the software engineer. You know, this is something we spent some time on when we were proposing this at Convoy initially, when some engineers were asking: what’s the benefit for me to start doing this? What’s my incentive as a software engineer to change my behavior, when now I have a new set of APIs to support? And the reason we were able to push this through is because it is not actually a software engineering decision, it’s a business decision. And the way that we made our pitch to the business was: look, you can have software engineers take on a little bit more ownership, which means they may be able to do a little bit less on the feature side. But the benefit is you could make tens to hundreds of millions of dollars downstream, because your models get radically better, you’re able to answer so many more questions that you couldn’t before, and business people, you’re able to get your answers much faster. So one example of a question that we’ve always really struggled to answer at Convoy had to do with the event history of any particular entity. I mentioned this when we were talking earlier, but shipments go through a very non-linear business cycle. They start with the shipper deciding whether to award that freight as a tender; it gets put on our marketplace; carriers, who are businesses that own trucks, will bid on it; then once they’re awarded the freight, they will take the shipment on a lane. And sometimes things happen: a truck can break down, things can be canceled, ETAs can get moved. Understanding exactly what happened through the lifecycle of any particular shipment is really, really hard in our preexisting data ecosystem. And because we didn’t know that, we were losing out on massive product opportunities. If you could identify, oh, wait a second, in the Northeast it seems like any time this very particular set of things happens, it leads to a shipment being late, and that leads to $30 million of lost margin every single year. We couldn’t even begin to answer questions like that, because the query to produce that for a single shipment was incredibly, incredibly difficult. A data person first needed to write a query for a very particular shipment, then they needed to understand how that shipper or carrier interfaced with our operations team (they’ve maybe sent some emails back and forth, so that has to come from a user interface), then they have to figure out how the shipper interacted with our website to track the status and tie it back to the shipment, and they weren’t able to model it directly because they’re lacking IDs. It’s spaghetti, and so you really can only do that type of analysis for a single shipment at a time, if you’re very, very motivated. So there’s a whole class of analytics and machine learning that just can’t happen in today’s world at Convoy, and this system is unlocking that.
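To illustrate the kind of lifecycle analysis that becomes tractable once events land in their own well-defined tables, here is a minimal, hypothetical sketch; the table name, event names, and ID are assumptions, not Convoy’s.

```sql
-- Hypothetical event-history query over contract-backed event tables:
-- reconstructing a shipment's lifecycle becomes a simple ordered scan
-- instead of a reverse-engineering exercise across many services.
SELECT
    shipment_id,
    event_type,        -- e.g. 'tendered', 'bid_placed', 'awarded', 'truck_breakdown', 'delivered'
    occurred_at
FROM events.shipment_lifecycle
WHERE shipment_id = 'SHIP-12345'   -- or drop the filter to analyze all shipments at once
ORDER BY occurred_at;
```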
Josh Benamram: Very, very interesting. I think there is some slice of this that actually is motivating to the software engineering organization, but I’m just thinking about how I would go to our software engineers and say we’re going to start doing this, and what would motivate them. I think it’s probably product analytics that could be a door opener here for the teams that are getting a lot of pushback from the software org. If before this, the product analytics coming into the data team is very raw, and it takes a long time for the data organization to turn around new questions for the product org, like how is a certain feature being used, what’s our utilization on x, y, z, that might be a good door opener to say: well, if we have this more thought-through layer where we can get better data directly from the product itself through software, I’ll be able to turn around these questions a lot faster for you. Maybe we bring this in, and then it can grow into other domains that are less related to the actual product. But it makes total sense: you’ve got to make the case to the business first. And it’s just interesting also to hear how these tradeoffs are talked about, between shipping new features faster, which is how most software teams operate, versus getting more insights into the business, and finding that balance.
Chad Sanderson: Exactly. And that’s sort of the endless tug of war that’s happening today. But the way I’ve been describing this a lot internally at Convoy is that agile software development went through a similar lifecycle. Before, in the late 90s and early 2000s, you kind of had waterfall. That was a safe way of deploying code, but it was really, really slow; there was tons of governance. Then you could move to a more agile methodology, but it was pretty unsafe: stuff was going to break, you were going to accumulate tech debt, until Git came along and said, all right, now we’re going to give you the ability to, you know, version your code and do branches, because source control is really important. Fundamentally, you’re creating a light layer of governance. And with the addition of GitHub, you’re bringing in more collaboration and peer review, which are things you needed in the waterfall world; you’re just breaking that down to a lighter layer, right? You don’t need to have 20 people in the loop now, hitting every single software architect. You only need the two or three on the team who understand how that service works. And I think that’s the type of thing we need for data: a system of thinking like that, which lets us take a middle point between really hardcore, governed ETL and very, very loose ELT with basically no governance.
Josh Benamram: Yeah, interesting. OK, so the medicine that I’m pulling out of this for the core problem: if you’re a data team, draw down your data debt by thinking about these two layers of definitions within your org. One is really close to the software team, the descriptive layer; the second is independently owned within the data organization, the semantic layer. That will help build more consistency across the metrics and the data you’re using and centralize more definitions, so there’s less spaghetti, chaotic code or SQL being written by folks across the org, which is the accumulation of this debt.
Chad Sanderson: Right. Yup. In summary.
Ryan Yackel: Well, man, that was a lot of stuff to talk about. I will say I have one thing to add, though. I appreciated everything you were talking about, Chad, with the freight systems, because I definitely connected with that. Back in 2011 or 2012, I used to work for Macy’s as a software tester on the big ticket delivery system. So I used to joke to people: hey, if you didn’t get that furniture piece you ordered, which got delivered through JDA, it’s probably my fault because I didn’t test it correctly, because that thing was so complicated. Like, you had a scanner, you’d scan the merchandise, the merchandise would get put on a pallet, the pallet would move over here, you had to book it in the system, you had to schedule it in the standby system. It was like a billion steps to get something to your door. It was crazy. So I feel all of that; even though you’re very much in the weeds, I was having a little flashback to my days of testing the big ticket delivery system over at Macy’s. So I think we’re going to wrap it up. This was a really awesome talk, and I think anyone listening is going to take away a lot from the stuff Josh just recapped. Chad, how can people get in touch with you, man? Do you have a blog, LinkedIn?
Chad Sanderson: I do. You can look me up on LinkedIn and also reach me by email. My LinkedIn is just Chad-Sanderson, and my email is [email protected]. Feel free to send over any questions you have, or reach out if you want to connect. I love talking about these issues.
Ryan Yackel: Awesome, man. Yeah, I appreciate you sitting down with us on the Mad Data podcast. Hopefully we can talk soon and see each other at the next Data Council conference, maybe, who knows? But it was great to have you on the show, man.
Chad Sanderson: Thank you.
Ryan Yackel: Hey, everyone, welcome to the Mad Data podcast. My name is Ryan Yackel, I’m one of the hosts and we also have Josh on the line as well. One of our other hosts, he’s the co-founder and CEO over at Databand. We also have Chad Sanderson, who’s the head of data platform over at Convoy. Chad, how are you doing, man?
Chad Sanderson: Doing well. Great to be here.
Ryan Yackel: Good, I know that right before this discussion, we were talking about your title of data platform, I know Josh you wanted to ask some questions about that.
Josh Benamram: Yeah, yeah, it’s interesting because we’re starting to see more specialization and nuance in the titles forming around data leaders that we talked to. So I’m curious, what does what is head of data platform mean for you at Convoy? What are your areas of ownership and responsibilities and how do you land on that title to describe what you do?
Chad Sanderson: Yeah. So data platform at Convoy refers to Convoys data infrastructure. So it’s all the tools the platforms, everything from ingestion to ETL to data warehouse to the way that our internal customers leverage data to make business decisions. So we own an experimentation platform, a machine learning pipeline, all of our BI and reporting tools. And the reason that the role emerged is because, teams realize that it wasn’t enough to just attribute all of our data infrastructure to IT spend and go through a checkbox making decisions on like what products we should buy and what we shouldn’t, it really needed some product thinking to come up with a plan and a strategy around how to evolve our data infrastructure to suit Convoy’s growing needs for machine learning and experimentation. So there’s a lot of things that we’ve bought, but we also build a lot of things to where we’re a bit ahead of the market and we use a lot of open- source projects as well. So my role is to help with the prioritization of those projects, to determine when we should build versus when we should buy and where we need to do something innovative.
Josh Benamram: Really interesting, thanks. Sometimes when we when we talk to folks in platform, it’s really common to hear that area of responsibility that you just described platform being the team that is responsible for the infrastructure within the data organization, all the different tools, services, solutions that people are using to ingest ETL, ELT, consume data. Sometimes, we’ll also see that platform on some level of the actual pipelines themselves, often on the ingestion side of things. So the second area of responsibility, often being platform, will also make sure that there’s some level of raw or prepared data that’s available to the analysts or data scientists, some data engineers downstream to begin doing their work. And this helps that area of the business not focus as much on the data prep, but more on the data product development. So I’m curious are are you are you exclusively on the infrastructure layer? Is there some level of the pipelines themselves that you own?
Chad Sanderson: We do. We take a similar approach where we are. Our goal is not to own any business logic. We want to stay abstracted away from that. But when it comes to like the processing of raw data, we do want to ensure that as new events land in the data warehouse that we have a set of automated systems which run that, like, you know, aggregates information as needed, parses it cleans the JSON up and makes it accessible to our downstream consumers so that they can plug it into a model or do analytics on it without having to go through the headache of tons of transformations themselves.
Josh Benamram: Interesting. And it is that line of ownership, something that just constantly shifts and moves across time?
Chad Sanderson: It doesn’t, actually, we have a we have a pretty clear delineation of what the data platform team owns and what the downstream product teams own. So historically it was pretty blurry. But the model that we’ve moved to quite recently actually is that at Convoy, we have a few different schemas. We have a schema for that we call landing zone, where basically broad JSON lands. We have another schema called source, where that raw JSON is transformed into some usable tables that are renamed the JSON is parsed. We own those two steps of the pipeline. Once it goes past that, it enters what we call sort of the data mart. So it enters the Kimball world, and it’s the responsibility of the product teams to organize that data into fact in dimension tables. And and and it’s really based on what data is coming from the services. So if a service is producing data for an entity like, let’s say, an auction, then that product engineering team owns the auction data. They own the creation of the dimensional models for auctions and all of the corresponding facts.
Josh Benamram: Really interesting. Thank you. Last question before we get to the topic today about it, at what point in Convoy’s growth did you start to feel the need for that separation of responsibility or data platform emerged?
Chad Sanderson: Yeah, so originally this was back in 2018. Early 2019, the data engineer was embedded on the product teams. And that didn’t go very well for a few reasons. I mean, one of the reasons was because there wasn’t a clear sort of management hierarchy. We didn’t have enough data engineers at the company so that they they could report up to a data engineering manager and the data engineer manager might report to the CTO. We didn’t really have that, the data engineers reported through the data science function. And those roles are just so different that oftentimes the data scientists and the analysts were being prioritized over what the data engineers needed and that that wasn’t good. The other problem was that the data engineers were often being treated kind of like SQL monkeys. Consumers were writing a bunch of SQL. They were building training data, they were doing experiments and things like that. And the data engineers were essentially told, “Hey, can you guys go in behind these folks and and clean the pipeline up and rebuild it and make sure that the SQL is scalable?” And and nobody really likes doing that work, especially if you’re an engineer and you like building things and you have a very, you know, rare skill set. So then we move to a centralized model where all the data engineers were centralized under one team, which was within the infrastructure organization, and we started supporting more tooling. But there were additional problems, right? Like the reason that the data engineers were embedded was because the data teams weren’t actually able to write scalable SQL on their own. And that created a whole bunch of different problems. So we kind of like went back and forth on this model for a bit until we decided on the one we have today.
Josh Benamram: Okay. It follows a lot of trends that we see. I’m I’m trying to get to like a rule of thumb, like if you’re a data team of 10 people, get some data platform folks start to do this operation, like where does the boundary tend to emerge? It’s going to feel different in different organizations, of course. But do you have a sort of rough benchmark like that that you would suggest to teams that are starting to scale quickly?
Chad Sanderson: Yeah, it’s like you said, I think it is going to be a bit sensitive to like how much data that do you have, how complex is the data? Convoy’s data is a reasonably complex. We have tons of different entities and lots of different real world events that are happening, and it’s very non-linear. So the modeling becomes incredibly, incredibly important. The architecture becomes incredibly important. And so for that, we probably lean a bit more heavily into the data engineering function than than other teams might at a similar stage.
Josh Benamram: OK, interesting. All right. Well, I think I’ll pause that thread now and we’ll save that for our next podcast and we can get back to our core topic today around that.
Ryan Yackel: All right, so so let’s talk about the main topic today. So Chad, when you and I talked, we came up with this topic of how to avoid a vortex of data debt. And you talked about that most companies face this technical data debt problem, and they’re not built for scale. And by the way, everyone should just keep this in mind that’s listening to this Chad post all the time on his LinkedIn, and he posts some really good content. So be sure to to follow him on on LinkedIn. And one of the things that I think we should start off with, maybe is Chad. Like, What’s your definition of bad data debt? I guess all debt is bad, but what’s your definition of what? What creates this data debt?
Chad Sanderson: Yeah, yeah. So two sort of parts of that question that was sort of my definition of what data debt is and then what creates it. I would say that data debt is really the the data warehouse equivalent of technical debt. When you have a build up of queries and SQL statements that are fundamentally unscalable that are propagating through the data warehouse at a rapid pace, which results in a lack of trust. And also a lack of usability and ownership. I would define that as the business has a data debt problem, meaning that in order for you to trust the data or to leverage that data effectively, you need to pay down the debt. And paying on the debt would require rebuilding those queries, potentially refactoring upstream sources going through a model, a modeling and data architecture effort. And a lot of businesses find themselves here, Convoy certainly found themselves there to different degrees. Some folks have it really bad where the data debt is spiraling out of control. Other folks don’t have it as the bad because they may be a bit earlier in their journey. And maybe they just got dbt. They just started hiring analysts. So the problem hasn’t really began to raise its head yet. In terms of what causes it. I think that there’s really two problems that end up emerging sort of upstream quality issues and downstream quality issues. The upstream quality issues often emerge in a ELT based environment when we are extracting data from first party sources, so like a production database, and S3, we’re piping that data into a data warehouse like Snowflake. And. The software engineer who owns that production database. Really is treating it. Like an implementation detail of their service. The data is not fundamentally intended for analytics or machine learning, and because it’s an implementation detail of the service, it means that the engineer has the right to change it at any time. If they have that right, then it means that they can and will change it any time. Data science and analyst hate dependencies on it, and they break, so that’s data quality, problem number one, that happens very, very frequently. The second data quality problem is based around the first. If you live in this world where the software engineering team isn’t really omitting the data that you need from your first party sources, it means the analysts and the data scientists and any other consumer. Is going to have to reverse engineer important business concepts using SQL. For example, at Convoy, we have a very important business concept called shipment categorization. You don’t need to know exactly what that is, but just know that our sales team thinks it’s really important. Our machine learning team for like pricing models think it’s really important. It’s a pretty important component of our of our business model. But the upstream services where this data could be collected doesn’t care about shipment categorization. It doesn’t need that information in order for the service to operate, and so we don’t record it. And that means that someone downstream is going to have to build a SQL table, a SQL file in order to capture this concept of shipment categorization. And that is really hard and it’s very, very complex. So what a data scientist basically has to do is look upstream. They have to figure out they have to grok all this information that’s happening in production. They have to understand the code pass. They have to understand a lot of different like state machines. 
Then they have to write the SQL, which is usually some combination of many, many joins and a bunch of if/else statements. And when they do that, if they do it successfully, and oftentimes they fail, they're not software engineers, so the code is not written in a very scalable way. There is no unit testing, there's no documentation. And then that very important business concept is taken as a dependency by many other teams because it's critical, right? It's something you want in your table or in your model or somewhere else. These two issues then combine: the upstream is changing all the time, the software engineers technically own the source data, but there's no contract between the producer and the consumer, so any time they want to change it, they can. And at the same time, you have all these downstream models being built with many, many lines of SQL, tons of joins, and low quality, and they're breaking all the time. And that creates debt.
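To make the kind of reverse-engineered model Chad describes concrete, here is a minimal, hypothetical SQL sketch of a downstream shipment categorization query. Every table, column, and category name is invented for illustration; it is not Convoy's actual logic, only the shape such models tend to take.

```sql
-- Hypothetical sketch of a downstream "shipment categorization" model,
-- reverse-engineered from raw service tables (all names are invented).
SELECT
    s.shipment_id,
    CASE
        WHEN c.cancelled_at IS NOT NULL                              THEN 'cancelled'
        WHEN t.tender_status = 'AWARDED' AND a.auction_id IS NULL    THEN 'contracted'
        WHEN a.auction_id IS NOT NULL AND a.winning_bid_id IS NOT NULL THEN 'spot_market'
        ELSE 'uncategorized'
    END AS shipment_category
FROM raw.shipments s
LEFT JOIN raw.tenders t       ON t.shipment_id = s.shipment_id
LEFT JOIN raw.auctions a      ON a.shipment_id = s.shipment_id
LEFT JOIN raw.cancellations c ON c.shipment_id = s.shipment_id;
-- In practice these models accrete many more joins and branches,
-- with no tests or contracts protecting them from upstream changes.
```

A query like this quietly encodes business rules that no upstream service guarantees, which is why it breaks whenever the source tables change.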
Josh Benamram: Interesting. So it sounds like a lot of the root causes you're attributing to this creation of debt come down to the non-optimal, bad, or lacking structure of data coming from upstream locations. You're pointing a lot at the software engineering side of the house: they're not giving enough thought to how to properly structure data that's going to flow into analytics and ML. And that leaves analytics and ML with this snowball of complicated logic they need to build up, which is generally not optimal to run within the warehouse layer, where it can be really expensive to do so. Am I playing that back right?
Chad Sanderson: That’s exactly correct. Exactly correct. And the last the last problem that you pointed out is sort of the end result of all this, which is if it’s hard to understand, it’s hard to grok what any particular SQL Query is doing because it’s so complex, then it can oftentimes be easier for the data consumer to just build their own query. Or maybe it’s a very, very expensive query to run and everybody starts hitting it. And when that starts happening, the cost of transformations in the data warehouse starts going exponential, which is exactly the problem that the move to the cloud data warehouse was meant to solve. Right?
Josh Benamram: Yeah, I was going to ask you: if I'm a Snowflake salesperson, I'm telling you, don't worry about it, we're infinitely scalable, we'll perform as your queries become more and more complicated, this is what we were designed for. So what's your answer to that?
Ryan Yackel: I have a quick question for Chad, though. This is a dumb question from a dumb marketer, but is there a similar thing going on with the way IBM operates its MIPS-based mainframe pricing model? Where basically every transaction hits the mainframe, whether it's in a testing environment or a production environment, and you get billed every single time. Is that same problem, let's say a legacy model from the on-prem world, now coming to the cloud like we just talked about, or am I off?
Chad Sanderson: I think in some ways, yes, it's similar to what's happening. We have queries at Convoy that take 20 or 30 minutes to run just due to the complexity of those queries. And if your query takes that long, and you're hitting so many tables, and there's so much compute happening, and those transformations are being run thousands or tens of thousands of times a day, the bill is going to start adding up. But I think there's another cost problem outside the pure volume of computation, which is the cost of the team that's required to support that model. It becomes enormous. You need to start transitioning analysts who previously only really needed to know SQL. Now they need to know dbt; they need to become more like analytics engineers, which is the title that's becoming more popular. They need to know how to use a command line; they need to become software engineers. And you need more and more of these people, because as a data warehouse becomes more complex, a single analyst is not going to be able to handle on-call for all of the models that their team owns. So there's a computation problem, a transformation issue, and then there's also a people issue, which is one of the reasons it is very challenging for an enterprise company to adopt tools like dbt. There is so much data, they already have a massive analyst population, and now they need to start retraining all of those analysts to have a totally different job. It's just not going to fly for a lot of those big companies, especially where this sort of semantic layer of metrics, features, dimensions, and facts is incredibly critical to business operations. It can't be wrong, and it can't be late.
Josh Benamram: If I can play devil's advocate for a second: I'm the software engineer, right? Data is coming to me and saying, you really need to make this data more usable by the downstream consumers, there's too much work getting pushed down to them, and all these very legit reasons as to why this is accumulating debt in our warehouse and leading to giant bills from our warehouse providers. We really need this data to be better structured from you as soon as you send it. The software engineer says: we went through this a year ago. Data came to me and said we need more control over how this data is derived; we have all these new personnel now, we have this new analytics engineer who can go in and write more complex SQL and use better pipelining tools to own more of the logic. A year ago, data told me I was constraining things too much. So where is the balance, right? How much does software own versus how much does data own? And granted, it's going to take a while for the bigger teams to adopt tools like dbt, but these services do help democratize some level of the logic creation. So how would you answer that kind of expected pushback from software about going through another cycle of, all right, we'll own more of the logic?
Chad Sanderson: I think what you need is both. You do need to know what the services are actually doing; that's important. But you also, where possible, need to be emitting semantic information, real-world events that are happening, that is very, very simple to join. And in my experience, it's actually the latter where data scientists are going to be spending the vast majority of their time. That's pretty reflective of how data architecture is done in what I would call legacy businesses. If you look at the Boeings of the world, the companies that probably haven't made the move to Snowflake yet, with very traditional ETL models, they usually have a very heavy layer of data governance and data architecture that sits at the top of the entire system. They do all the transformations in advance, and they make sure that the data warehouse is basically a one-to-one mapping to how the business actually works. What we realized in the move to the cloud is that there does need to be a level of flexibility so that data science teams can still answer interesting questions outside that model. But that model still needs to exist. Essentially, what we've done as we've moved to the cloud is cast that aside and said, we don't really need this map of the business anymore, we can do all the transformations in the data warehouse. And what we've experienced at Convoy, anyway, is that this creates innumerable permutations of the exact same concepts with very, very slight differentiation, where the SQL is not written very well and doesn't scale very well. So I think that would be my argument: there is room for both. But that view of the world coming from services, one that maps out what the business looks like, which entities we care about, which real-world events are happening, and how the entities are tied together, should exist in some form.
Josh Benamram: OK, so I’ll call that the semantic layer. Tell me if I shouldn’t.
Chad Sanderson: I don’t know if I would call that the only. I would personally call it the semantic layer. But I think in the modern data stack, the semantic layer means something a little bit different. I think there’s there’s probably two layers at play here. The layer that I’m I’ve described just now, maybe I would call it the descriptive layer. So that’s that’s basically like, you know, let’s model out the entities and events that that sort of power our business. And then the semantic layer is now let’s transform that entity and event data into like logical constructs that we can use to make business decisions like margin as a metric. Is that an example? Like, that’s a simple concept.
Josh Benamram: OK, so using that example, let's say we're an accounting software company. You would imagine the software engineering team owning something like the definition of the metric of revenue, right? Software engineering sends down revenue and they send down cost of goods sold. Those are defined metrics; they're kind of instantiated on the software side. Data analytics, analytics engineering, and data science are not going to change those too much. That's the layer of description that software owns. Then as that comes down to the business, depending on what kind of financial analysis we want to do, the definitions are maybe less immutable; it depends on what the priorities are. Are we optimizing toward increasing revenue, decreasing churn, increasing profit? We may want to build a dashboard that does a lot of gross margin analysis, and the definition of gross margin should be consistent more or less across the data organization. So in the semantic layer owned by the data team, we're going to define gross margin as revenue minus cost of goods sold, and that's the definition that everybody in data should use. Am I getting that right?
Chad Sanderson: Exactly. Yup.
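For readers who want to see the shape of such a definition, here is a minimal sketch of gross margin living as versioned SQL in a data-team-owned semantic layer. The table and column names are assumptions for illustration, not a real schema.

```sql
-- Hypothetical semantic-layer definition of gross margin,
-- built on metrics the software side already emits (names invented).
SELECT
    order_month,
    SUM(revenue)                        AS revenue,
    SUM(cost_of_goods_sold)             AS cogs,
    SUM(revenue - cost_of_goods_sold)   AS gross_margin
FROM finance.order_facts
GROUP BY order_month;
```

The point is that revenue and cost of goods sold arrive already defined from the software side, and the data team layers the derived metric on top.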
Josh Benamram: OK, interesting. So going back to our earlier discussion around ownership: who in the data team owns that semantic layer? Who owns what gross margin means? Is that platform, engineering, analytics, or somewhere in between?
Chad Sanderson: Yeah, and this goes back a little to our organizational model: ownership sits with either an internal business customer or a product team. So if you're talking about a financial concept, I think the ownership should lie with the finance team. In Convoy's case, one of our core domains is shipments: we really care about shipments, how they move across the country, shipment volume, shipment margin, and things like that. So the team that owns the shipment service, which has all the source data for shipments, is probably going to define what those shipment metrics are. And there's always going to be some level of collaboration as these things change. Maybe your accounts team wants to slightly change what an active shipment or an active shipper means, and that's then a collaboration with that team. But the core concept of what a shipment is should be owned by software engineering.
Josh Benamram: OK, interesting. On this layer that's owned by data: I really agree with your observation that as the cloud-native warehouse came about, people cared less about the structuring of data coming into it. We see more ELT happening, very cheap storage, compute a little more democratized, tooling getting easier for folks to work with downstream. And this did, I think, bring about more chaos in the semantic layer. I'm curious about the definitions there: what actually makes up the semantic layer? Are there any interesting tools in that space, and how does Convoy manage that today? Another thing I'm seeing is that layer being owned by something like LookML on the Looker side of things, with people wanting to push it further down to the warehouse and looking for some way to do that. So I'm curious what you've seen at Convoy.
Chad Sanderson: So before there were too many products in this space, we built a metrics repository internally at Convoy to serve that function. The way that repository works is it's like CI/CD for metrics. Basically, you define a SQL file, which is a select statement, and a YAML file, which has some metadata about the metric. It goes through a PR review process and people sign off on it. That started off as part of our experimentation tool, which is very common, because if you're running experiments, by definition you're going to need some standardized set of metrics that teams are going to reuse over and over again. So that's where it started. But what we realized after we built it was that we should actually have a single surface for metric definition, and then we realized we should probably have a single surface for the definition of all of these semantic concepts. So we're working with a third-party tool right now called Trace, and Trace is providing that metrics layer as an API. You have a nice interface, you log into it, you can create a new metric definition or a dimension, they pipe all of that information into a cube, and then, after it's been sliced and diced every which way, they provide an API which you can plug into Tableau or an experimentation tool or any other tool you want to use. The reason I like that model a lot is that it creates a very clear separation of concerns. If you're a business consumer and you want a new metric that's an aggregation of an event, you can request it and a data person can go and implement it very easily. If that definition ever changes, you have a very nice history of how the metric changed over time, and you create an API and it gets used. One of the only issues with Looker is that you're consolidating these metric definitions within the Looker platform, whereas there are many use cases for metrics across a business, I think.
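As a rough illustration of the "CI/CD for metrics" pattern Chad describes, here is a hedged sketch of what one metric definition in such a repository might look like. The file layout and names are invented, and the metadata that would live in the companion YAML file is shown here only as comments.

```sql
-- metrics/active_shipments.sql  (hypothetical file layout)
-- Companion YAML would carry metadata such as:
--   owner: data-platform
--   description: Count of shipments currently in an active state
--   grain: daily
-- Changes go through PR review before the definition is published.
SELECT
    DATE_TRUNC('day', snapshot_at) AS metric_date,
    COUNT(DISTINCT shipment_id)    AS active_shipments
FROM warehouse.shipment_snapshots
WHERE shipment_status = 'ACTIVE'
GROUP BY 1;
```

Because the definition is reviewed and versioned like code, every consumer of the resulting API sees the same history of how the metric changed.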
Josh Benamram: Yeah. So do you see that also as a really core ingredient to drawing down data debt?
Chad Sanderson: Yes. I think the descriptive layer and the semantic layer are both very key in reducing that debt. The descriptive layer is basically solving the first problem I mentioned, where engineers are changing things, things are breaking, and people aren't getting the right data they need to actually create those queries. And the semantic layer is solving the problem of, well, I don't really trust this data because I don't understand the SQL that's written here, there's not a lot of governance or ownership, there's no quality, and so I can't trust it. If you pair both of those together and there's a nice relationship between the two of them, then the data debt is going to be radically lower.
Josh Benamram: Interesting. So Trace would be the solution you're using for the semantic layer. Is there an analogous tool for software teams at the descriptive layer, a Trace for the software-to-data interface?
Chad Sanderson: There’s not that that is a gap. And so that is something that we’ve been working on internally, at Convoy we have we have been rolling this out over the past year and a half or so. And essentially, like all the pieces to do, this exists already. Like you’ve got protobuf for schema management and validation, you’ve got Kafka. So we have a we have a library. It’s like protobuf-ish, engineers can use it to define schema in their service. We stream the events via Kafka directly into Snowflake. Our data engineering team owns the parsing mechanism that I mentioned before. That essentially builds out a event table, and there’s a mechanism for data scientists and data consumers to define the data that they need as and when they need it in an iterative way. So if you were to say I have a, I would like to record a shipment canceled event that’s really important to me as a data person. We’re not capturing that. We’re not capturing that directly right now. I can I can ask for a shipment canceled event. It has the schema with these properties, and here’s a team that should own it. The engineering team goes and implements it. And then very, very quickly, I see that appeared in its own table and I can use it as a metric.
Josh Benamram: Do you have a name for that? So I don't keep calling it Trace for the descriptive layer.
Chad Sanderson: You know, I don’t actually.
Ryan Yackel: Let’s call it Chad’s Platform. There you go.
Chad Sanderson: Maybe I’ve been calling this whole concept the immutable data warehouse that the internal name for this, that Convoy is called Chassis. So maybe we can call it Chassis.
Josh Benamram: OK, cool. I'll call it Chassis for now, just to save my word salad. So with Chassis, what exactly happens? This is helping to move more of the structuring of data entities over to the software engineering side. Say I'm a software engineer building an application and rolling out a new feature. Let's say I'm that accounting company and it's a new feature to calculate gross profit within our product. How do I engage with Chassis as a software engineer to make sure this data flows down the right way to the data org?
Chad Sanderson: Yes. So the core philosophy we're proposing is that the software engineer should treat their service-level data as a product, and they should consider the downstream consumers of that data as a customer. And that necessitates an API. So it's really a platform for designing data APIs, and we call those data contracts. There are a couple of things Chassis does. The first is that it's a definition layer for the data team. So if you know there's some new feature being built, say a payment processing tool or something like what you mentioned, you might say: I want to know every time a customer completes a transaction, I want to know every time a transaction fails, and I want to know every time a customer quits the application.
Josh Benamram: Sorry, this would be an analyst or a data scientist coming to the software engineer, maybe through Chassis, like opening a ticket, and saying these are the entities I'm looking for, right?
Chad Sanderson: Right. These are the things I'm looking for, this is what I need to answer the business questions I've been asked, and here's the schema, the properties. So that might be: for every payment-canceled event, I want to know what was bought, I want to know the item ID, I want to know how much it would have cost. You're emitting those things directly on the event itself instead of having to join across five or six or seven tables for that information. You're specifying it in one place, and you're providing a surface for the engineers to review it with you and have conversations, because very frequently you're just not able to produce the things the data scientists want; the service isn't set up that way, there's some crazy logic happening or some really weird state machine going on. So it provides a place to have that conversation back and forth and decide what the actual event and schema should look like. And then we also provide a library for essentially creating those endpoints, using the protobuf-ish library I mentioned and Kafka. The engineer goes and implements it, and then we can validate: we can essentially read what was defined in Chassis and say, OK, this thing you emitted from your library matches the event as it was described in Chassis, so we're going to allow you to make that change; and if it doesn't, we can block you. So there are more pieces to it, but I would say that's the core. The piece we're working on now, which we're still doing manually and where the data engineers are getting involved, is the mapping between what the data science team and analysts have defined and the schema the software engineering team has actually implemented in production. Once that mapping happens, you can automatically build the data warehouse. In the ideal world, you can even define how you want those events to be aggregated. So you might say, I have this new event called customer-buys-a-thing, it's a transactional event, and I want to add that as a column to a table that's just called transactions, and it aggregates in this way. And because we have all that information, once the joins happen, we can use something like dbt to do all the transforms automatically, build out data marts, and drop them into Snowflake.
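A small before-and-after sketch may help show why emitting properties on the event matters. Both queries are hypothetical, with invented table and column names.

```sql
-- Without a contract: reconstructing "payment canceled" context by
-- joining across several service tables (hypothetical names).
SELECT p.payment_id, i.item_id, i.price_amount, c.canceled_at
FROM raw.payments p
JOIN raw.order_items i
  ON i.order_id = p.order_id
JOIN raw.payment_state_changes c
  ON c.payment_id = p.payment_id AND c.new_state = 'CANCELED';

-- With a contract: the producer emits those properties on the event
-- itself, so the consumer reads a single table.
SELECT payment_id, item_id, price_amount, occurred_at
FROM events.payment_canceled;
```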
Josh Benamram: Yeah, what’s interesting is it is, well, first of all, it’s just fascinating to hear about this new interface that you’re putting a lot of time into thinking about between software and data. Yes. We think a lot about this interface between data platform and data analytics and science. This is pushing it a little bit left, which is really cool. The the other thing that that comes to mind is just hearing about you talk about. It’s interesting also how it sounds like data is really in the driving seat. They’re determining what is important for the data organization in terms of where these definitions lie. So, so it sounds like, you know, software is is the provider, right? Data is is coming to them and saying this level of definitions, we’re going to keep in our semantic layer. This level, you know, people can free flow in their own analytics and business logic. But here’s what we need from you in order to make sure that we keep everything efficient downstream. But the question that comes to mind is like how this how this flows into the normal, the normal, I guess project management, like sprint management, we’re kind of a software team because you’re effectively adding a new customer to that. Right?
Chad Sanderson: Yeah, you're 100 percent correct. You are adding a new customer, and it is more work for the software engineer. This is something we spent some time on when we were proposing this at Convoy initially. Some engineers were asking, what's the benefit for me to start doing this? What's my incentive as a software engineer to change my behavior and take on a new set of APIs? And the reason we were able to push this through is that it's not actually a software engineering decision, it's a business decision. The way we made our pitch to the business was: look, you can have software engineers take on a little bit more ownership, which means they may be able to do a little bit less on the feature side, but the benefit is you could make tens to hundreds of millions of dollars downstream, because your models get radically better, you're able to answer so many more questions you couldn't answer before, and business people, you're able to get your answers much faster. One example of a question we've always really struggled to answer at Convoy had to do with the event history of any particular entity. I mentioned this when we were talking earlier, but shipments go through a very non-linear business cycle. A shipment starts off as a request, then the shipper decides whether to award that freight as a tender, it gets put on our marketplace, and carriers, which are businesses that own trucks, bid on it. Once they're awarded the freight, they take the shipment on a lane, and sometimes things happen: a truck can break down, things can be canceled, an ETA can get moved. Understanding exactly what happened through the lifecycle of any particular shipment is really, really hard in our preexisting data ecosystem. And because we couldn't know that, we were losing out on massive product opportunities. If you could identify, oh, wait a second, in the Northeast it seems like any time this very particular set of things happens it leads to a shipment being late, and that leads to $30 million of lost margin every single year. We couldn't even begin to answer questions like that, because the query to produce that for a single shipment was incredibly difficult. A data person first needed to write a query for one very particular shipment, then they needed to understand how the carrier interfaced with our operations team, who maybe sent some emails back and forth, and they had to get that from a user interface. Then they had to figure out how the shipper interacted with our website to track the status and tie that back to the shipment, and they weren't able to model it directly because they were lacking IDs. It's spaghetti, and so you really can only do that type of analysis for a single shipment at a time if you're very, very motivated. So there's a whole class of analytics and machine learning that just can't happen in today's world at Convoy, and this system is unlocking that.
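Once the lifecycle lives in a clean event log, the analysis Chad describes reduces to a fairly ordinary query. This is a hedged sketch with invented table, column, and event names, not Convoy's actual data.

```sql
-- Hypothetical lifecycle query over a unified shipment event log:
-- find which events most often precede a late delivery (names invented).
WITH ordered_events AS (
    SELECT
        shipment_id,
        event_type,
        occurred_at,
        LAG(event_type) OVER (
            PARTITION BY shipment_id ORDER BY occurred_at
        ) AS previous_event
    FROM events.shipment_events
)
SELECT previous_event, COUNT(*) AS occurrences
FROM ordered_events
WHERE event_type = 'DELIVERED_LATE'
GROUP BY previous_event
ORDER BY occurrences DESC;
```

Running that across every shipment at once is exactly the class of analysis that is impractical when the history has to be stitched together by hand.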
Josh Benamram: Very interesting. I think there is some slice of this that actually is motivating to the software engineering organization, but I'm just thinking about how I would go to our software engineers and say, we're going to start doing this, and what would motivate them. I think product analytics is probably the door opener here, especially for teams that are getting a lot of pushback from the software org. If, before this, the product analytics data coming into the data team is very raw and it takes a long time for the data organization to turn around new questions for the product org, like how is a certain feature being used, or what's our utilization on X, Y, Z, that might be a good door opener to say: if we have this more thought-through layer where we can get better data directly from the product itself through software, I'll be able to turn around these questions a lot faster for you. Maybe we bring this in, and then it can grow into other domains that are less related to the actual product. But it makes total sense that you've got to make the case to the business first. And it's interesting to hear how these tradeoffs get talked about, between shipping new features faster, which is how most software teams operate, versus getting more insight into the business, and finding that balance.
Chad Sanderson: Exactly. And that's the endless tug of war that's happening today. But the way I've been describing this internally at Convoy is that agile software development went through a similar lifecycle. Before, in the late 90s and early 2000s, you had waterfall. That was a safe way of deploying code, but it was really, really slow and there was tons of governance. Then you could move to a more agile methodology, but it was pretty unsafe: stuff was going to break, and you were going to accumulate tech debt. Until Git came along and said, all right, source control is really important; now we're going to give you the ability to version your code and do branches. Fundamentally, you're creating a light layer of governance. And with the addition of GitHub, you're bringing in more collaboration and peer review, which are things you needed in the waterfall world; you're just breaking them down to a lighter layer, right? You don't need 20 people in the loop hitting every single software architect anymore; now you only need the two or three on the team who understand how that service works. And I think that's the type of thing we need for data: a system of thinking like that, which lets us take a middle point between really hardcore governance ETL and very, very loose ELT with basically no governance.
Josh Benamram: Yeah, interesting. OK, so the medicine I'm pulling out of this for the core problem is: if you're a data team, draw down your data debt by thinking about these two layers of definitions within your org. One sits really close to the software team, the descriptive layer; the second is independently owned within the data organization, the semantic layer. That'll help you build more consistency across the metrics and data you're using and centralize more definitions, so there's less spaghetti, chaotic code or SQL being written by folks across the org, which is what accumulates this debt.
Chad Sanderson: Right. Yup. In summary.
Ryan Yackel: Well, man, that was a lot of stuff to talk about. I will say I have one thing to add, though. I really appreciated all the stuff you were talking about, Chad, about the freight systems, because back in 2011 or 2012 I worked for Macy's as a software tester on the big ticket delivery system. So I used to joke to people: hey, if you didn't get that furniture piece you ordered, it got delivered through JDA, and it's probably my fault because I didn't test it correctly, because that thing was so complicated. You had a scanner, you'd scan the merchandise, the merchandise would get put on a pallet, the pallet would move over here, you had to make it show up in the system, you had to schedule it in the standby system; it was like a billion steps to get something to your door. It was crazy. So even though you were very much in the weeds, I was having flashbacks to my days of testing the big ticket delivery system over at Macy's. So I think we're going to wrap it up. This was a really awesome talk, and I think anyone listening is going to take away a lot from what Josh recapped. Chad, how can people get in touch with you, man? Do you have a blog, LinkedIn?
Chad Sanderson: I do. You can look me up on LinkedIn or reach me by email. My LinkedIn is just Chad-Sanderson, and my email is [email protected]. Feel free to send over any questions you have or reach out if you want to connect; I love talking about these issues.
Ryan Yackel: Awesome, man. I appreciate you sitting down with us on the Mad Data podcast. Hopefully we can talk soon and see each other at the next Data Council conference, maybe. Who knows? But it was great to have you on the show, man.
Chad Sanderson: Thank you.