How Engineering Can Deliver Better Value For Data Science

Neelesh Salian is a Staff Software Engineer at dbt Labs, the creators and maintainers of dbt. Neelesh stopped by the MAD Data Studio to talk about how engineering can deliver better value for data science. Get in touch with him on LinkedIn or check out his Substack: hysterical.substack.com

About Our Guests

Neelesh Salian

Staff Software Engineer, dbt Labs

Neelesh Salian is an engineer by day at dbt Labs and a writer by afternoon. Previously, he worked at Stitch Fix, building a data platform for data scientists. Before that, he worked at Cloudera where he worked with Apache projects like YARN, Spark, and Kafka. He holds a master’s in computer science from North Carolina State University, focusing on cloud computing, and a bachelor’s degree in engineering from the University of Mumbai, India.

Episode Transcript

Ryan: Hey everyone, welcome back to the MAD Data Podcast. My name is Ryan Yackel, and I’m here with Josh over at Databand. We have a very special guest here, Neelesh. He is a staff software engineer over at dbt Labs. How are you doing today, Neelesh?
Neelesh: Thanks for having me, Ryan. It’s good to be here.
Ryan: This is like your second podcast in, like, two weeks, right? You’re like a popular guy now over on LinkedIn.
Neelesh: Nearly two days. Like, I recorded this yesterday. We did a live with this young data guy. So that was fun.
Ryan: Yeah, at this rate, man, you’re going to be famous all over LinkedIn. That’s how you roll with these podcasts: you just get on more and more of them, and it gets you reach. But Josh, how are you doing today?
Josh: Doing well, excited to have Neelesh on the podcast. I think there are a lot of interesting topics to go through today, and he’s had some pretty interesting travels in his career in the data space. So, excited to dive in.
Ryan: Yes. So Neelesh, as we talked about on our prep call, the main topic is really around how engineering can deliver better value for data science teams. You’ve worked in engineering for a while, with data science being one of your end customers, so to speak. But before we get to that, I wanted to ask about your background, how you got to dbt Labs, and maybe some advice you have for people who are coming up through the same path that you’re on right now.
Neelesh: Yeah, I don’t think I had a typical journey. I started off at Cloudera, which was a different world back then. It was mostly Hadoop and a bunch of other products that were shipped to enterprise customers. It was interesting to see how that industry was growing when I first came in, seeing the use cases that data was powering with the tools that were there, and how customers were actually using them to power analytics and business processes. So I stayed there for a little over a couple of years and popped over to Stitch Fix in early 2017. I joined this tiny team that was being formed while I was interviewing the previous December, and it had the mandate of building big data tooling for data scientists and forming that horizontal layer across the verticals that were the data science teams. A few things there were Spark, S3, Presto, and Kafka; a few members of my team owned sections of these, so the ownership was split. I stayed there for another five years, building better infrastructure and the data platform. I got a lot of experience with tooling and a bunch of interesting things there. So, a lot of lessons, and I’m really thankful for that role for teaching me so many things. And then, fast forward to 2022, I jumped over to dbt Labs. The interesting thing here is working on the architecture team to help solidify the architecture and scale it for the future. This is the cloud architecture. As we get more traction in the community, as we get more customers and build the product, how do we scale? How do we expand for the future and give that quality experience to the customers? That’s the mandate of my team, and it’s exciting. I’ve been here four months, so I’m learning a lot from the people inside and from the community. So it’s an exciting journey. 
So I’m looking forward to what’s in store. In terms of advice, I think it all depends on what you’re interested in. When you’re looking at the data landscape, it’s pretty broad. In my mind, if you want to venture out early in your career, look at two sides: one is closer to the data, and the other is closer to the actual tooling. Closer to the data would be analysis, data science, statistics, and the more ETL-heavy data engineering. On the other side it’s more data platform, distributed systems, and data infrastructure. If you think about those hard separations, it may be easier to map where not just your skills but your interests lie. Then you can play around and see, okay, I like the tooling better versus the data better, and then pick your poison and go for whichever track makes sense. You can do this early in your career; later in your career it might be a little more difficult once you’ve established some expertise. These days it’s hard to pick one of these because they’re such attractive careers that people might just say, okay, I can’t decide. It may be hard to find one role that gives you both, but I think there’s decent coverage in both. So if you look at both sides, you can take your pick and go with whichever makes sense for you.
Josh: As a software engineer at dbt Labs, are you focused on building the product itself, the product that ships to market? Or are you more on an internal team that supports processes within dbt Labs, like an internal data team?
Neelesh: Not exactly a data team. It’s more like an architecture team that actually scales the cloud business. 
So we as a cloud business are responsible for providing that enterprise level of quality in the experience for our customers and ensuring that it scales for the future. As we onboard a lot more customers and users and get more traction in the market, our product needs to scale accordingly. As we’ve built it, we’ve learned lessons and adapted to feedback from the community, which has been really strong in helping us along the way. So my team is responsible for asking, hey, how can we look at this two years from now, scale it better, add processes, and make this architecture a bit cleaner? It’s very much a software architecture team that is specific to making good decisions in terms of design, unblocking teams, and helping them be more productive. So it’s very internal in terms of how to scale the product, and of course that reflects outside to the users as well. Our entire goal is to make sure the product has the experience that users expect, and then adapt that for the future.
Josh: What’s it like for you going from a company like Stitch Fix, where you may have been working with tools provided by companies like dbt, basically as a consumer of data technologies to get your job done, to joining a company that’s actually producing one of those technologies?
Neelesh: I kind of went flip-flop. I started at Cloudera, which is a vendor, then I went to an end user, and now I’m back at a vendor, so I can talk to both sides of it. At Stitch Fix, those decisions did happen about build versus buy: whether to go with a vendor or build it yourself. Those mostly ended up as build. But at a vendor, you’re giving it to the community, and there’s a strong community behind it. That’s the good part. We have enough traction there and a lot of love from the community in general. I’ve been in the open source world for most of my career, I would say, because that gives you an insight into how things are actually used in the community and what use cases are driven by community adoption and usage. I’ve been in multiple communities, Iceberg, Spark, and things like that. dbt has had that traction, and I like seeing that. It’s a new gig for me, I’ll be honest with you. Coming from building tooling, there’s a different mindset that you go through as an engineer. In this case it’s more basic software engineering: how do you do architecture, how do you scale, how do you do system design. Not that those patterns don’t apply in the other case, but you’re looking at it in a different business context. Your impact is different, if I can put it that way. There, you’re building tooling that satisfies internal customers; here, you’re thinking about it more as a product and an experience. At Stitch Fix we didn’t have a product that would surface to businesses. It was internal customers, satisfying their needs and making them productive, versus in this case, it’s an enterprise product. 
You’re looking at how businesses would interact with it, make use of it, and power their use cases internally. So it’s generally a different problem to solve because the end goal is different. That keeps me excited, I think, because it’s a different challenge. I’m learning a lot, not just from my teammates but from trying out different technologies and seeing how we can scale in general.
Ryan: First of all, congrats on the position as well. It’s awesome. You’re part of one of the hottest data companies in the space today. When Tristan came on, we talked to him about the crazy growth that dbt Labs is on. And recently we had Sid Anand on, who was the chief data engineer over at PayPal. He was one of the ones who started on a three or four person team that built out the streaming service for Netflix, and a lot of what you’re talking about is what he was doing at those companies. It seems really exciting. You’re building infrastructure and building performance into an enterprise cloud product, which is a lot of fun. Everyone depends on that. It’s a fun space to be in.
Neelesh: Yeah, it’s a different challenge.
Ryan: Yeah, different times, but it’s fun. When I was talking to him about it, he was geeking out about all the low-latency things they’re able to do. He was all over the place, so he was loving it. Well, let’s talk about the topic today. I know you’ve spoken at multiple conferences, at Databricks and a few others, about how engineering can deliver better value for data science teams. And I know that you came into a groove on this over at Stitch Fix. I understand you guys were using Apache Spark and the other tools you mentioned earlier in the podcast. What were the problems you were trying to solve there, and what was the driving force behind wanting to talk about this topic?
Neelesh: It starts with the structure of the teams. Think of us as a horizontal data platform team that was responsible for providing tooling for data science teams and making them productive. If you look at the persona of the people who were our customers, they were data scientists who were very autonomous. You would consider them full-stack data scientists who were very capable at what they were doing. They came from different backgrounds: statisticians, astrophysicists, people from academia. These people have very core knowledge of what they do, so our job was to make them productive and give them the tooling they needed. They were the end-all, be-all of the business process. They would go to their business stakeholders, like finance or marketing or ads or whatever, understand what insights they wanted and what was expected for the business, and then write ETL and pretty much run the show from there. Our job was to make sure they had all of the tooling to be successful. That was the mindset and the rationale for the team I was part of. As for Spark, just before I landed on the team, they had made the move over to Spark. A few of my colleagues had done that. Early on, we used this tool called Netflix Genie; I don’t know if anybody remembers that. It’s the older orchestration platform that we used for Spark jobs, and we ran that backed by an EMR cluster. That was a wild ride before we abandoned Genie and built our own thing. So that was one example. We had Presto as well for ad hoc querying, so data scientists could just fire up queries and get results, but not necessarily write through Presto. We had this one guardrail that you could only write with Spark. Those were the two core technologies that I worked with: a little bit of Presto, but mostly Spark, I would say. 
And a couple of my teammates owned the Kafka infrastructure that we built out. One of my colleagues ended up joining Confluent because she was closer to that world. She and another colleague of mine ended up building that infrastructure, which was really powerful in making the messaging bus a ubiquitous thing. At Stitch Fix, getting data into one funnel and having it spread out to subscribers was really valuable, and that fed a lot of the use cases that we had before I departed. The value there was, I think, how do you make these data scientists productive and give them the tooling that is going to help them solve these problems? We didn’t write the ETLs. We just gave them the tooling to do that, if I could simplify it.
Josh: I’m curious, why data science? I imagine you were delivering to the analytics organization and maybe other business teams as well. Why pinpoint data science as the target area there?
Neelesh: That was the team structure, actually. It was just data science and data platform. There was no analyst, there was no data engineering, there was nothing else in our org. We didn’t even have a data engineering team until a few years after I joined. Data scientists were in charge of their data, their pipelines, everything. There was no other persona at all: no analyst, no analytics engineer, or data engineer for that matter. So our target audience was data science. Sorry, go ahead.
Josh: Well, I imagine Stitch Fix does and has had analytics efforts going on.
Neelesh: For sure, but we didn’t interact with them. Those were other teams, like finance, that had analysts themselves. But that was kind of a different org, not the one we had in the algorithms section. We essentially just had the bifurcation of data platform and data science, and then eventually data engineering. I can speak to that as well. But for the most part it was heavy on data science, and the ratio of data science to data platform was heavily skewed toward the data scientists. So our job was just to make sure that they had all the tooling and the foundation in place.
Ryan: I have a quick question on data platform versus data engineering, because I feel like at times those terms get used interchangeably. Where do you draw the line between data platform and data engineering teams?
Neelesh: It depends on the function. With data platform, I would consider that to be closer to distributed systems and data infrastructure building. So things like S3 and maintaining that, making sure the Spark clusters are maintained, making sure Presto is stood up as a service, those kinds of functions, and making sure you have client libraries to interact with all of that. All the tooling, basically, would come under that umbrella. I don’t know if it’s traditional, but the data engineering function that we had at Stitch Fix was more about curation of data, ownership of data, and making sure best practices for data were met. This includes the quality effort we did later on, which was part of that. One of my partners was a member of the data engineering team who ensured that we wrote tests, so that we had a meaningful definition of what we expect in the data. They were in charge of curation. They were in charge of making sure the right sets of data were provided to everybody. So they were literally owners of key pieces of data within the data warehouse. That was mostly what data engineering meant in our world back then.
Ryan: Gotcha. Okay. The reason we ask is that we sometimes run into teams who’ll say, hey, I’m data engineering for our data platform team, or I’m on the data platform team as an engineer, or I’m a software engineer on a data platform. There are all these weird combinations of ways of saying it, so I appreciate you drawing a line there to help us understand it. One of the things you talked about in our prep call, too, was that over at Stitch Fix you had this journey to build a metadata ecosystem, and I think that plays into how you’re able to deliver value for data science. Can you talk through some of the challenges, learnings, and things that you set out to do? I know there are some specific things that you did around a journal visor and data cleanser and data quality checker. Maybe that’s a little too deep; we’ll get to that in a minute. But tell us a little about how your team went about building this ecosystem.
Neelesh: Going back, slightly before I joined, we jumped into using the Hive Metastore. A lot of the inspiration came from Netflix’s use of the Hive Metastore, since it was the common metadata tooling that Spark and Presto and pretty much everything back then used. It works well with the newer ecosystem as well: even people who don’t use Hive still use the Hive Metastore. That’s the valuable artifact that comes out of it. So that was the key piece of infrastructure that formed the layer of structured schemas and discoverability of metadata for us. The challenges were: how do you make this metadata piece a bit more expressive to users as we grow and scale as a company? How do you make sure the infrastructure is robust? How do you make sure everything runs smoothly? Spark and Presto were hitting it heavily during the day, so how do you make sure this infrastructure stands? It stemmed from there: okay, we have the metastore, but how do we expand it and make sure that interaction and discoverability are also solved? That started off with one of the first REST servers we built, which essentially gave REST API access into the metastore. It took some plumbing to get done because there’s Thrift involved, which is the communication protocol to Hive, and then you build out that REST server. Then we had clients that you could use to interact with the metastore. It was as simple as create table: you specify the columns and the name, and you’re done. It would create a table for you, and that could be used anywhere in Spark or Presto. It’s already there, once you populate data, of course, since it starts as an empty table. That was the value it gave: people could interact with the metastore in a Pythonic way, if you will. 
If you do a get table, you get a dictionary describing the table, which is valuable for looking at what the table contains, checking the schema, and doing any kind of validation. That helped us in our internal processes as well, because that REST layer was really valuable for doing validation before writing into the data warehouse. If you write data, you need to make a lot of validation checks in terms of schema, making sure data types match. It provided that insight pretty easily: you could just do a get call and get the information. The other piece was built more along the lines of business value. I’m blanking on the use case. One of them was views, which was a use case of building out auto-generated tables if you need them. Imagine it’s kind of like materialized views, but we never actually called them that. We just termed them views. It’s like getting the latest partition of a historical table. Let’s say a table has been written to every day and partitioned by date, and you want just yesterday’s data, which is pretty much the freshest data you have. You could name the table with the decorator pattern of view, and you would get that table automatically generated for you, because the path would be changed behind the scenes. We had to write a proxy layer, which I ended up writing and owning, to do that calculation behind the scenes. Imagine doing that at scale when everything hits a single point in the metastore. That was the bigger challenge, making sure that it’s efficient. There was also some filtering logic we had to apply for test data. I can go into depth about it; that’s interesting as well. Isolation of test data from production was also part of the proxy we built.
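To make the two ideas in this answer concrete, here is a minimal Python sketch of validating incoming data against a table dictionary and resolving the newest date partition for a "latest view". It is an illustration only, not Stitch Fix's client: the dictionary layout, the function names `validate_schema` and `latest_partition_path`, the bucket name, and the `date=` path convention are all assumptions made for this example.

```python
from typing import Any, Dict, List, Optional

# Hypothetical shape of the dictionary a "get table" call might return.
EXAMPLE_TABLE: Dict[str, Any] = {
    "name": "orders",
    "location": "s3://example-bucket/orders",
    "columns": [
        {"name": "id", "type": "bigint"},
        {"name": "amount", "type": "double"},
    ],
    # Date partitions written so far, one per day.
    "partitions": ["2022-05-01", "2022-05-03", "2022-05-02"],
}


def validate_schema(table: Dict[str, Any], incoming: Dict[str, str]) -> List[str]:
    """Compare incoming column types with the table schema before a write.

    Returns a list of human-readable problems; an empty list means the
    incoming data matches the schema.
    """
    expected = {col["name"]: col["type"] for col in table["columns"]}
    problems: List[str] = []
    for name, dtype in incoming.items():
        if name not in expected:
            problems.append(f"unknown column: {name}")
        elif expected[name] != dtype:
            problems.append(
                f"type mismatch for {name}: expected {expected[name]}, got {dtype}"
            )
    for name in expected:
        if name not in incoming:
            problems.append(f"missing column: {name}")
    return problems


def latest_partition_path(table: Dict[str, Any]) -> Optional[str]:
    """Resolve the path a 'latest view' would read: the newest date partition."""
    partitions = table.get("partitions", [])
    if not partitions:
        return None
    # ISO-formatted dates sort correctly as plain strings.
    return f"{table['location']}/date={max(partitions)}"
```

Calling `validate_schema(EXAMPLE_TABLE, {"id": "string"})` reports both the type mismatch on `id` and the missing `amount` column, which is the kind of check that can run before anything is written to the warehouse.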
Josh: As you built these interfaces between the data scientists who were using the data and the layer of infra that platform built to enable them, how did you know how far to go? How did you decide on those layers of abstraction, that, okay, this is the right interface layer for data scientists to make their lives easier? We’re not going too far, we’re not constraining what they do, but we’re also hiding enough of the underlying infra that they don’t need to worry about that stuff. How did you find that balance?
Neelesh: I don’t think that was even much of a concern. We definitely got feedback from our data scientists. We had internal surveys that we used to do to talk to them and get a pulse on how things were and what was blocking them. We also used to have informal discussions about, hey, this is a weird pattern, can we do this better? That valuable feedback was helpful in designing those integrations, I would say. How much of an abstraction, like you said, is enough? It came from those kinds of discussions as well. We went with basically providing a Python client, in this case for the metadata, and I think that was sufficient. Then we built a UI that could surface it and help with discoverability. The front end team built that, but it was powered by this layer that we had built, so that people could just search. You wouldn’t expect them to fire curl commands; instead you could give them a UI that could be a little bit more specific. So that was one of the outcomes of that feedback: okay, let’s do this, with another team building the UI.
Ryan: I have a question about the surveys real quick. I have a really hard time getting sales to respond to my surveys when I send them out, or even the product management team. Josh, you get it. I always find that when you want to help somebody out, you want to make sure what you’re doing is providing value, so you think, well, let’s give them a survey and they can tell us. What’s your experience with that? Because I’ve gotten things back where they’re asking for the moon, and you’re like, okay, I’m not going to do that. I’ll give you a little replica of the moon, maybe, but I’m not going to give you that. How was that between the engineering team and the data science team? Were there problems with that?
Neelesh: We had a running sync meeting with data scientists as well, in addition to the survey. So we got a pulse of, not exactly complaints, but active feedback: okay, this is actively broken, something is affecting us and hurting our productivity. Surveys were helpful to get a pulse every six months: is this working for you? These are the initiatives we did; are you aware of them? We got some very pointed feedback sometimes: okay, this is broken, I don’t like this, there are bugs here, this could be done better. Some of that was channeled directly into GitHub issues. Okay, yeah, this is a bug, we’ve heard it from several people, now let’s go and solve this. This is an anti-pattern, this is something we can do better. That literally translated into issues and things that were more meaningful and trackable after the fact. But the usefulness came because we had that direct relationship with the data scientists. It’s not that we needed surveys to do this. I’ll give you an example: when I did three years of migration, I opened a Slack channel for anybody who wanted to join and chime in. Hey, you’re using it, just come and tell me if there’s a bug or anything wrong, and what you see. That was really helpful. I found two bugs that I could solve within a day because somebody reported them quickly, before it went out and affected others. So you can use different channels to be more effective. I think you have to get a pulse of who responds in what format and what works for different teams. I don’t know what your specific troubles are with sales and product, but I think finding the right balance of what questions might give you the right outcome, and maybe the right channel to ask them, might help. 
I think it’ll take some trial and error for sure. I think you feel that pain already.
Ryan: I think it’s always how you frame it, right? It’s how you frame the survey versus just doing the survey. That’s the big thing.
Josh: Was there anything that you feel like you overbuilt for the data science team? Anything that ended up being constraining for them, or where they came back and said, you know, we’d really like to get into the more raw material behind this, and you felt like it was taken too far from platform?
Neelesh: I don’t recall any instance like that. We were always lagging behind because we were a smaller team that was trying to keep up. Honestly, I don’t think we were ahead of the curve. We were always somewhat behind. We solved the use cases by trying to be proactive and understanding problems, but I think there was some level of catch-up we were always doing on issues and getting the right solution out there. When it did come about, it was rewarding. One of the things we implicitly did was, hey, let’s stabilize our Spark infrastructure, let’s make sure everything works. Nobody felt that, because everything just worked behind the scenes. It was the kind of plumbing that is supposed to be done well, but it was a pat on our back because we saw a lot of unnecessary errors drop off, like a lot of out-of-memory things, just because we set the defaults correctly. I won’t go into details, but some of those things were done implicitly behind the scenes, and those were rewarding in different ways. So, yeah, I don’t know of an instance where we went overboard. I would have loved that, but I don’t think I had the opportunity to over-deliver, I would say, in this case.
Josh: Makes sense. And as you were transitioning from the team, what kind of needs felt like they were just at the crest of your exposure there? What felt like the big tasks from data science that were coming down the pipe?
Neelesh: I didn’t have much of a pulse just when I was leaving because I was laser-focused on one specific thing, which was getting Iceberg rolled out at Stitch Fix. We saw promise there because of the cost it would save in compute time, among other things; I can get into that if you want. But I was focused on just getting that rolled out. I did a first phase of that and got the basics sorted, and my objective before leaving was: can I give all my teammates the information they need so they have the right path? I was almost the only one; apart from my manager, technically, we were the only two people heavily involved. So my objective was making sure they were productive and actually set for the next steps. My last couple of weeks were just transition meetings, knowledge transfers, and writing docs on where this needed to go and what the long-term vision for it was. I haven’t touched base since on how they’re executing, but that was my last bit of, okay, this is valuable for the company, this needs to be done.
Josh: I’m still ramping myself up on all of Iceberg’s different value propositions, because it seems like a layer that different kinds of teams can use to get a lot of benefit. It’s interesting to hear you mention cost savings, I think you said. Am I editorializing there?
Neelesh: No, that was one of the reasons. I’ll give you a simple example with Spark. If you point Spark at a data warehouse on something like S3, it doesn’t do partition pruning well enough to know the right files to pick out. It lists the whole directory, and then another layer goes and prunes that and gives you the result. That’s extra compute time that can be avoided. Iceberg does this one key thing well: it attaches metadata to the location of the data files themselves. So you have that mapping of where to look for a record, or, let’s use the word partition here, where to look for a partition. It tells the engine that, hey, this data file would be the place to go for that record. That’s really valuable. It saves compute time in aggregate: thousands of Spark jobs would save a lot of time just by doing that one basic pruning operation correctly. The challenge was just migrating our existing Hive tables over to Iceberg. And this was not just our problem. I saw the community folks on LinkedIn had this problem, and they had to build an abstraction layer to move from Hive to Iceberg. So it’s a legit problem, because it’s a new phenomenon that people are trying to adopt, and they have legacy code in this case. With the old-school way of doing things, migrating takes a whole different exercise.
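The difference between listing a whole directory and consulting file-level metadata can be sketched with a toy Python model. This is not Iceberg's manifest format or API; the file names, the `date=` layout, and the functions `scan_by_listing`, `build_manifest`, and `scan_by_manifest` are hypothetical and only count how many file entries each approach has to inspect.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical listing of a table partitioned by date: three days, two files each.
ALL_FILES: List[str] = [
    "date=2022-05-01/part-0.parquet",
    "date=2022-05-01/part-1.parquet",
    "date=2022-05-02/part-0.parquet",
    "date=2022-05-02/part-1.parquet",
    "date=2022-05-03/part-0.parquet",
    "date=2022-05-03/part-1.parquet",
]


def scan_by_listing(files: List[str], wanted_date: str) -> Tuple[List[str], int]:
    """Listing approach: inspect every file entry, then filter afterwards."""
    inspected = 0
    matched: List[str] = []
    for path in files:
        inspected += 1
        if path.startswith(f"date={wanted_date}/"):
            matched.append(path)
    return matched, inspected


def build_manifest(files: List[str]) -> Dict[str, List[str]]:
    """Metadata approach: record once which data files belong to which partition."""
    manifest: Dict[str, List[str]] = defaultdict(list)
    for path in files:
        # "date=2022-05-01/part-0.parquet" -> "2022-05-01"
        partition = path.split("/", 1)[0].split("=", 1)[1]
        manifest[partition].append(path)
    return dict(manifest)


def scan_by_manifest(
    manifest: Dict[str, List[str]], wanted_date: str
) -> Tuple[List[str], int]:
    """Look up only the files for the requested partition."""
    matched = manifest.get(wanted_date, [])
    return matched, len(matched)
```

With this toy data, `scan_by_listing(ALL_FILES, "2022-05-02")` inspects all six entries to find two files, while `scan_by_manifest(build_manifest(ALL_FILES), "2022-05-02")` touches only the two that match. Multiplied across thousands of jobs, that gap is the compute saving described above.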
Josh: Did your team have a priority around cost saving and finding efficiencies, and that led you to this selection of bringing in Iceberg? Or did the team check out Iceberg, look at all the momentum around it, and say, what can we use this for? And, oh, there’s this really nice use case that we think will gain us a lot of efficiency and save budget.
Neelesh: I mean, I was the one doing it. I evaluated Delta, Hudi, and Iceberg at some point in time to make that decision. The first time around, I said, hey, we are not ready, because we had other things going on, like stabilizing the Hive metastore and getting our Spark infrastructure sorted. So it was not the right time, but we did see value even back then. I’ll take you back a few years: I saw the original design doc for Iceberg that Ryan Blue did at Netflix. I met him at Netflix to talk about it, and back then we were not ready to adopt it because we didn’t have much of a stable metastore or ecosystem. In hindsight, if we had tried it, it may have benefited us in the long run. But Iceberg was not yet mature beyond its use at Netflix, so it was not the right time. Then, when we did jump in, we looked at, okay, how is this going to benefit us? And since this world is very closely tied to S3 and the Hive metastore, which is kind of what Netflix had trouble with as well, it made perfect sense for us to bring this in and introduce it. I think cost was not exactly an afterthought, but more like, okay, this is the one big benefit we can get out of this. And there are a lot of other things, like actual snapshots of tables: you can do time travel and go back in history in terms of changes. We used to do that in a somewhat rudimentary way. We had timestamps attached to subdirectories to indicate versioning. So it was somewhat of a version, if you want: any new data per partition would go into a new dedicated subdirectory. So we had some level of versioning, but in this case it had actual snapshots and states maintained for tables. So that was really valuable. I didn’t explore that immediately when I did it; the first task was just getting a table and getting all that infrastructure built. But I think it’ll be really valuable if you want backup and disaster recovery. And even if bad data gets in, you can always flip to an older version. That was not easily done in our world. So I think there’s value there as well.
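The idea of table snapshots, time travel, and flipping back after bad data can be sketched with a small in-memory toy in Python. The `SnapshotTable` class and its methods are hypothetical names invented for this illustration. They are not Iceberg’s API, and a real table format tracks snapshots through metadata files rather than by copying rows.

```python
class SnapshotTable:
    """Toy table that keeps every committed state as an immutable snapshot."""

    def __init__(self):
        self._snapshots = [()]  # snapshot 0 is the empty table
        self._current = 0

    def append(self, rows):
        """Commit new rows on top of the current state; return the new snapshot id."""
        new_state = self._snapshots[self._current] + tuple(rows)
        self._snapshots.append(new_state)
        self._current = len(self._snapshots) - 1
        return self._current

    def read(self, snapshot_id=None):
        """Read the current state, or time travel to an older snapshot."""
        sid = self._current if snapshot_id is None else snapshot_id
        return list(self._snapshots[sid])

    def rollback_to(self, snapshot_id):
        """Point the table back at an older snapshot without deleting history."""
        if not 0 <= snapshot_id < len(self._snapshots):
            raise ValueError(f"unknown snapshot id: {snapshot_id}")
        self._current = snapshot_id


table = SnapshotTable()
good = table.append([{"order_id": 1, "amount": 20}])
bad = table.append([{"order_id": 2, "amount": -999}])  # bad data gets in

table.rollback_to(good)  # flip back to the last known-good state
```

After the rollback, `table.read()` returns only the good row, while `table.read(bad)` still shows the bad commit for debugging. With timestamped subdirectories, the equivalent recovery means working out by hand which directories belong to which version.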
Josh: Right. Well, I ask because it’s interesting to hear how these kinds of new open source platforms get adoption across big companies and impressive operations like Stitch Fix data. It sounds like you had an instinct that the service could provide some benefit. You took a look at it, you tried it out in some different use cases. And it sounds like there was a cost improvement or performance improvement that came out of those experiments, which turned out to be the killer application. Is that a good breakdown of how that decision was made?
Neelesh: I would say so, because the POC actually covered a lot more, like, can it fit into the infrastructure as well? One of the other things that doesn’t get mentioned as often is how hard it would be to integrate a new tool into a current ecosystem. What would the pain be in bringing in a tool like that? And Iceberg, with all due respect, is really, really powerful, so bringing it in can itself be a task. So I looked at a lot of different things in terms of cataloging: do we need to build our own catalog to surface the information from an Iceberg table? With that in mind, I actually did a PR about improving the JDBC catalog that they offer in Iceberg. So you’ll see a PR from me in the community, but we didn’t end up using that, because it was another heavy lift to migrate stuff, and we’d have this dual cataloging and the question of how to surface that data. So it didn’t make sense on an abstraction level: how would you tell a user, or orchestrate behind the scenes, whether to go to an Iceberg table or a Hive table? So we kind of punted on that, and we made the metastore the center of the world, because it was built for that. Iceberg had a Hive catalog notion, so that was really helpful to add. But yeah, there were uncertainties that I called out when we were doing this experiment: hey, we don’t know how it’ll integrate with these things. I don’t know the streaming use case yet. I don’t know the other patterns it might help us with. So that’s another unknown that I kept open. But for these use cases, I think it’s perfect. It’ll help us do a lot of the right things in our infrastructure, and cost emerged as one of the bigger motivations in that argument. So I think it was easy to get buy-in from that respect: hey, this is going to save us a lot of time and effort. And that, I think, was the motivation that pushed it over.
Josh: Interesting. One last question, then I’ll get off the free commercial. Have you quantified at all how much cost savings or performance gain it’s actually bringing you in that use case? Or do you still need to monitor that and see what it ends up being? Or did you get a chance to do that with your last team before you made the transition?
Neelesh: The honest answer is I didn’t get a chance to do a cost evaluation, because we barely had test tables when I was building it. We didn’t migrate anything over. We were experimenting with a migration utility to move over existing tables and see the cost comparison. But in general, if you map query times, you need decent chargeback and decent metrics to understand how much an actual query would cost you, one to one. But there was definitely an observed time difference in terms of the return of the queries: if you do a select star or something for a table in Spark, sorry, a Hive table versus an Iceberg table, there was a noticeable difference. I can’t measure it accurately; you’d have to do a bit more stress testing and load testing to get an accurate answer. But there were visible differences between those two, because the experiment came from a large table that was populated daily, and accessing that was slower compared to this one. So that was the one experiment highlighting that this is valuable. But the bigger challenge, like I mentioned, was: can we get new tables, or even existing tables, into this so that everything can read well from it? And the other part of that was: can we get our interfaces to read from Iceberg correctly, or do we need to build more tooling? We had PySpark, so it wasn’t that big of a deal, but we had an abstraction to look things up from the metastore and use it. We did build something like what we do with the writer interface, and that made it a bit cleaner, because we had been thinking of this as a separate product and a separate initiative. And then we thought, okay, let’s just merge it into our current world, so it becomes more familiar and we don’t have to make breaking changes to interfaces and hand those to customers. So yeah, that measurement, I couldn’t do much of before I left, but at least in the early stages, it was evident that these tables would be better than the ones we had.
Josh: Lou, if you’re listening, we want a commission for this free case study.
Ryan: All right, Neelesh, we have three rapid-fire questions for you to close out the podcast today. It’s been really fun, and I know we got pretty deep on some topics. First question is this: if you are evaluating open source software and there’s a tie, do you just go with the name that sounds the coolest, like Iceberg? I used to. I feel like I’d pick the big open source project because of the name. I’m not saying you do. I was just joking, obviously. But have you ever had that happen before, where you were like, you know what, it’s a tie, which one’s got the cooler name?
Neelesh: Oh, man, I don’t know. Don’t judge a book by its cover; in this case, don’t judge an open source project by its name. There was an argument yesterday. One of my colleagues came to me and he’s like, what do you think about orchestration? What do you think about Airflow, Prefect, and so on? There was this thread that listed a bunch of them. I’m like, I don’t know. Unless you go and use it and actually see the pain, like, does it solve your use case, you’re not going to understand. You can theorize the whole time: okay, this might work well, this sounds cooler. But if you go one level deeper, does it actually help you? Does it actually solve the problems you need it to? And how hard is it to stand up in the general case? Would it take a mammoth effort just to get, okay, all my tooling talking to this? There are adapters or connectors that I’d have to build. That’s where you do the comparison. I think if you come to a point where two tools are both very efficient and do the right thing, I would choose the one that is a bit easier to adopt over the one that is just cooler for the sake of it. That’s where I think the rubber meets the road, because that calculation has to happen at some point so that you know the pain you will eventually go through. That’s where I think the bulk of the decision making would lie.
Ryan: All right. Second question: what’s one thing you want people to remember from today? If you could have them remember just one thing you said that’s going to help them in their careers, especially working with data science teams, what’s the one thing you would tell them?
Neelesh: One thing I highlighted was empathy, which is not called out as much as I’d like. It came to me as a lesson while I was doing the work, because empathizing with the problem, and with your users and customers, is important to know what you’re solving. So you understand that, okay, I’m actually resolving a pain, an actual problem that people are going through, or I can reduce some debugging time, or whatever the actual end goal is. Empathizing with that problem, and knowing that it would help somebody, gives you a different perspective while designing and architecting things as an engineer. And that applies in general. So I want that to be a lot more highlighted, regardless of discipline, whether you’re a data engineer or a software engineer. That’s where I think you’ll shine: when you understand a bit more and try to empathize with the problem you’re solving.
Ryan: Awesome. Yeah. And then last thing is just how can people get in touch with you? Do you have a LinkedIn, Twitter, Substack?
Neelesh: All three of them, actually. LinkedIn is there under my name, Neelesh Salian. Substack, I started writing fairly recently, a few months now. It’s called hysterical, spelled just like it sounds. And I started doing random tweets on Twitter recently, just to get a lot of data Twitter insights. I’ve learned a lot from there as well. A lot of people post very cool things. So I learned about Dot and Modal and some of these cool things that are happening in the industry. So yeah, I’m on all three platforms, so feel free to check out any of the content that I write.
Ryan: Awesome. Well, thanks again for coming on the podcast. I really appreciate your insight around what you’ve been doing and your time over at Stitch Fix, and now dbt Labs. Also, congrats on that new job you have there. I think it would be really great for our community to learn about how you approach the work from empathy. I think you mentioned you learned that a little bit later, but it’s a way to make sure you’re delivering value to the data science teams, getting their feedback, and being on the same page. So I really appreciate you being part of the podcast, man. We’ll talk soon.
Neelesh: Yeah. Thanks for having me. Nice to see you guys, take care.