Data Engineering Comes Of Age: Choosing A Path That Resonates With What You Love

Josh Laurito, Director of Data at Squarespace and editor of NYC Data, shares his observations and insights on how data engineering has matured from a previously nebulous identity to a clearer purpose today. Josh, Harper, and Honor take a quick stroll down the data memory lane to take stock of all the changes and learnings that are shaping data engineering into a distinct discipline. The explosion in data tooling shows no sign of stopping – what does this mean for data engineers, data scientists, and data analysts who are growing in their understanding of what they love to do and the part they want to play?

About Our Guests

Josh Laurito

Director of Data, Squarespace

Josh Laurito is Director of Data at Squarespace and editor of the data newsletter NYC Data.

Episode Transcript

Honor Hey, Harper, hey, Josh, how’s it going? How are you? So I wanted to welcome my co-host Harper, who is our data solution architect here at Databand, and Josh Laurito, Director of Data at Squarespace. Welcome to the show. Harper, I’ll let you say a few words about yourself.


Harper Sure. Hey, Honor, thanks for having me back on. Super excited to get into a new episode today. Josh, super excited to talk to you and kind of lay out some groundwork around the evolution of data engineering and how we’ve seen the industry change over time. But before we hop into where you think it has come from and where it’s going, tell us a little bit more about how you got into data management, how you got into working in data.


Josh Sure, sure. As mentioned, in my current role I’m the Director of Data at Squarespace, so that means I manage data engineers and data scientists for Squarespace, which is the best place to build your online presence. So I definitely encourage anyone who is interested in helping individuals succeed online to come check us out. We’re always hiring. I’ve been here for about three years, and I’ve bounced back and forth a little bit between data science roles and data engineering roles. When I started here, before I was in my current role, I was the director of data science and I managed a team of data scientists. Before that, I worked in media and I ran a data engineering team at GMG, which was called Gawker Media once upon a time. I was there through a very highly publicized bankruptcy involving a lost court case and Hulk Hogan, and I was there through that process and its acquisition by Univision. And before that I worked in various data-oriented roles at startups and in other industries. But in my current role I’m responsible for data engineering and data science, basically everything downstream from production systems that involves reporting, providing information to analysts and other decision makers, and building inferential systems and different metrics and ways of understanding our customers, our marketing, and how our customer operations advisors are performing. All that stuff kind of falls in my purview.


Honor Very cool. Well, thank you so much for being on. We were really curious about both of your takes, as data engineers, on how this space has matured over the last few years, and how what used to be, or what seemed to be, an amorphous idea of data science and data engineering has now gotten to a place of greater definition and clarity. I actually had a related conversation recently with Data Council organizer Pete Soderling, and he told me the founding story of his group, where it started with just a lot of data scientists and data folks going to him and saying, Am I a data engineer? Does this make me a data engineer? Are you a data engineer? So where are we right now? Putting that into context, I think there is now more clarity on what that is. So I want to get your thoughts on that.


Josh Sure, I can answer first, and then Harper, if you want to add to it, that’ll be great. Looking at this from a data scientist’s perspective, and looking back a couple of years, I think one of the big documents that was really formative in how people think about data engineering and how it’s changed was Engineers Shouldn’t Write ETL by Jeff Magnusson from Stitch Fix. This was a really, really popular post talking about how you don’t want to have your data scientists as these thinkers, the people who are coming up with the ideas, who then throw things over the fence to data engineers, as the doers, to implement their ideas in a scalable way. Instead, you really want to have data engineers as people who are building out platforms and tools more than anything else. And I think that was a piece that was really ahead of its time and has held up very well. I remember five years ago, coming from the data science side, there was a huge amount of hype about data scientists, and it was really hard to find people with the right skill set. And so data scientists were people getting pulled in from all sorts of different technical areas who didn’t necessarily know very much about computing or how to make a certain type of calculation very efficient. And it created this incredible demand for people who understood the computing side, engineers who understood how to build repeatable processes, who understood uptime and what an SLA was. And I think the growth of the data science industry really drove this demand for data engineers, and drove this demand to figure out how we make a broader set of data tools and tooling available to people who work with data downstream.
And so I think that since then, data engineering has taken a little bit of a different turn and has really grown a huge amount. But I think of that as a really key point, five years ago, where things started to change.


Harper I agree 100 percent. A lot of the things you said really resonate, especially when you talk about systems and how that relates to the data engineering field relative to the data science field. And coming back to Honor’s original point about the conversation she had with Pete, I always joke with people that the definition you’ll get depends on which data engineer you ask. You will understand very quickly whether they’re a data engineer who came from a data background or someone who came from a software background, right? And I think that’s totally valid. There are two different pathways to get to where we are, because I do find that data engineering is in its adolescence at this point in time. And as you mentioned, Josh, I agree 100 percent that data science became really popular and really hot, and everyone wanted to find the best way to do analytics, to make better decisions, and to use the data that they had. So they brought in these statisticians and data science boomed, right? And then suddenly they realized that all of their developers had just been moving data from one place to another and maintaining a warehouse. Well, we need them to do a little bit more, because data science needed better tools, better platforms, and better repeatability to be able to understand what their models were doing and make them more predictable. And then you see the developer having to apply software engineering principles and do robust data engineering. And then from that same point of evolution, you see software engineers starting to see that, OK, data is a much more relevant object in my system at this point in time. We have to be data aware, and we have to make sure that we’re applying best practices to data management.
And so the software engineers start coming from that DevOps field a lot and start looking at, OK, how can I apply the best principles that I know from object-oriented programming and functional programming, while ensuring that I treat data both as a stateful object and as something that is going to be valuable to the end user at the end of the day? And so you have this conglomeration of software engineers, developers, and data scientists all coming together. You know, we have the data ops movement that’s going on, and we’re seeing an explosion of tooling right now, too. And that reflects the needs that are there from the data science and engineering perspective, and also just the appetite for an improved process around that.


Josh Yeah, I absolutely agree. It’s funny you mention that you get really different definitions of what a data engineer is in different organizations. I have a presentation that I’ve given a few times internally at Squarespace, just trying to make sense for people of what a data scientist does here versus a data engineer versus an analyst versus an analytics engineer versus a machine learning engineer. Those are all different titles that we support here, and they’re all slightly different roles. It’s far and away the most popular presentation that I’ve given internally, because people are so confused about it and have such a hard time with it. But one of the things that I did when I was putting that together was I just went online to all of the different schools or places you can get training online, Codecademy or Coursera or Udemy or all of these different ones, and just looked for what their definition was of a data engineer versus a data scientist versus an analyst. And there’s no consistency at all. If you’re studying this stuff online and trying to piece it together without actually being in an organization, you get completely conflicting ideas of what different people do. And so I always start off my explanation with that bit, which I think makes people feel better, that they’re not missing something big, that they didn’t just miss that day in school. It’s just that no one really has a clear, consistent definition.


Harper Absolutely. I think it’s the best way to frame what a data team is at this point, because every organization, I think, recognizes that data is valuable to them. And how they are going to extract value from it is why we’re seeing this big evolution in data engineering and big growth in the data science space. I don’t know if you’ve been in this situation before, Josh, but at the place I was at previously, I was on a team working on an NLP project. I led the data engineering team, and I worked closely with the data science lead. Often he and I would sit down and discuss, OK, here are the needs for your team, because this is really what we need to happen with the data. And then whenever I walked away, more often than not, he was writing pieces of code for his data science platform that were as technical, if not more technical, than what I was writing for the data engineering side of things. Because our team, as the data engineers, were the stewards of the data as it came from source systems that we either controlled or didn’t control and into the data lake, making sure that we tracked the state, making sure that we could resolve data issues, and then eventually putting it through a transformation to provide to data science, which then really kicked off our analytical flow. So you worked as head of data. You said you were in charge of data science and analytics and engineering. How do the tools and the workflows of those different disciplines both look the same and also differ from one another at Squarespace?


Josh I’ll tell you, one thing I want to touch on is that you mentioned the word steward, and stewardship. We’ve been building out a data governance process inside of Squarespace. We’ve relatively recently become a public company, and so this is something that is getting a lot of attention. It’s gotten a lot of attention for a while, but it continues to get a lot of attention. Understanding who the owners are, and having people exhibiting strong ownership and taking a hands-on approach to the quality of the data, both inside of your data engineering team and among the people who are creating the data, is incredibly important. And I think there’s no real way to fix that downstream if you don’t have that involvement and investment upstream. You can try to encourage it and explain to people the value of taking ownership there and being responsible for the data and the quality downstream. But you really need to convince the people upstream, or have people upstream who understand, that they have an important role to play. In terms of the technical work, or the technical division of labor and tooling between the people who are ingesting the data, who are responsible for the ETL pipelines, and the users downstream, you know, the ecosystem of different tools has grown a huge amount. In some places, it’s become a lot more fragmented. I look at all the different ways that you can handle API integrations, the really broad set of new database technologies that are out there, and the broad set of new monitoring and tracing tools as places where there are a lot of options right now. And maybe ETL or ELT orchestration is going that way too.
I think people had really coalesced around Airflow for a long time as the most common open source solution. But now there’s a bunch of new tools out there that are really interesting and that are trying to find some places where they can improve on the Airflow model. But downstream, in the data science world, sorry, on the analytics engineering side, there are also some places where there are new standards emerging. There’s one that really is like, this is the way that you structure how your SQL is going to look, how you template your SQL, and how you make it reproducible and manageable for a large team on a data set.


Harper But before we jump into the technical aspect, on the tooling subject, I don’t know if you saw the recent post put out, I think by Matt Turck, the machine learning, AI, and data landscape that you shared recently, Honor, with that large infographic, the MAD?

Honor Yeah, the MAD landscape of data tools. I mean, I don’t know how many are actually in that graphic, but it’s enough that when it’s posted on a feed like LinkedIn, you can’t actually see any of them. It’s crazy. So I thought it was really brilliant that they call it the MAD index, because it truly is. It’s exploding.


Josh Keeping track of all this stuff is just a real challenge, really.


Harper It kind of ties into what’s a really popular topic around the modern data stack, where you have a lot of focus on all of the tools that are being implemented to work together with one another. But you lose the context around the proprietary Python scripts that you have to write. And that may actually be where you’re going with the technical aspect. But that’s one distinction that I see between the analytical side of the house and the more upstream side, where you have the data ops, data platform, and data engineering teams: a lot of times there isn’t a tool out there that you can just pick up and put into your system to address all of the needs that those upstream teams have.


Josh Yeah, that’s definitely been my experience. There are some really powerful tools out there that we work with and that can do a lot, but they have to be powerful because they have to support all sorts of different ways that people have built on top of different data stores, have structured access differently, or have different sizes or varieties of data, so they have to be able to scale up or down in different dimensions. And so just the vendor side, making sure that all those pipes are working with the vendors, is really a job in and of itself. And I kind of like,


Harper I don’t envy it, for the record.


Josh It’s a hard job. It’s definitely a hard job. I mean, you know, there are a couple of tools out there that just exist to pull in data from all the different APIs that you might want, like a Stitch or Fivetran. And even for those with a relatively narrow mandate, where it’s like, we’re just an extraction tool, all we do is extraction, just the number of different endpoints they have to handle, and the constant updating of those, becomes incredibly complicated and definitely more than justifies having a couple of players in that market, for sure.


Honor Do you feel that this degree of tooling, the depth and the specialization that’s happening now, and this, for lack of a better word, debate about whether you are a data engineer or a data scientist, so early on in the evolution of the space, makes us lose sight of what really matters, which is data quality? What does this data actually do? And going back to your earlier point about creating that sense of ownership around data quality throughout an organization, is it more important to establish that culture before being pedantic about what is what?


Josh I think it’s really important. You know, the cultural aspect is tough to change. I have noticed that there really is a difference in title, and people respond to it. And this is more about building a team than it is about building a pipeline or the technical questions. But I know that whenever we’re looking to hire data engineers and we post the data engineering job, we’ll get applicants, and whenever we’re trying to hire a data scientist and we post a job for a data scientist, we get inundated with applicants. So many people are interested in that role. And it’s funny to me because, you know, I think the roles are different for sure, but not wildly different. People are working in a lot of the same technology a lot of the time. But that signaling matters, and people definitely respond to it, for sure. So I’ve definitely seen that.


Harper Yeah, something I heard you touch on a little bit, Josh, was that you have these specialized tools coming into play, and they’re trying to capture a lot of complex information in a very succinct and standardized manner. And for me, that reflects the generally high-context field that data engineering and data management is. So if I were a vendor creating a new product, I can understand why I would make the choice to say, OK, I’m just an ingestion tool, because I’m going to focus on just this one part. There’s still a lot of complexity there. All the different APIs have different requests and formats that are coming back from them. But at least then it’s a manageable amount of complexity and context that I can show to my client. And I think that’s part of the reason that you see this explosion in tooling, because there’s just no way to abstract everything that’s going on out there. I can’t cover every use case. Right? And so I would say that also gets reflected in what you’re seeing. You mentioned the different posts, the titles that come through, and the signaling that comes across. I won’t speak for everyone, but for me, it’s intimidating to apply for an engineering position or something that is very technical to a particular stack. Because, yeah, I think I’m a data engineer. No, I don’t think, I know I’m a data engineer. I do things that I’m very confident in there. But the way that you use that tool, I’m not sure how you use that tool, right? Data science is highly technical, but it still feels like there’s 40 percent art and 60 percent technicality, if that makes sense. So people are more inclined to be like, Yeah, I can do data science. You get really interesting people, I bet, that apply through that route.


Josh Yeah, absolutely. And I think there’s that tension between the Unix philosophy on one side, of let’s do this one thing well, it’s hard enough to get that done, there’s a lot of complexity here anyway, versus the other perspective. Probably the person who I see talking about this other perspective the most is Erik Bernhardsson, who was CTO of Better and was an early engineer at Spotify before that. He talks about the ergonomics of data tools, cloud tools, and infrastructure tools, how painful they are and how they could be a lot better, and about trying to find something that is a little bit more batteries included or a little bit easier to understand for all the people who are capable of fiddling with the dials and optimizing systems, but really just want their systems to work and aren’t as interested in the optimization front.


Harper It feels like it’s only a matter of time until we see the consolidation start to occur. You can only cover so many cracks in the data landscape with a new application until there are no more niches and cracks to cover, right? And then hopefully people will start to understand that, OK, I need to give you the AA batteries. I don’t need to send you to a GitHub repository that happens to provide batteries whenever you plug it into your Python code. But I also really appreciate Erik’s voice in that conversation around the pain points of using the data tools. I love the conversations that start on Twitter, just making people think about how, OK, just because something’s hot and new doesn’t mean it’s actually going to be easier for your company to go in that direction. It’s funny, the build versus buy conversation is always something that comes up when you’re trying to build a feature or understand the best way to manage your pipelines and make something repeatable in your process. And a lot of times I come across these data products and I go, great, this would be something I’d really use. But is it actually easier? Would it actually save me the development time to not build it? Or is it going to be more painful and take more time to buy this and implement it than if I had just built it myself?


Josh Yeah, no, we deal with the same question all the time. Those are really two different questions: what tooling should we be buying, and should we be buying tooling at all? They have two totally different processes, or two totally different standards that you hold those things to. And sometimes you don’t even realize that you’re making the choice on whether you should be building this or buying this in a meaningful way. We’ve started being really intentional about those decisions, and before we take on any work, really writing down all of our alternatives and understanding, do we have to build this? And before we make any purchases, do we have to buy anything? And I think that’s led to some better outcomes on our side. But one other issue with that, though, is that as you use all these different tools, you get moved into this ecosystem of things that work really well together, or that you expect to work really well together, whether or not they do. And so I think once people have made a decision on build versus buy, it tends to reinforce itself to a certain extent. If you’re building parts of your stack, it may make more sense to build other things that work with that. Whereas if you are buying more things, you’re more likely to be on common standards that are going to be supported by other tools, and integration will be easier. And so there’s an incentive, or it’s incrementally easier, to continue to buy. And so that’s something that we worry about a little bit, and we try to make sure that we don’t get stuck too far one way or another.


Harper That’s an interesting trend that you highlight, because we’ve seen that kind of pigeonholing into a certain stack within the hardware space, right? It’s one of the reasons that Apple is so successful. The Watch, the iPhone, the Mac, all of those items work really great together, right? But if you go, man, I really want to use a different storage platform, it’s really painful to get out of that system. That’s nothing against them. It’s brilliant marketing. It’s actually really great product design, right? But it’ll be an interesting trend to follow over the next few years, to see if that same mentality ends up evolving inside the data space, because we see it occurring right now. This particular modern data stack all integrates really well together, but it’s not an open source standard. There is no standard way to do it. So you end up staying within those same tools that talk really well together, whereas you would have more flexibility and you could take a more holistic approach to the way you manage your data if you weren’t stuck in that particular area. That’s an interesting point. I hadn’t thought about that.

Josh Yeah, we’ll see if any of these companies start to integrate and buy up others. This isn’t a business podcast, but mostly I feel like we’ve seen more horizontal merging, you know, with all the BI tools buying each other and that universe getting a little bit smaller. You do see Google buy Looker, and that’s a little bit of, let’s integrate this visualization or business intelligence layer on top of the cloud provider. Maybe we’ll see more stuff like that in the future.


Honor So on the business side, I’m curious to get your perspectives on the health of the space as it matures and grows. What’s your vision? I’m not asking for predictions, more like what you think should happen that will actually support a sustainable advancement of data as a whole, or of data ops?


Josh Well, I guess maybe we should talk about what data ops is, so we know we’re talking about the right thing.


Harper Are you saying that’s just not a common term that everyone knows and everyone has a definition for? Can you give me your data engineer definition, Josh? I just want to make sure that


Josh To be clear, whenever I present on data engineering internally, what I’m usually talking about is the fact that you can’t split up people based on their tasks, because the tasks are too similar. Everyone writes SQL. Everyone cares about the quality. Everyone is evaluating different tables and writing data somewhere. Everyone complains about time zones. Everyone does a lot of the same stuff. And so the way we divide things up is really by where people are in relation to our primary analytical data stores. Data engineers are responsible for all the work done to create and maintain those stores, so they are more clearly on one side. Analysts, and people who are embedded into the functional teams we work with, are more on the side of taking the data out to support decision making: running A/B tests, putting together narratives around the data, and using that to drive action and decisions. And then data science is the one that’s a little bit in the middle. But really, that group is geared towards setting up the incentive structures and the metrics to make decisions and guide people. So if we are trying to improve our onboarding of new users, how do we evaluate whether we’ve done a good job with their first visit, or their second or third visit? How do we create some kind of metric or some way of assessing whether this person had a positive experience or a negative experience? And then we can use those indicator metrics or variables to evaluate the effectiveness of different changes that we’re making or of different approaches, or to create new data sets around sizing different opportunities.
When we’re deciding what types of markets we want to go into, what features we want to build, or how we can best allocate resources generally, that tends to be more of a data science question, of understanding how we infer these different things or these different concepts. So that’s what it means here at Squarespace.


Harper Hmm. It sounds like you all have been very intentional about the way that you’ve broken it up. But I wonder how much the data space in general, or data practitioners, think about taking the way that software engineering has evolved over time and trying to apply that to the data titles that you talk about. Because one thing I heard you say was that you can’t divide people by skills or even necessarily by process, right? Because everyone kind of has the same skill set and works with the same things. They have different perspectives. But then data engineers are upstream. They’re putting out those data sets. Data scientists are driving forward the understanding of what is needed to be able to do analysis, and also driving forward what type of analysis needs to be done, or done better. Correct me if I’m wrong there. And then analysts are coming in downstream of the data science process, making really strong visualizations and insights and answering questions that are coming out of that process. And I see that reflected in the software engineering space, where there’s a very broad distinction between front end and back end, right? Both of those distinctions came out as, OK, it doesn’t matter what you’re using, because you can use a different stack, but what’s the output that you’re going to provide and what are you going to focus on? Right? And so we ended up splitting software engineers into people that focus on the front end, and web development falls under that, or people on the back end, which is closely related to data engineering work. But again, they’re focused on just the APIs, the servers, those intangible assets that users don’t interact with.
And it’s kind of the same distinction that I heard there, where it’s really the output of that position that decides the way that you define that position, and then putting all those together kind of helps lead towards that data ops organization as a whole. Right. Yeah. What were you in as well?


Josh Well, there are some similarities and some differences. Whenever I think of the front end versus back end divide in traditional web software development, I think there’s a very clear environmental difference: one of them is primarily working in the client or the browser, and the other one is not. And not exclusively, obviously; anything on the front end has to get served from the server at some point, and that has to be packaged, and it has responsibilities and contracts it needs to engage with for the back end. But I feel like the difference between the data engineers and the data scientists and even the analysts is a little bit less clear cut, and it’s a little harder to divide people based on the environment that they work in or the technologies. It wouldn’t be weird here for an analyst to write an Airflow job, and it certainly wouldn’t be weird for them to go into Airflow’s UI and evaluate how things are running and then say, hey, something seems wrong, or keep track of some of the monitoring and raise their hand when something seems wrong. I think the thing that does seem similar is whenever we start subdividing a little bit more and we do things like, hey, where are we going to put analytics engineers into this structure, and how are we going to split our analytics engineers from data engineers, or analytics engineers from data scientists? Some of those questions of who’s upstream, who’s downstream, who’s depending on whom, those seem pretty similar to what web developers deal with, in that the analytics engineer is probably more clearly focused on the end model and is working to a large extent in domain-specific modeling languages and with very specific tooling.
Whereas data engineers might be working in kind of a broader general-purpose programming environment, or closer to bare metal. And so in some places, I definitely see that being very similar.


Harper Yeah, I appreciate that perspective. I think my joke about ambiguity kind of threw us off topic, so I do want to bring us back to the original question about data ops and such. It’s really good insight to see how you all have taken this approach to looking at what these roles mean and how they serve Squarespace. Does that help inform your idea of data ops? Why don’t you give us a brief definition and your thoughts on the adoption of it, and how that works for you all?


Josh Sure. We actually have a dedicated data ops team at Squarespace. We launched it a couple of years ago, and the intent was really for it to be an enablement team. If you’re familiar at all with Team Topologies, about how different teams are laid out, Pais and, I forget his co-author’s name, sorry, write that there are kind of four different ways that software teams can work. There is a stream-oriented team, which is the most common kind of team, which takes in requests from some sort of outside stakeholder and provides that information. We have teams of data scientists and data engineers like that, which get requests from a marketing team or a product team that say, okay, we need analysis on this, we need to make a decision on this, we need to figure out how to evaluate our marketing on X, Y or Z. There are complicated-subsystem teams, which are experts in one particular thing; a machine learning team would be an example, where they are just totally focused on one technique or one type of system and you lean on them. The third type of team is a platform team, which is providing endpoints and a platform for other teams to work off of. Data platform teams, I think, are pretty common, and manage compute resources and storage resources and things like that. And then the fourth type of team is enablement, which is a little bit more amorphous: a team that brings in best practices from outside and supports primarily stream teams with things that are complicated or outside of their wheelhouse. Whenever we created our data operations team, it really had that broader enablement mandate, where it says, you know, we have lots of things that our stream teams need, the teams that support different parts of the business. They need help with monitoring. They need to understand how continuous integration and continuous deployment work here.
They need to understand how to store their code and package their code up and repo it. They need help whenever the tooling has problems, and with how to diagnose issues that are not really in their wheelhouse. And this team ends up being a partner to all of those stream teams and helping them out. It’s a really, really broad mandate. Whenever we launched the team, I thought of it primarily as this enablement team that was going to be doing all of these things to help support operations broadly and help these teams be more effective, and bring in things like basic templating and skeletons for how we work with querying, how people spin up services, how people can professionalize some of their work. The first year we had this team, though, basically what it turned into was that we had this backlog of requests from infrastructure. They had been updating their systems for a long time and saying, hey, we’re moving to a new continuous integration and continuous deployment pipeline, we’re switching how we handle repositories internally. They were doing all these things to tackle their technical debt, but that was requiring people downstream to update their systems. And so the first year of having a data operations team, their entire mandate was basically treating infrastructure engineering as our fourth source of work, or fourth stream, and just taking care of all the stuff that they were saying they needed us to do to get onto the latest and greatest versions and follow best software practices. And in retrospect, that makes sense. There are lots of great things to say about data scientists, and I say this as a former data scientist and data science manager, but I don’t think good software hygiene is something that data scientists are particularly famous for.
And so having a team that’s there to support them and say, oh yeah, you need to upgrade to this newest version of X, Y and Z, or you need to migrate to this new deployment pipeline, was really helpful. It helped us catch up and move from being maybe a little bit behind the curve on using the best infrastructure the company had, and the best infrastructure that’s available, to being ahead of it and taking advantage of all that work. So that’s been kind of one surprising thing about the lived experience of having someone devoted to data operations for the last few years. Since then, they’ve taken on more of the stuff that we were planning on them doing: introducing dbt here and helping to teach others how to use it, introducing better monitoring, and helping us figure out how to build sensors into ETL and stuff like that. And so that’s been really valuable as well.

Harper A quick shout out to Matthew Skelton, who also wrote Team Topologies.


Josh Oh yeah, Matthew Skelton’s the author. Thank you.


Harper I didn’t want you to feel left out, Matthew. I obviously love you more than Josh does. You’re not kidding, I Google Trends.


Honor Well, I think that’s awesome. Everything you just said, Josh, really gives us perspective on what it looks like when you have that level of dedicated intention around data ops. As we wrap, I do want to bring the conversation to more of an action-step angle. I know that you mentor a lot of students, and that’s how you started your NYC Data newsletter.


Josh I used to teach. That’s right.


Honor So, yes, thank you. I imagine that you probably get this a lot, and this is something I also see a lot of on Reddit and pretty much any community forum: there is so much interest in entering the field. So if someone were to come to you and say, I’m at a crossroads, what should I do? Am I more of a fit for data science or data engineering? What do you usually say?

Josh I actually get this question a whole lot, and my answer has changed a lot over the last 10 years that I’ve been getting it. Ten years ago, my answer was: learn SQL and then go interview, and you will get a job, because people are so desperate to hire anyone who knows SQL. You’ll be fine. And that was right from 2011 to about 2013 or 14. About five years ago, I was pointing people to this piece by Vicki Boykis, a terrific data scientist and writer who works for Automattic. She had this piece, “Data science is different now,” and it talked about how it used to be that if you knew tech and you knew anything about statistics, you were in. Don’t worry about it, they’re so desperate for people. And that’s changed, not because there’s any less interest in hiring people like that, but because of the supply of people who have gotten excited about this job and taken courses on it. Now there are data science and data engineering undergraduate courses of study, and people can get a degree in data science and engineering, which was unheard of back then. Yeah, totally new. And so my advice to people was to focus on your domain, the thing that you know, whether it’s marketing or building widgets or shipping and logistics or whatever it is. Whatever problem you have has a quantitative and data engineering component to it, so learn that. I think that was really good advice for a long time, and it still is really good advice. But now, if someone were to come to me and say, I’m thinking about data science or data engineering, what should I do, or how should I move forward, I’d probably encourage them to just build something. It’s become, not easy enough, but the tooling is so good that you can really do a huge amount with relatively less time with your hands on the keyboard.
And you can build some really interesting projects just to test out if it’s the right field for you, and which aspects of the project you like. You know, we used to have people do capstone projects whenever I was teaching, and we’d have people look at all the tax lots in Philadelphia and see who was paying their taxes on time and who wasn’t. And so it’s like, did you like the challenge of getting that data in and moving it, and figuring out how to make that fast and efficient and handle the volume of it? Maybe data engineering is for you. Did you like the challenge of normalizing it and improving the quality of it, and taking lots of different sources and making them work well together? Maybe you’re going to be an analytics engineer. Did you like the last step of analysis and visualization? Maybe data science is more your field. Or did you like telling the story and building the narrative around it? Maybe data analysis, being a data analyst, and that world is the right thing for you. I think giving people a chance to play around with all the different tools and do different things and find which parts they really gravitate towards is probably the best approach.


Honor Basically rooting it in their own passion, what they are naturally going to feel inclined to do. Harper, what about you? What would you say if someone came to you and asked which direction to take?


Harper So, I mean, I was actually that person 10 years ago who was asking just that question, and I absolutely got that same advice: learn SQL and someone will hire you, right? And that’s kind of how I got here. I think where Josh landed there at the end, where he talks about which aspect of data management you find most interesting, is right. Caring about efficiency really leads to the engineering side. Tackling big problems, looking at interesting things, testing and applying science, and I hate to use the word to define the word, but applying the scientific method to analytical evaluation of data, that is going to lead you toward that data science role, right? And then being able to wow people with the answers that you provide, or efficiently giving answers to those questions, leads to that analytical side. I think the earlier advice kind of influences that perspective, because it still focuses in on that domain. As far as my own personal answer, when people ask me that question, I still point people to Automate the Boring Stuff with Python. Because I agree with you, Josh: you just need to start building, right? A lot of people coming in here may know more about computers and programming than they think they do. But you really have to just start getting comfortable and being confident in doing that work. The reason I point them there is that it’s very easy to get started, but you work through a lot of different workflows from there. I still recommend that people learn SQL and learn some sort of object-oriented language; Python is obviously very popular in data management. And then get familiar with orchestration, whether that’s just running a cron job or getting into Airflow or one of the other open source tools like Dagster or Prefect.
Understand what that means, because all of those skill sets can apply to the entire data space. Platform, engineering, science: any of those people could work with any of those three platforms. Once you get comfortable with that, and you start getting involved in those communities and having those conversations with people, you’ll eventually find yourself identifying what it is that you enjoy about that process, and then you can specialize from there. I mean, I would even recommend that people take two years to learn what data management means before they worry about being a data engineer. Right? Just become data literate.


Honor That’s right. That’s great advice. Yeah, I really like that. And I think that, I mean, especially considering how fluid the field still is, it’s not like you get locked into one thing and you can’t escape. You can always change your mind later on. So I want to thank both of you for this conversation. I’m going to go ahead and wrap here. Josh, thank you so much for coming on and sharing your insights on the evolution of the data engineering space. And of course, Harper, I always love hearing about your background as well. So thank you again.


Josh Thanks so much for having me. Really appreciate it.


Harper Great conversation, Josh. Thanks.


Josh Shout out to Matt Skelton, whose name I forgot earlier


Josh But thanks, bye.


Suggested related links: