How To Reduce Data System Complexities

Identifying the key factors of data system complexities and how to reduce them is critical for any company. Joseph Machado, Senior Data Engineer at LinkedIn, shares his insights on ways to keep data systems from becoming unwieldy. In addition, Joseph offers tips for data teams to manage their data warehouses and keep data pipelines running reliably.

About Our Guests

Joseph Machado

Senior Data Engineer, LinkedIn

Joseph Machado is an experienced Data Engineer, with deep expertise in distributed systems, data engineering, API design, data integration from multiple sources and machine learning. He holds a Master’s degree in Electrical Engineering from Columbia University and manages a data engineering newsletter that aims to help people land their dream data engineering job. Start Data Engineering is a newsletter with tutorials, data design patterns, open-source tools, and techniques used by data-driven companies to help others become better data engineers.

Check out Joseph’s newsletter @ StartDataEngineering.com

Episode Transcript

DISCLAIMER: The following is the output of transcribing from an audio recording with the use of AI. Although the transcription is largely accurate, in some cases it is incomplete or inaccurate due to inaudible passages or AI transcription errors. It is posted as an aid to understanding the proceedings of the meeting, but should not be treated as a valid record.

Ryan: What’s going on, everybody? Ryan here, back with the MAD Data podcast with Databand. I’ve got a very special guest on the line today. As we talk about ML and AI and all things data, we’ve got Joseph Machado on the line. Joseph, what’s going on, man? We took a while to schedule this podcast. There’s a good reason for it, though. How are you doing, man?

Joseph: I’m good. I’m good. Thanks for having me. I know it’s been a while, I think a couple of months now. It’s been my on-call and then paternity leave, so I’m glad it finally worked out. I’m glad to be here.

Ryan: Yeah, well, I mean, it’s awesome. Dude, you have a new child. You have a new baby.

Joseph: Thank you. Thank you.

Ryan: You know, senior data engineer over at LinkedIn, and now you’ve got to be a senior baby dad too. You have two jobs now, for the rest of your life.

Joseph: Yeah. Exciting and also tiring.

Ryan: So today we’re going to be talking about how to reduce data system complexities. But before we get into that, I did want to ask you a little bit about yourself. The audience always loves to hear about our guests. Tell us a little about how you got into data engineering. What was your path to this? And obviously you also want to talk about some of the stuff you’re doing on the side, which is around your data engineering website. So go ahead, give us a little background on yourself, man.

Joseph: So I started off as a data scientist. I was doing some analysis, I think it was fraud detection, back in the day. But I got pretty tired of all the meetings and having to make presentations, and I figured out I was more interested in the engineering side of things, like building data pipelines. That’s how I started off. From there I got some experience with the big data tools of the day. Back then it was things like MapReduce, and then I slowly moved on from that to Airflow. And then I went on to do a little bit more of what’s called the modern data stack, like Snowflake and dbt. So yeah, that’s been my progression. I think that’s a pretty common path, data scientists or data analysts going into the data engineering world. Right now, I’m a senior data engineer. Again, it’s a similar job, working with data pipelines. The complexity of data is much higher here, especially because the team I am on works with different businesses and tries to bring them together, which is always a tough problem. So that’s what I do right now. On the side I have a blog called Start Data Engineering dot com. I started it during the pandemic. I was like, okay, I got back a few hours of transit time, so what do I do with that? So that’s why I started it, and I wanted to write something that actually had actionable stuff, if that makes sense. So either a piece of code that actually works or a project that someone could actually do. I wasn’t interested in thought-piece style articles. There is an audience for that. But what I wanted to do was give people something actionable: after reading a blog post, they should be able to do something different at their job or do a new project that they didn’t know how to do before. So it’s supposed to be actionable. That’s what I strive for with my blog posts. Yeah.

Ryan: Now, I’m all about practicality as well. I mean, obviously I’m in product marketing, and so I see a lot of things that are said on LinkedIn or Medium blogs or Substack, and I just kind of roll my eyes at times, like, okay, what do you want me to do with this? This is all theory and high-level strategy. It’s not practical. You’re giving me no examples. You’re not allowing me to really take to heart what you’re saying. So it’s awesome you’re doing that, because I think that’s definitely something people appreciate: giving not just advice, but practical advice.

Joseph: Yeah, a lot of the blog posts, I’d say, are too vague or too broad to be actually helpful. So that’s what I try to do.

Ryan: Yeah, man. Well, for those listening to the podcast, you may see a slowdown in his postings because he has a baby now. So give him some space, all right?

Joseph: That’s funny, because I try to post at least once every two weeks. But my last one was about a month and a half ago.

Ryan: I get it. We talked about this before we started the podcast. I’ve got two girls, six and five. I mean, it’s challenging, but they’re great. All right, so let’s get to the topic today. We’re going to be talking about how to reduce data system complexities. When you think about data system complexities, you’ve talked about them in three different areas. Walk us through these areas and explain how you would go about discovering and navigating each of them as it relates to data complexity.

Joseph: Yeah, so those three areas are kind of broad. First, you can think of the stakeholder systems. These are your Looker dashboards, your Tableau dashboards, or in some cases APIs, where you are serving data through APIs. Then the main system is the data warehouse. It can be a data lake or a data lakehouse, whatever you want to call it; it’s basically where your historical data is stored. And then the data pipelines: how we actually build the pipelines and what the common issues are. There are a lot of issues you can run into while building data pipelines. It can get significantly complex unless you’re really mindful about it. So those are the three main verticals I want to talk about.

Ryan: One question on that. I noticed the way that you structured it. Maybe this is like a Freudian slip here, but it was interesting, because you started with the stakeholder systems first, which is more on the right side. Then you said the middle, and the third is the left, which is where the data starts. Is there a reason why you view those data complexities in that particular order?

Joseph: So that’s a really good point. I didn’t really think about doing that, but that’s how I think data engineers are supposed to think. You start from your end product: do you want it to be a Looker dashboard, or do you want it to be an API? And then you work backwards. That’s why it goes stakeholder, warehouse, pipelines. I mean, obviously the warehouse modeling is crucial, but you also have to think about what your end goal is. If your end goal is just serving data to your actual client, you probably don’t even need a UI tool. So that’s why I go stakeholder, warehouse, pipeline. You don’t want to ingest everything and then figure it out. You want to figure it out first and then bring what you need into your data warehouse. So yeah, good catch. I didn’t think about that while writing it.

Ryan: Yeah, no, I like that though, because you’re looking at it from what the business objectives and business goals are, right? At the end of the day, the stakeholders are the ones that we really care about. We talk about all these complexities, but at the end of the day, whoever’s at the very end of that, they’re the ones who are going to be raising it. So tell us, what are some identifying factors that would indicate some of these data system complexities?

Joseph: Yeah. So I have a few points here, but the main one is if your pipelines are finicky: you make a small change and everything breaks. Engineers might know this. You put up a PR, you miss a flag or forget to add an entry in a list somewhere, and your whole pipeline fails. Those sorts of things are the identifying factor. I’ve worked in a couple of places where if you don’t tick all the boxes, it’ll fail. And the tricky part is that it’s hard to automatically test those things. Sometimes a new engineer comes in and misses it, and even people who have been working on it for a while will sometimes forget. So those brittle systems are the key factor. The other point is that your data pipeline will break; it’s inevitable. But when it breaks, if it takes more than a few minutes to identify why it broke, that’s a sign of complexity, and usually unnecessary complexity. Unless your business is multiple organizations working as one, if you’re in a simple, straightforward organization, it shouldn’t take more than 10 to 15 minutes to figure out why it broke. I can give you an example of complexity that I created myself. We had a system where we were validating some data. So I built an API endpoint that validates the data. It uses a library called Great Expectations to look at the data, see what the distribution of the data is, and put it up in the UI. So I built this complex system where you trigger an API, it hits a Kubernetes cluster, opens up a cluster, puts up some data, and then I realized I did not even need this complex setup of spinning up a new Kubernetes cluster. All I needed was a simple API; it took like 2 seconds. But I over-optimized upfront, causing a lot of confusion for other engineers. So, learning from experience: try to do the simplest thing before you optimize it. Anyway, the other identifying factors: your pipeline breaks and it takes a long time to figure out why, and your end users identify data quality issues before you do. It happens sometimes, but if it happens frequently, there’s something wrong, and you need to add some testing. The worst thing that can happen is that the end users use your data to make some decision and then realize the data is wrong. That causes a whole slew of downstream issues. Sometimes companies lose tons of money on that, so you want to be careful. And then there are permission issues. If you create a new data set, it should automatically be properly permissioned. Sometimes it isn’t, and it adds a lot of complexity and time to market, if you will.

Ryan: On the end users noticing data quality issues, I think I remember there was a report from Fivetran, and it basically said that something like 71% of all business decisions are related to dirty or error-prone data. So 71% of the time when you are making the decision, 77% of the time, there’s an issue with the data, which is kind of wild. I was like, whoa. I’d have to make sure that’s the actual quote, but I’m pretty sure that’s what it was. And there’s something crazy like 85%, I believe, who also make bad decisions that will impact revenue, which is something you just said, too. So there’s a lot of data out there that shows data problems are quality problems.

Joseph: And that’s one of the reasons data quality tools have become really popular: people realize it’s a hard problem to solve. It’s not straightforward either, because identifying data quality issues depends a lot on what your business KPIs are. If you define a metric a certain way, you have to write tests for that. So that’s why data testing is getting pretty big these days. Another sign of data system complexity is that, as an engineer, you cannot test locally. If you can’t develop locally and can’t test locally fast enough, it’s going to be really difficult to iterate fast, deploy faster, and release features.
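As a rough illustration of the kind of data test Joseph is describing, the sketch below checks a small batch of rows before it would be published to end users. The column names (`order_id`, `revenue`), the sample rows, and the `check_orders` helper are hypothetical and not from the episode; in practice a team would more likely express these rules in a tool such as dbt or Great Expectations.

```python
from typing import Dict, List


def check_orders(rows: List[Dict]) -> List[str]:
    """Return human-readable data quality failures (empty list if the batch is clean)."""
    failures = []
    seen_ids = set()
    for i, row in enumerate(rows):
        order_id = row.get("order_id")
        revenue = row.get("revenue")
        # Rule 1 (assumed): every row needs a unique, non-null key.
        if order_id is None:
            failures.append(f"row {i}: order_id is missing")
        elif order_id in seen_ids:
            failures.append(f"row {i}: duplicate order_id {order_id}")
        else:
            seen_ids.add(order_id)
        # Rule 2 (assumed): revenue must be present and non-negative.
        if revenue is None or revenue < 0:
            failures.append(f"row {i}: revenue must be a non-negative number")
    return failures


if __name__ == "__main__":
    batch = [
        {"order_id": 1, "revenue": 20.0},
        {"order_id": 1, "revenue": 15.0},  # duplicate key
        {"order_id": 2, "revenue": -5.0},  # impossible value
    ]
    for failure in check_orders(batch):
        print(failure)
```

Because the check is plain Python over in-memory rows, it also runs locally in well under a second, which is the fast feedback loop mentioned above.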

Ryan: A question on the testing part real quick. I know you have a lot more to say, but I did want to ask you about testing. Is there a point where you think there’s a kind of risk versus reward trade-off in terms of how much testing you can inject into your data process without having to think about, okay, is this going to slow anything down? Or, will I have to maintain these tests over time? Do you have a certain way you think about how much testing is enough versus which testing is maybe overkill?

Joseph: That’s a good point. The way I typically think about it is that it depends on your data pipeline. If you’re writing it in Python, it’s easy to write unit tests and integration tests, but if you’re writing it in SQL, it’s really hard to write tests for SQL. That’s why dbt is very popular, because it makes writing tests easier. With regards to a lot of tests versus not enough tests, it’s kind of like the whole TDD debate; people prefer one or the other. I tend to lean towards a little bit more testing than not, but it’s everyone’s preference. But I do understand what you mean. And I think, if done the right way, testing does not need to be super painful. It can be seamless, but you need to invest time and effort in setting up that framework properly. And again, that’s why dbt is popular, because it sets that up for you and you don’t have to worry about all these infrastructure issues.
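To make the point about Python being easy to unit test a little more concrete, here is a minimal sketch. The transformation `add_order_total` and its field names are made up for illustration; the only claim is that a pipeline step written as a pure function can be tested locally without any infrastructure.

```python
from typing import Dict, List


def add_order_total(items: List[Dict]) -> List[Dict]:
    """Pure transformation: add a `total` field (quantity * unit_price) to each item."""
    return [
        {**item, "total": item["quantity"] * item["unit_price"]}
        for item in items
    ]


def test_total_is_quantity_times_price() -> None:
    result = add_order_total([{"sku": "A", "quantity": 3, "unit_price": 2.5}])
    assert result[0]["total"] == 7.5


def test_input_is_not_mutated() -> None:
    items = [{"sku": "A", "quantity": 1, "unit_price": 4.0}]
    add_order_total(items)
    assert "total" not in items[0]


if __name__ == "__main__":
    # The same functions are also discoverable by a test runner such as pytest.
    test_total_is_quantity_times_price()
    test_input_is_not_mutated()
    print("all tests passed")
```

Keeping the transformation free of I/O is a design choice: reading from and writing to the warehouse can live in a thin wrapper, while the logic that is most likely to break stays cheap to test.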

Ryan: Okay. So you talked earlier about multiple tools and permissions. The next one I see here is forgetting a flag. What does that mean?

Joseph: Yeah, that’s what I spoke about in the first one. If you’re making a change, there are certain pipelines where, let’s say you’re just ingesting a new CSV, you’ll have to write a file and then you’ll have to add an entry somewhere. I don’t know how to explain this correctly; it depends on the system. But if your pipeline is not well written and not modular, and you have to make changes in multiple places for one pipeline to work, that’s typically an indication of complexity. I’ve been in places where you have to make changes in three places for a pipeline to work, or three repositories, to be more specific. You typically don’t want to do that, because you have three repositories and then you have to deploy them in a certain order, and it just slows everything down. But I have seen that in multiple places, and the reason is that in data engineering the logic and the scheduling layer are usually separate. So it depends on the pipeline, but when you notice you have to make changes in multiple repositories or multiple places, it’s typically a bad sign. And the last one: when things get too complex and the development process gets slow, engineers get frustrated, and when the leadership is not open to change, people just quit. I’ve seen that time and time again.

Ryan: I think that holds true with every profession. You get burned out, you’re tired of doing it, and you’re like, all right, I’m out of here.

Joseph: Yeah, exactly. And it’s a tough problem, right? In the tech industry, people only stay for like two years, and in two years it’s hard to have built a cohesive vision for your data platform. It depends on the leaders, and they should have a cohesive vision. But sometimes people have their vision, they try to implement it, and then they leave halfway. And then the next person coming in wants something else. Now you end up with a mess, because the way of doing things has been changing so much. You don’t have a well-oiled machine.

Ryan: Yeah. That’s the case in things like security as well. At a company I was at, that was a common problem. We were a public key infrastructure, PKI-as-a-service company. And, you know, PKI has been around forever, right? And so with that expertise, if somebody leaves while they’re trying to do something around certificate management or trying to maintain their existing PKI, everyone’s like, well, how do we keep this thing going? Who’s going to pick this up? You have all this really rich knowledge, and if somebody leaves like that, it’s like, okay, we’re back to zero right now. So I totally get that. Especially engineers. You know, engineers are hot commodities, and it’s enticing to move in your career, definitely.

Joseph: If you’re an engineer, it makes sense to move as well. You get so much more money than just staying. I mean, yeah.

Ryan: That’s like the no-brainer one. Yeah. And that’s what’s so weird, and this is a little sidetrack, we do this a lot on our podcast, but that’s actually an interesting point. Everyone talks about how it’s easier to grow an existing customer than to land a new customer. Everyone understands that, right? It’s the same with employees. It’s way more beneficial to grow the people in your current organization than to have them leave. Then you’ve got to backfill that person, and then you have to train them. Companies are seeing this a little bit more; they’re starting to put more retention allocation toward their current staff versus net new hiring budget for new positions. Look internally first and look to retain those people, because, I mean, what’s going to happen? They’re going to ask for $10 more? Okay, great. Well, if you don’t give them that $10, they’re going to leave. And guess what? The next person you hire will probably ask for $10 more as well when they go through the interview process. So it’s like, well.

Joseph: I don’t understand that either, to be honest, but I guess business is business.

Ryan: So that’s why we should be ahead of it. All of our engineers should rule the world.

Joseph: But I don’t know about that.

Ryan: That probably wouldn’t work out.

Ryan: As we talked through this, you laid out these three systems, right? The stakeholder systems, the warehouse systems, and the pipelines. And then we listed some of the identifying factors that would tell you, hey, there are all these complexities going on and you may want to pay attention to those. Can you walk through each of those three, and we can talk through what challenges each area may bring?

Joseph: Yeah, sure. Let’s start with the stakeholder systems. These are usually your Looker, your Tableau, what else, Metabase and Superset, all those sorts of things. And the people who use these range, right? They range from business people who don’t write SQL, who just drag and drop the different fields in the UI, to people who actually write SQL. Data analysts and business intelligence people tend to write more SQL. Those are the personas of people who use these systems. What happens is, as people start using them, they create their own dashboards and share them across leadership, and then someone else creates another dashboard. But then their metrics differ. A simple example: someone defines percentage change as A minus B by A, and someone else defines it as B minus A by B. Which would you consider the percentage change? That simple difference means the same metric, which is supposed to be the same number, is now different. And then people tend to invest a lot of time investigating why this is happening and where the data issue arose from. I’ve seen that in all the companies I’ve worked for, and it’s very time consuming and difficult, because what I gave is a simple example. Now imagine layers and layers of this complexity on top of each other, and management saying, oh, this is super important, we have to get the right answer now. You’re under a lot of pressure, you don’t know what’s going on, and you didn’t write this code. So it gets really complex. Whenever you push a metric definition to the BI layer, it becomes really difficult to manage. That’s why these things called metrics layers are coming up. I don’t know if you have heard of that.

Ryan: So like, you know Max from Preset?

Joseph: Max from Preset. Yeah. Yeah. He’s the one who wrote Airflow.

Ryan: Yeah, yeah. He’s Airflow Max. Yeah, he’s been on our podcast twice. He talks a lot about this topic, the metrics and semantic layer. He’s a big thought leader in that space.

Joseph: I mean, yeah, he runs Superset, I mean, the open source version of Preset, so that makes sense. As you might know, the metrics layer is becoming a thing. I think Metabase is bringing it up; it has its own metrics layer. I think it’s still in beta; I’m not sure if it’s released yet. So everyone is trying to get in on that. Basically what it will do is take the metric definition away from the business user and move it to the engineering side, allowing business users to just use that data instead of having to create it themselves. When you create it yourself, that’s when things get complex. This also benefits engineers, because sometimes application developers and engineers use the same metric. Now they don’t have to write their own definition again; they can just pull it from this metrics layer. So that’s something I see happening across companies, realizing you need a single source of definition instead of having it in multiple places. So that’s the usual complexity in stakeholder systems. The other thing is that most business intelligence tools have their own kind of SQL dialect. I don’t know if you know, Looker has LookML, which is very complex, or not very complicated, but very confusing for someone from the SQL side or the engineering side to understand. Ultimately I feel like the goal of every vendor is to bring more usage to their system, and they keep adding more and more features. Sometimes it’s not beneficial to us. So we need people who say, hey, no, the stakeholder system is not supposed to do this; we’ll do it in our layer. That’s where I think strong engineers, people who actually know all these systems, how they work together, and can project how it’s going to be used in the future, need to come in and say, hey, no, we need to draw a line. This is what we don’t do and this is what we do. So, yeah, that’s a big issue with BI tools that I’ve seen.
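The percentage-change example from the conversation can be sketched in a few lines. Everything named here (`pct_change_dashboard_1`, `pct_change_dashboard_2`, the `METRICS` registry, `compute_metric`) is hypothetical and only illustrates the idea of a single source of definition; it is not how any particular metrics layer product is implemented.

```python
from typing import Callable, Dict


def pct_change_dashboard_1(a: float, b: float) -> float:
    """One analyst's definition: (A - B) / A."""
    return (a - b) / a


def pct_change_dashboard_2(a: float, b: float) -> float:
    """Another analyst's definition: (B - A) / B."""
    return (b - a) / b


# A single shared registry: consumers ask for a metric by name
# instead of re-deriving the formula in their own dashboard.
METRICS: Dict[str, Callable[[float, float], float]] = {
    "pct_change": pct_change_dashboard_1,
}


def compute_metric(name: str, a: float, b: float) -> float:
    """Look up the agreed definition and apply it."""
    return METRICS[name](a, b)


if __name__ == "__main__":
    a, b = 100.0, 80.0
    print(pct_change_dashboard_1(a, b))      # 0.2
    print(pct_change_dashboard_2(a, b))      # -0.25
    print(compute_metric("pct_change", a, b))  # 0.2, the one agreed answer
```

The same two inputs produce 0.2 in one dashboard and -0.25 in the other, which is exactly the kind of mismatch that then costs time to investigate.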

Ryan: And we talked about tools in general before we got on this podcast as well, about how there’s an insane amount of data tools out there. I think the BI tools are one of the most saturated in the market. Yeah. If you want a way to report something, let me tell you, I’ve got a little graph I can pull up to show you. So what about the other one, the data warehouse side? What’s that like?

Joseph: So, I mean, this is kind of key, right? The data warehouse layer has to be accurate and it has to be modeled well. Something that has surprised me over and over again is that people don’t model their data right. And modeling is not super complex either. You just have to understand certain facts and dimensions about your business. There’s this book called The Data Warehouse Toolkit. You read that, and then it’s very straightforward. It’s not super complex.

Ryan: Why is that so surprising to you? What’s going on that makes people not do the right thing in modeling, I guess?

Joseph: To me there are two main reasons. One is that people don’t know that there exists this book that can get you 90% of the way there if you just follow it. The other, I think, is urgency. The business is like, oh, we need it fast, we need it today, we need it yesterday. Most of the time it’s: we needed it yesterday and you’re telling me today. So those are the two main reasons. I think everyone should read the Kimball Data Warehouse Toolkit book. Some of it doesn’t apply anymore because they were optimizing for cost, and storage is super cheap now. But a lot of it, like facts and dimensions and slowly changing dimensions, is still really relevant to this day. Just read that, and then maybe Inmon, the data mart stuff, and you’re set pretty well. I’ll tell you, it’ll put you in the top 5% at least in data modeling. The tricky thing about data modeling is that you need to understand how your business works before you start. How does your business make money? What are the processes that happen in your business? For example, in e-commerce, take the checkout. If you think about checkout, there are two levels of data, or granularity. At checkout you have order-level data, so it could be: this order was placed for this much money on this day. And then there’s item-level information. You’ve got to think about these nuances: there might be item discounts, and there might be different boxes for different items. Thinking about all that takes time and a lot of talking with your business users. So I think that’s something that needs to be done upfront, and it’s easier to do it upfront than to change it later downstream, because if you have to change it, you sometimes have to change the downstream consuming systems, and it just gets really messy. And one of the key reasons for complexity is that people rush to get their results, which they do at first. But then after a few weeks or a few months, you want to do more analytics, you want to change the meaning of certain rows, and it gets really tricky. And another reason for complexity I’ve seen quite often.
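The two levels of granularity in the checkout example can be sketched as follows. The `OrderItem` fields and the sample values are assumptions made for illustration, not a schema discussed in the episode; the point is only that item-level rows can always be rolled up to order level, while the reverse is not possible once the detail is lost.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class OrderItem:
    """Item-level grain: one row per item within an order."""
    order_id: int
    sku: str
    quantity: int
    unit_price: float
    discount: float  # discount applied to this line, in currency units


def order_totals(items: List[OrderItem]) -> Dict[int, float]:
    """Roll item-level rows up to order-level grain: one total per order."""
    totals: Dict[int, float] = defaultdict(float)
    for item in items:
        totals[item.order_id] += item.quantity * item.unit_price - item.discount
    return dict(totals)


if __name__ == "__main__":
    items = [
        OrderItem(order_id=1, sku="A", quantity=2, unit_price=10.0, discount=0.0),
        OrderItem(order_id=1, sku="B", quantity=1, unit_price=5.0, discount=1.0),
        OrderItem(order_id=2, sku="A", quantity=1, unit_price=10.0, discount=0.0),
    ]
    print(order_totals(items))  # {1: 24.0, 2: 10.0}
```

Deciding the grain upfront, and storing the finest grain the business will need, is what avoids the messy downstream changes described above.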

Ryan: The data warehousing toolkit, who’s it by? What’s the.

Joseph: It’s The Data Warehouse Toolkit. It’s by Ralph Kimball. He’s, like, the guy who invented data warehouse modeling.

Ryan: Gotcha.

Joseph: It’s not super recent. It’s more than 20 years old right now, I think. But it’s still pretty solid.

Ryan: Well, I think even some of the old stuff out there is good too, because if you’re telling me that I can get to, you know, 89% of what I need just by following some of those fundamentals, then people should be perking up and listening.

Joseph: Yeah, I always recommend that book to anyone. Even in interviews, right? You go for data engineering interviews and you get a bunch of questions that relate to that.

Ryan: Yeah, in interviews all you have to do is just repeat stuff that you’ve read and sound like you’re an expert.

Joseph: Books and eBooks are one of my favorite resources to get you to the next level quicker than having to make the mistakes yourself.

Ryan: Well, okay. So are there any other things you notice on the warehouse side before we move on to data pipelines?

Joseph: So on the warehouse side, again, there is premature optimization around cost. I hear a lot of noise about Snowflake being super expensive. And it is. But compared to the cost of engineering hours, I would say it’s still not as expensive as people make it out to be. I do understand that if you abuse it, like if you run tons of tests on it all the time on a very large warehouse, yes, you’re going to have more costs, but there are ways you can reduce that. You don’t want to overly optimize for that and be like, oh, I’m not going to use Snowflake; I’m going to, I don’t know, write my own Python transformer, and it’ll save money. Well, yes, it might. But now you have to figure out how to deploy it, how to run it, how to orchestrate and test it. You have to develop it. There’s a lot of cost associated with that as well. And for a lot of startups that have money but not time, you just want to get stuff out quick. So I’ve seen people over-optimize for that. You probably want to figure out what your business needs are at the moment and optimize for that instead of optimizing for transformation efficiency, etc.

Ryan: So last on the list, which is where you deal with a lot, is data pipelines. I know they’re probably all over the place, but traditionally data engineering teams are the captains of their data pipelines, so to say.

Joseph: Sure two data pipelines and we can talk about batch both. Right. So like when you think about data pipelines these days, that’s like a huge divide. I feel like like that’s people who wanna write like pure code but not pure code. By code I mean like inspire cutter, skull or Python. And then there are people who want to do like sequel. It’s almost becoming like a takedown thing. Like, Oh, sequel is not testable at all. Python is slow that a pros and cons, but like I want to talk about that because like I see people arguing over that. But then there is also a case where you have to kind of determine what’s best for, but what’s the best tool for your use case. Like, you can’t, you shouldn’t write Python code to do like common standard grew based on terabytes of data. You just use your sequel and that’s another reason for complexity. I’ve seen a data pipeline like people want to do everything in Python and like you shouldn’t how to do everything in pipeline because your code is not going to be optimized. A C sequel written in C++ or Java, depending on the orientation. Right. So that’s another. Issue that causes a lot of time to develop. And also our code is not going to be like super optimized and it’s not going to be always correct. Like sequel. Sequel is pretty standard. So I’ve seen like this kind of like SEC mentality there. So try to avoid that and go with sequel when you want to turn out tons of data or go with Python, when you just want to operate on one draw at a time. The other thing is, again, not having proper infrastructure for local testing. So if you have like a snowflake cluster, right, it’s hard to test locally. That’s why people use DVD. You can test locally easily. So making sure that developer ergonomics are well set up is is crucial as well because all of the thing data engineering is known for is like running a pipeline and waiting for hours. You want to have something local so that you don’t wait for hours and at the last step it fails. 
It’s kind of frustrating and wastes a lot of time. The other one is, again, going back to the whole using-Python argument, I think: over-optimizing on web development practices. So in web development you have reusable code everywhere, and you have, what is that, a specific pattern people use, like data access objects, data transfer objects. It’s how you define your data in a web development system. But when you bring that to data pipelines, you don’t really need to split up a query into three pieces and reuse it everywhere; you can just write a standard query. So making those tradeoffs, and knowing when to make those tradeoffs, is where I see a lot of issues too, because people tend to either over-optimize or under-optimize. But it’s a hard line to define because it comes with experience. I cannot tell someone what to do without actually knowing what their entire data pipeline is. The other thing is, you know, standard things, like try to use [inaudible], try to do some testing on every pull request. Not having those will cost you a lot of headaches, because if it fails in production, now you have to redo it all over again.
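
To make the two points above concrete (let the SQL engine do the set-based work, and keep a fast local check so you are not waiting hours on a warehouse run), here is a minimal sketch. It is not code from the conversation: the `orders` table, the `run_customer_totals` and `make_local_db` helpers, and the use of Python’s built-in `sqlite3` as a local stand-in for a production engine are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical transformation: total order amount per customer.
# The aggregation is expressed in SQL so the engine does the group-by,
# instead of looping over rows in Python.
CUSTOMER_TOTALS_SQL = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
"""


def run_customer_totals(conn: sqlite3.Connection) -> list:
    """Run the aggregation query and return (customer_id, total_amount) rows."""
    return conn.execute(CUSTOMER_TOTALS_SQL).fetchall()


def make_local_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with sample rows for a quick local check."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],
    )
    return conn


if __name__ == "__main__":
    conn = make_local_db()
    # Runs in milliseconds on a laptop; the same query text could later be
    # pointed at the real engine, subject to SQL dialect differences.
    print(run_customer_totals(conn))
```

A check like this could run on every pull request, so a broken query is caught before the long production run rather than at its last step.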

Ryan: Yeah, failures in production are probably the number one thing you want to avoid.

Joseph: Pretty stressful, because everyone’s like, “Oh no.”

Ryan: Okay, have you ever been on, have you done the on-call train? Oh yeah. Yeah, of course. I shouldn’t even ask, ’cause of course.

Joseph: That’s the reason I had to move our call: I was on call. Remember our first one? We were supposed to have this call then.

Ryan: Yeah, that’s right. Yeah. It was your on-call.

Joseph: It’s like, especially if you are a new engineer, you need to kind of figure out what’s going on and then figure out what the issue is.

Ryan: “Hey, you’re up. You’re on call from 1 a.m. to 5 a.m.” “What am I on call for?” “Figure it out. You’ll get your budget. We will tell you what you’re on call for.”

Joseph: But I also feel like on-call was the best way I got to learn a lot of things. But yeah, it was positive in the best way possible, if that makes sense.

Ryan: Like being thrown into a fire and figuring a way out of it. Right, but that’s the best way to learn at the same time, right? Yeah. I remember back in the day when I was a software engineer, I was on call for production deployments for whatever we did, you know, major pushes to prod, just in case, to do all of the production testing as soon as it got pushed out, because I did all the testing prior, you know, in staging. So I was the first line, the first person to go into prod and make sure that everything was okay. And then I also at times had to be the unfortunate person to say, “Hey, there’s a bug.”

Joseph: Those weekly or biweekly pushes are rough because you have to do a bunch of testing. Yeah, well, good times, bad times, good times. Another thing for data pipelines: try to make the pipelines not produce duplicate data in case you run them multiple times. It’s also called idempotency. A lot of data people are familiar with it. But basically, if you run the pipeline two times with the same input, it should not create twice the data. That’s basically it. And also, make sure your data pipeline runs can run independently. What that means is, if you’re running a data pipeline for a certain input, let’s say day one, and you run the data pipeline for day two, they shouldn’t collide. There are some cases where they have to, but in most cases they shouldn’t. Like, if you want to do some sort of look-back reconciliation, they have to. But in general, they won’t. So those are my tips for keeping data pipelines simpler. Yeah, but it inevitably gets complex. That’s just the nature of software, you know; over time, it just gets complex. It is up to the engineers to refactor, think through things, and keep it simple.
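
One common way to get the behavior described above is to have each run replace its own date partition instead of appending to the table. The sketch below is an illustration under stated assumptions, not Joseph’s implementation: the `daily_sales` table, the `load_partition` helper, the sample dates, and the use of Python’s built-in `sqlite3` are all hypothetical choices.

```python
import sqlite3


def make_table() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical daily_sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_sales (run_date TEXT, product TEXT, units INTEGER)"
    )
    return conn


def load_partition(conn: sqlite3.Connection, run_date: str, rows: list) -> None:
    """Idempotent load: replace the run_date partition rather than append to it."""
    # `with conn` wraps both statements in one transaction: it commits on
    # success and rolls back if either statement raises.
    with conn:
        # Only rows for this run_date are touched, so runs for other days
        # are independent of this one.
        conn.execute("DELETE FROM daily_sales WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_sales (run_date, product, units) VALUES (?, ?, ?)",
            [(run_date, product, units) for product, units in rows],
        )


def count_rows(conn: sqlite3.Connection) -> int:
    """Return the total number of rows in daily_sales."""
    return conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]


if __name__ == "__main__":
    conn = make_table()
    day_one = [("widget", 3), ("gadget", 1)]
    load_partition(conn, "2024-01-01", day_one)
    load_partition(conn, "2024-01-01", day_one)  # rerun with the same input
    load_partition(conn, "2024-01-02", [("widget", 5)])
    print(count_rows(conn))  # 3 rows: the rerun did not duplicate day one
```

A look-back reconciliation job, which Joseph mentions as the exception, would deliberately touch more than one partition and so would not fit this simple per-day pattern.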

Ryan: Well, I appreciate you walking through all those. I think it’s cool that in this podcast we got to talk through the different layers of where things could go wrong. Yeah, the stakeholder layer, the warehouse layer, the pipeline layer, knowing that the stakeholder layer is the number one thing that we focus on. And then walking through some examples of how to reduce the most complexity. So I really appreciate you walking through those. With all that said, we said a lot. What would be the one thing that you want listeners to take away from today?

Joseph: Well, I want to say two things. Yeah. Okay. The first thing is, when you’re building a data pipeline, think backwards: start with the end goal. Like, think backwards, right? What are you trying to achieve? And also, when you are thinking through things, there is one happy path, like, everything has to go well for this to work. But there are multiple things that could fail. Always think through these failure paths. I think someone told this to me: there’s one happy path, and there are multiple failure paths. Think through that, and try to avoid adding new systems or new patterns unless absolutely necessary, because that will create a lot of confusion. You want your systems to be as simple and boring as possible while also hitting your SLAs. That’s pretty much it. Those are the two things.

Ryan: Sweet, man. Yeah, the happy path, I like that. I like that saying: there’s one happy path, but there are multiple bad paths or disaster paths or ways it could go haywire. Right. And so thinking through all those different types of risk scenarios is important to keep those things going well. How can people connect with you? Is it LinkedIn or your Substack? How can they connect with you?

Joseph: I have LinkedIn. I mean, it’s just Joseph Machado, it’s my name. So search on LinkedIn and you’ll find my profile. I also run my own blog, StartDataEngineering.com. I’m somewhat active on Twitter, but mostly LinkedIn, because on LinkedIn I can put longer-format content. So yeah, mostly LinkedIn and my blog.

Ryan: Sweet, man. Well, hey, I really do appreciate you coming on the podcast. It took a while, but again, congrats on the new baby. Very exciting. Thank you. Yeah, it’s wild and crazy, but it’s a lot of fun, man. And hopefully we do this again. But congrats on the success. Everyone on the podcast here, definitely check out Joseph’s Substack.

Joseph: It’s not a Substack. It’s just like...

Ryan: Oh, it’s a website, look at that. He’s got a website. Not a Substack; it’s just StartDataEngineering.com. Yeah. You can skip the Substack, and then check out all the posts he does on LinkedIn. And again, Joseph, thanks so much for being on, and we’ll talk again soon.

Joseph: Yeah, thanks for having me. I know it’s been a while with our scheduling. Thanks for working around that and it’s great speaking with you.