Lessons Learned Building World-Class Data Engineering Teams

Sid Anand, Chief Architect & Head of Engineering at Datazoom, dropped by to share his lessons learned building world-class data engineering teams at eBay, Netflix, LinkedIn, and PayPal. For anyone who’s leading a team now or in the future, you don’t want to miss Sid’s insights from working at some of the world’s biggest brands.


About Our Guest

Sid Anand

Chief Architect & Head of Engineering, Datazoom

Sid Anand currently serves as the Chief Architect and Head of Engineering for Datazoom, where he and his team build autonomous streaming data systems for Datazoom’s high-fidelity, low-latency streaming analytics needs. Prior to joining Datazoom, Sid served as PayPal’s Chief Data Engineer, focusing on ways to realize the value of PayPal’s hundreds of petabytes of data. Outside of work, Sid is a maintainer/committer on Apache Airflow and advises early-stage companies and several conferences (QCon, Data Council, and conferences under Skills Matter).

 

Twitter: @r39132

Episode Transcript

Ryan: Hey everyone, welcome back to the MAD Data Podcast. My name’s Ryan Yackel. I am the host of one of the coolest podcasts in data, I think, that’s out there. I don’t know, maybe it’s not. But I think, Sid, that’s why you’re here, right? You’re very excited to be part of this podcast, I guess, I believe.

Sid: Thank you for having me. I feel honored to be one of your guests.

Ryan: Well, we’ve been trying to get together for a while, and there’s been a lot of stuff going on. The last time we tried to get together, it was like, man, I’ll do it next week, and that turned into a month from now, and then a month and a half from now.

Sid: It started before, you know, your company was bought, so, you know, congratulations.

Ryan: Yeah, that’s true. Yes, things have changed. Things have changed a lot in the Databand space, being acquired by IBM. But, you know, Sid is the chief architect and head of engineering over at Datazoom, and so the title of today’s podcast is really lessons learned from building world-class engineering teams at eBay, Netflix, LinkedIn, and PayPal, and now obviously at Datazoom, where Sid is today. And one of the things, Sid, that attracted me to you was just your huge career that has spanned these huge brands in data engineering, especially the data engineering teams out there that really want to be a part of these companies. You basically worked at almost all of them. It seems like you were at all these guys, right?

Sid: Yes, I’ve had the privilege of working at some places during their growth. And for me, that’s the best time, or the most fun, to join a company, when it’s going through some sort of hypergrowth. There’s more work than people, and everyone bands together to solve common problems. And, you know, sometimes when companies get bigger, things slow down, and that kind of blunts the excitement.

Ryan: Well, at Datazoom, I know you’re in a high-growth company right now. Tell us a little bit about that. And I know you were talking before about your experience, what got you here today, specifically some of the stuff you did at PayPal. But tell us what Datazoom is, what you’re doing over there, and the problems you’re solving.

Sid: Sure, I’d love to. So, you know, one thing people joke about Netflix and companies like it is that it’s like a logging company. And it’s sort of funny for people to hear that, because most of the bits for video delivery don’t go through the cloud infrastructure that Netflix runs on Amazon; they come through the networks. But when you start playback of any kind of video, there is a stream of telemetry being sent between your device and Netflix’s home base. And that’s what they refer to as the logs, the video playback logs. There’s a very large stream of that coming back to Netflix, and a large portion of Netflix’s infrastructure is dedicated to managing the scale of that traffic and making sense of it through insights, especially real-time insights. Other companies have also figured this problem out, like YouTube, but there are a variety of public and private video broadcasters out there that want to deliver video everywhere, and they don’t have maybe the know-how or the time or the investment to figure out how to capture telemetry for playback, to answer things like: are my ads running too long? Have my ads been effective? Are my viewers getting bogged down by rebuffering events? Are certain ISPs cutting off bandwidth? That sort of information is what Datazoom is in the market to help with: to bring the know-how from Netflix and maybe other companies of that sort to the general market for video delivery, for QoE, which is quality of experience of video, as well as the performance of things like ads.

Ryan: And you were saying, if I understand correctly, and maybe this is not the best way to describe it, but it’s almost like you’re packaging up all of the best practices around video delivery and saying to customers, hey, you don’t need to go build this; let us help you just plug in to what you’re trying to do. Is that a correct way to view it?

Sid: So you can think of it as maybe two sides of this. The first side is the production of the video, putting it on a CDN, and delivering that video to a customer. When the customer hits the play button, wherever they are, that’s when we get involved. So at that point, the video player will send us a stream of telemetry. We gather it in near real time and deliver it to wherever that video broadcaster wants that data. And we guarantee, for example, that we can capture this data from anywhere in the world and that it will be delivered without any loss. So that’s how we fit in. The engineering side of this problem is twofold. One side of it is building a reliable SDK for every single video player on every single platform: every mobile platform, be it Android or iOS, proprietary platforms, and, for JavaScript alone, something like 20 different JavaScript players out in the wild that people use. So we have to build for that matrix to make sure we capture every signal out there. And then the other side is a very reliable, scalable, high-fidelity, cloud-based streaming data pipeline, where all of these SDKs can send their data and have that surety that it will be delivered without any problem.
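
To make that concrete, here is a minimal sketch of what one playback-telemetry event might look like on its way from a player SDK to a collection pipeline. The field names and endpoint are illustrative assumptions, not Datazoom’s actual schema or API.

```python
import json
import time
import urllib.request
import uuid

# Hypothetical playback-telemetry event; field names are illustrative,
# not Datazoom's actual schema.
event = {
    "event_type": "rebuffer_start",   # e.g. play, pause, rebuffer_start, ad_start
    "session_id": str(uuid.uuid4()),  # one playback session on one device
    "player": "some-js-player",       # which of the ~20 JavaScript players
    "platform": "android",
    "timestamp_ms": int(time.time() * 1000),
    "position_ms": 417250,            # where in the video the event occurred
}
print(json.dumps(event, indent=2))

# Post it to a collection endpoint (placeholder URL).
req = urllib.request.Request(
    "https://collector.example.com/v1/events",
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment against a real endpoint
```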

Ryan: One of the things you were talking about was this zero-loss, low-latency delivery that you learned over at PayPal. You took a lot of the stuff you did there and applied it here. There are two things I want to talk about. One is, how were you able to do that? That sounds really cool, very exciting. And two, you also have a philosophy over at Datazoom around this zero-bug philosophy, which is very hard to do. Obviously you know that from being a performance engineer in the past, like we talked about before. Could you talk about both of those topics? They were really, really fascinating when we first went through the outline of this podcast.

Sid: Sure, right. So one of the projects I worked on while at PayPal was to build change data capture for the company. Change data capture is the ability to capture all database changes and then essentially generate a stream of events for each of those changes that can be consumed by real-time apps. For example, if you’re using the PayPal app today and you want to look at your activity, the activity could be a purchase you made online, it could be a money transfer, it could be a crypto purchase, all of these sorts of things that you do. Most people, when they’re dealing with any kind of financial app or payment app, want to see the activity, right? They want that peace of mind that the payment went through or the crypto asset was purchased and all of that sort of stuff.
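
As a hedged illustration of the idea, here is roughly what one change event might look like. The shape loosely follows common open-source CDC tools such as Debezium; it is not PayPal’s internal format, and the table and field names are made up.

```python
# Illustrative CDC event for an UPDATE to a hypothetical "payments" table.
change_event = {
    "source": {"db": "payments_db", "table": "payments"},
    "op": "u",                       # c=create, u=update, d=delete
    "ts_ms": 1660000000000,          # when the change was committed
    "before": {"payment_id": 42, "status": "PENDING"},
    "after":  {"payment_id": 42, "status": "COMPLETED"},
}

# A real-time app (e.g. an activity feed) consumes these events from a
# stream instead of querying the database directly, so one DB write can
# fan out to many readers without adding query load.
def apply_to_activity_feed(event: dict) -> str:
    if event["op"] == "u" and event["after"]["status"] == "COMPLETED":
        return f"Payment {event['after']['payment_id']} went through"
    return ""

print(apply_to_activity_feed(change_event))
```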

Ryan: That happens to me a lot, by the way. When I’m using things like PayPal or Venmo or whatever, a lot of times my wife will be like, hey, go pay this person. I’m like, okay, and I’ll hit the button. Okay, I paid them. And then I’m refreshing to see it go through, because I don’t want somebody to bug me if I didn’t actually pay them. I have this experience multiple times during the month when I’m trying to pay people.

Sid: Yes, it’s very important. And, you know, PayPal has all of this in a very large Oracle fleet. When I was there, the data in Oracle itself was 20 petabytes. That’s 20 quadrillion bytes. For many companies, that’s larger than their big data Hadoop footprint, and this was actually sitting in SANs supporting a very large Oracle fleet. And the problem was, PayPal is used in 206 countries, like every country the Department of Commerce allows a U.S. company to operate in. It’s used all over the world, and there’s no time of low traffic; it’s used in every time zone. So it’s really important to be able to scale that database for things like queries. If all of, you know, 500 million consumers and merchants are opening the phone to see their activity, is each of those going to be a database query? That’s very expensive. So the way to scale that out, the pattern, is to generate an event stream from it and scale that event stream. And that was the project that I led. They had tried for many, many years, almost a decade I think, to get it right. But when I was there, I had the opportunity to build a very talented team that was cross-functional, and we could span the database realm, the offline analytics realm, and also the real-time apps realm. And that work involved not losing a single transaction, right? It was built on Kafka, and it used things like Storm, but it needed to be as reliable as hitting a database. It had to have the same availability, which was four nines. It had to be operable by a small and lean team; we had no support staff when we launched it. It just had to work, and be very low latency. I think we measured latency at one point in our core system at about 500 microseconds p95. That’s extremely low latency. If you can guarantee that, you’ve already won, but then you also have to guarantee you don’t lose a single thing and that you can scale with traffic.
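
For readers who want the arithmetic behind those nines, each availability target implies a fixed downtime budget. A quick sketch:

```python
# Downtime budgets implied by common availability targets.
# "Three nines" = 99.9%, "three and a half" = 99.95%, "four nines" = 99.99%.
TARGETS = {
    "two nines (99%)":           0.99,
    "three nines (99.9%)":       0.999,
    "three and a half (99.95%)": 0.9995,
    "four nines (99.99%)":       0.9999,
}

for name, availability in TARGETS.items():
    down_frac = 1 - availability             # fraction of time allowed down
    per_day_s = down_frac * 24 * 3600
    per_year_min = down_frac * 365 * 24 * 60
    print(f"{name}: {per_day_s:6.1f} s/day, {per_year_min:7.1f} min/year allowed down")
```

Three nines works out to roughly a minute and a half of allowed downtime per day; four nines is under nine seconds a day, or about 53 minutes a year.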

Ryan: That’s fascinating. Whenever I look at all the nines, when people say that, I’m always like, I have no idea. Initially it’s like, oh, two nines is great. And then it’s, oh, we need three nines, four nines. I’m like, man, I can’t even fathom that. I can’t even comprehend how amazingly precise that is, from a performance perspective, that you’re able to do that. Before you did that, though, like you said, they’d been trying to do this for a while. What were the challenges you saw coming in that needed to be solved with maybe a different methodology or a different technology? What were the things that stuck out right away about why they weren’t able to do this until you put this in place?

Sid: That’s a great question. That’s really a great question. This kind of has to do with large-company organizational dynamics and ownership. So typically the online database is owned by one team. And, I wouldn’t say it’s typical, but in a lot of companies today there’s one head of data, and that head of data has online, offline, and what you could call near-line ownership. So he or she would own the transactional databases, all the stuff that needs to be up all the time, and also own some sort of data warehousing, reporting, BI side of things. And of course there’s the data movement piece that plugs these two together, which was just being formed or created. When I joined PayPal as their chief data engineer, they had different orgs reporting to different customers with different requirements, and they were essentially thinking like different orgs. The goal was to span these orgs and to build a team that could speak the language of each group and leverage the talents and skills of each team. And that was a real challenge, right? Because the database people, the DBAs, would speak in one language, and the BI folks wouldn’t really understand it. And I think this is really what’s emerged as a varied set of skills across these groups. The data warehouse people and the people doing analytics are using frameworks like Spark, for example, or Flink, that already build in availability, scalability, fault tolerance, all of that. It’s under the hood; they just use the syntactic sugar, the frameworks on top, to build their apps, and that’s what they’re used to. Database people, and I’m talking about DBAs, have almost no concept of these high-level apps and frameworks. They’re concerned with very low-level things like fault tolerance and write-ahead logs. At PayPal, for example, the databases were RAC clusters that were replicating data through, I forget the technology, but there were two types: either block replication or log replication. These are very detailed things, and they were doing cutovers all the time. So that’s the other really interesting challenge we faced: how do you keep a stream running 24/7 when the upstream source is constantly changing? We had to build this whole discovery intelligence so that whenever a stream stopped, we could find out where the stream for that data was now running from, because it had been cut over to a different RAC cluster. To give you a sense of scale, PayPal’s databases had 70,000 different tables. 70,000 different tables, if you can wrap your head around it.

Ryan: That’s a lot.

Sid: And the size of the company was around 12,000 people, excluding customer support. Any of those people wanted to be able to stream any of the 70,000 tables on demand to some destination. And that’s the self-service app that we built: the ability for anyone to come in and say, you know what, I’m interested in this table and I want it to go to either Hadoop or to this Kafka stream that my app can read from. That was the scale we were working with, right? A lot of tables, a lot of data, a lot of customers, very high expectations of uptime and latency.
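
Here is a rough sketch of the kind of request such a self-service app might capture. The names and shape are assumptions for illustration, not PayPal’s internal API.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical shape of a self-service streaming request: a user picks a
# source table and a destination, and the platform wires up the pipeline.
@dataclass
class StreamRequest:
    source_table: str                        # one of the ~70,000 tables
    destination: Literal["hadoop", "kafka"]  # where the change stream should go
    target_name: str                         # HDFS path or Kafka topic
    requested_by: str

req = StreamRequest(
    source_table="payments.transactions",
    destination="kafka",
    target_name="payments.transactions.changes",
    requested_by="analyst@example.com",
)
print(f"Provisioning stream: {req.source_table} -> {req.destination}:{req.target_name}")
```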

Ryan: Was there any, I mean, obviously you were coming in and rocking the boat a little bit, and obviously it turned out very, very well for PayPal. Was there any pushback you were dealing with, where you were like, I don’t know if this is going to work, or, I don’t think we should do that? It seems like you’re talking more about organizational, people-and-process stuff you had to navigate than the tech stuff.

Sid: I think with any large company you’re going to face that. I think that tends to be the biggest challenge. And I think that’s also why they had someone in a chief data engineer role running this, because an engineer like me, for example, has worked in startups and has been hands-on even at larger companies throughout my career. So I can speak to developers; I totally understand how developers think, and I know how the senior architects think. And I can communicate to the execs in the room that this is what’s not working. That was part of the challenge at large companies with many levels: somewhere in the middle, the message is lost, or there are challenges that don’t get bubbled up. And really, sometimes you have to escalate something to unblock something, and that message would get lost upstream. So I think any company where data is important needs to have these leaders who can talk to engineers, talk to architects, talk to a variety of folks, figure out what their pain points are, take that set of pain points to where the heads meet, and say, you know, your two teams are not working well together at this level, and here’s what we need to do to change it. And this is part of any large company. At small companies, you don’t have this sort of problem.

Ryan: Sort of problem. Yeah. Yeah. No, I. I totally understand the small company vibe, even even when companies, I mean, I, we’re just SERPs as well where I was a 10th person hired and then we scale 150 and then into 300, 500, a thousand in thousand plus, you know, obviously is in a whole new world, but even those jumps from 150 to 303 hundred or 500. Those are also big jobs. Huge, huge jobs. I mean, people think that even when you’re at 500, 300, you’re still operating. You can still operate in a way that it’s like, no, it’s there’s a lot more layers that are now in between that you’re going to have to navigate now. So I, I totally get it. The, the, that type of stuff doesn’t, doesn’t just happen at thousand plus companies. It happens pretty is our time pretty early as a after you get to like 300 people.

Sid: What are some of the things you noticed during those jumps? I mean, I think those jumps are significant because they’re a 3x, 4x, 10x increase in people at each hop. What have been some things you’ve seen?

Ryan: I mean, I’ve always been on the go-to-market side. So I’ve been on the, hey, we have this really cool product and we’re all very nimble, side, able to solve customer requests in a way where we don’t need to pass everything through some big requirements board, where it then goes to the product management team to decide, okay, these are the three things we’re going to justify doing this quarter. It was like, no, hey, this client over here needs this particular feature; we can just roll it into the next sprint and give it to them pretty quickly. And what’s interesting about that is you have a lot of customers who love you for it. Initially they’re like, oh, this is awesome, all the attention you’re giving us and how rapidly you push things out. But then as you start to scale, you need more processes involved, and you’re going upstream to more complex accounts. People can’t just drop everything to satisfy customer A when they have customers Q, Z, and Y with totally different requirements they have to go solve. So on both the tech side and the go-to-market side, that’s a main one I’ve seen: getting buy-in, I think, is tough for any big change you want to make. And there’s a book I just read that I think is really good. It’s called Never Split the Difference, and it’s all about negotiation. Usually people think negotiation is a bad word, or that it’s selfish, like, oh, you need to win the negotiation, so it’s self-serving to you, when in reality what you’re trying to do in negotiation is influence the person to see your side of the perspective. And when you see both perspectives, then a resolution comes. A lot of times walls get thrown up because people don’t understand your perspective. So it’s all about getting that perspective clearly seen on both sides so you can progress. Because a lot of times, if that doesn’t happen, it just falls flat. If there’s no pushing forward and keeping things going with constant dialogue around your perspective, it’s just going to die, and people are not going to go for it. So that’s a quick book recommendation I have. It’s really good.

Sid: Thank you very much. I’ll check it out.

Ryan: Really good. It’s based on this guy who was an FBI hostage negotiator.

Sid: Oh, I’ve heard about this book. Oh, that’s good.

Ryan: Good. I’m telling you, it’s. His name’s Chris Voss. Really good. Really good. Sweet. And again, data engineers can use this. I mean, that’s all I’m trying to say. It’s universal in in applicability, not just, hey, negotiation for your next job or your next salary or whatever. It’s more of like, hey, I want to I’m negotiating my perspective versus your perspective. Let’s talk about this. And let me say things in a way that. Gets you to understand where I’m coming from.

Sid: And also translating it for them, so they understand the value for them, right? This is why what I’m saying is important to you. Especially in a case like hostage negotiation, right?

Ryan: Oh yeah. No, I mean, that’s a big thing, and we’re kind of derailed right now, but it’s good, because there’s so much change going on in data engineering that you have to manage change and be agile in ambiguity. And yeah, one of the things they talk about is communicating in a way that helps people understand that when you’re asking questions, or asking people to clarify things, you’re doing it to try to understand their perspective. You’re not doing it to back them into a corner and then say, oh, I gotcha, see, you’re wrong. No, it’s about, when they have a problem that can’t be resolved, instead of saying, well, we could do this and this, asking: how would you go about solving this problem?

Sid: I like that, yes.

Ryan: How would you do that? And, okay, how would you expect my team to solve that? Because a lot of times there’s a problem, and people know you can solve the problem, and they just want to say, go figure it out. Well, hold on, let’s talk about how we could figure it out together. That way we’re both on the same team, and it’s not this person telling that person, go get it done, because then they’re going to push back. So inclusivity around how you can solve problems together, and just how you frame the question back to them, is really, really good.

Sid: The Socratic method. Yeah, this is excellent.

Ryan: Socrates, right? Yeah, it’s exactly what it is, exactly what he talks about, the Socratic method of going about it. But no, it’s good that we have the other side of it on this. What’s really cool, too, is that there’s a guy named Damon over at FanDuel. He’s one of the data engineers over there, and FanDuel is one of Databand’s customers, and I met up with him for happy hour a while ago. This is pretty crazy. I’d given him a book recommendation called The Making of a Manager, based on, I can’t remember her name, a woman over at Facebook, and she details her process of becoming a manager at Facebook. So then I got on a call with him, just to connect, and I said, hey, I’ve got another book for you. And he goes, oh, I have one for you too. He goes over and grabs the book, and it’s that book, the Never Split the Difference book. And I went, oh my gosh. So here’s a senior data engineer over at FanDuel who’s literally reading the same book that you and I are talking about. It all came together.

Sid: This is super cool.

Ryan: All right, trying to get back on track. One of the things I want to get to, one part here specifically, is what you’re doing over at Datazoom around this concept of safety nets, having overlapping safety nets for high performance. And then we’ll get back into your talk. But I know you said you learned a lot at PayPal and brought it over to Datazoom, and this area right here was pretty cool, talking about the safety nets you have in place.

Sid: Sure. So I think one of the challenges in software engineering, and it’s become better over time, but if we step back, say, 15 years ago: operations and development operated differently, right? Development was measured by how quickly they could get features out, and ideally how quickly they could get features out with very low bug counts. And operations managed risk. Their goal was to keep systems up and running within the published SLAs. And in between you had QE and release engineering. QE would vouch for development, and release engineering would figure out how to take those artifacts and push them into production with minimal disruption. If you think of operations, and this is the way I think of operations, they own three major pillars. They own change management, which is the way new releases get pushed to production. They own continuous improvement, which is all the processes and technology and innovation around making things better, faster, cheaper. And they own incident response, which is how do we minimize disruption to customers. Typically that’s measured by things like MTBF, mean time between failures, and MTTR, which could be mean time to restore a service or mean time to resolve an issue; I look at both of those, to be honest. So for something to work with very high quality, you need both sides to work very closely. And I think the first foray into this was the term DevOps, which actually changed from maybe its original intent to what it is today. Originally the idea was to take down this wall between ops and development and make them closer partners. This was spearheaded by the folks at Etsy; I think John Allspaw was one of the big proponents of this model. And it morphed into: well, the DevOps team will write tools and put those tools in the hands of developers, so that developers build and operate their own systems. That’s what Netflix ended up doing. Netflix went with that model. When I was there, they essentially disbanded operations, had a dedicated DevOps team that built them awesome stuff like Asgard and other things, Chaos Monkey for one, pushed it into the hands of developers, and then said, okay, developers, you build and operate your own systems. And that’s how Netflix ran for the longest time. The way I’m taking this is that, yes, you do need a tools team, some people call it an SRE team, and you need that tools team to develop great tools, but you put those tools in the hands of ops, because you still need someone to own the risk. And any time there’s an issue in production, we treat it with what’s called a CFA, contributing factor analysis. We don’t use the term root cause; there is no single root cause to any one thing. There’s usually a bunch of factors contributing to the issue. So it’s a team ownership problem, not a pin-the-blame-on-someone problem. We also adopt the Google methodology of a blameless culture. Everything we’re doing is innovative; we may make an error, but we learn from those errors and we’re never punished for them. The goal is to teach everyone how to make things better. Any time there’s an issue, we launch a CFA, and a CFA involves filling out a document with observations, some analysis of those observations, followed by recommendations. And we measure how fast it took to restore service,
and then we measure how long it takes to resolve the issue. These are two different timelines, right? For three or four nines of availability, time to restore is very key. Time to resolve is important to ensure that we don’t have this issue ever again, and that usually involves the dev team writing code to ensure the problem never recurs. So essentially, with this model, what we found is that any time we do any kind of incident response, we have ops people there and we have dev folks there, and we figure out, okay, how many of these things can be fixed through software, and for the ones that can’t be fixed through software, can we have better processes in place in operations? It’s a team effort. We come up with multiple ways to fix the problem, and then we always say: all right, we found the contributing causes of this one issue and fixed a few of them, but let’s add two or three more levels of safety, so that if this fix fails, there are two more safety nets to catch it. As a result, the system we run today has very high availability. It typically takes four or five things to go wrong at the same time for us to have an outage or unplanned issue. And this concept of multiple safety nets is really key. You don’t leave the table just because you found the cause of this one issue. You think, okay, could a variant of this issue occur if something slightly different happened? Can we solve all of them together? Let’s figure out ways to add more observability, more alarms, and better processes: intelligence around deployments that will check health and automatically roll back if there’s a problem. Also, we have blue-green deployments now, so if we did a blue-green switch, can we have automation that detects, once we shifted traffic over, that there’s a problem, and automatically switches the weights in the load balancer? Things that can be done through code can be done much faster than by humans. And if you’re running a real-time pipeline, three and a half nines is something like a minute of downtime; I can’t remember exactly if that’s in a day. You can have about a minute of downtime a day in a three-and-a-half-nines system, and that means you have to detect problems and make decisions through software; you can’t do it through ops. It’s kind of a long, drawn-out answer to your question, but the basic tenet there is: dev and ops work together, solve everything you can through automation, and the remainder through processes.
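
Here is a minimal sketch of the automated blue-green rollback Sid describes. The set_weights and health-check hooks are hypothetical stand-ins for whatever load-balancer API and metrics a real system would use.

```python
import time

# Hypothetical load-balancer hook: set the traffic split between fleets.
def set_weights(lb, blue: int, green: int):
    lb["blue"], lb["green"] = blue, green
    print(f"traffic weights -> blue={blue}% green={green}%")

# Hypothetical health check, e.g. error rate pulled from metrics.
def is_healthy(check_error_rate) -> bool:
    return check_error_rate() < 0.01

def blue_green_switch(lb, check_error_rate, watch_seconds=60):
    set_weights(lb, blue=0, green=100)          # shift traffic to the new (green) fleet
    deadline = time.time() + watch_seconds
    while time.time() < deadline:               # watch health after the switch
        if not is_healthy(check_error_rate):
            set_weights(lb, blue=100, green=0)  # automatic rollback, no human in the loop
            return "rolled_back"
        time.sleep(2)
    return "promoted"

lb = {"blue": 100, "green": 0}
print(blue_green_switch(lb, check_error_rate=lambda: 0.002, watch_seconds=6))
```

The point of doing this in code rather than runbooks is exactly the latency argument above: with a downtime budget measured in seconds per day, a human paging cycle is already over budget before anyone has opened a dashboard.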

Ryan: Cool, no, that’s good. I guess for you, you mentioned a lot about DevOps. And again, I know there are a lot of buzzwords out there in the data space. What’s your definition of DataOps versus DevOps? Do you think they’re the same thing, or not the same thing? Or is it just another term people use now?

Sid: That’s a great, great question. So this is another funny thing. The people in software engineering came up with these great processes to deliver high-quality code with a high level of collaboration and low latency, like a high velocity of features, right? And somewhere off to the side, the data warehouse folks were learning and doing something completely different. Then the appetite for analytics grew, and it put all that pressure on these teams who, to a large extent, were not from the tribe of software engineers. So over time there’s been this retraining. What can we teach the folks in data warehousing and BI analytics and big data about code hygiene, processes like code review, merging, branch management, releases, patching? What can we bring to the table for data operability, for data observability, these two concepts? And then, okay, we have this DevOps team over here that writes a bunch of tools that can be used by dev and maybe SREs to diagnose and detect issues. Which of those can we bring over to the data warehouse side? That’s sort of what I see happening to some extent. But, you know, large companies, different companies, legacy systems, legacy thinking: it’s not the status quo in the majority of the industry. It’s still something that has to develop in a lot of companies. Right now, you have DevOps, or you have ops. In a typical e-commerce system, their goal is just to ensure that the site is up and processing customer interactions. On the data side, what tends to happen is that the issues that occur require a follow-up action: clean the data, notify all the subscribers of that data, have them rerun their jobs and re-clean, and all of that sort of stuff. So DataOps, as I’ve generally seen it operate, is really pipeline management, right? People have 50-60,000 jobs running a day. Something goes wrong: identify the owner, identify the subscribers, manage that whole interaction. It’s pretty painful, and companies tend to throw people at this problem. You can have companies with thousands of folks in data ops, doing a combination of enabling new pipelines, managing pipelines, all of that sort of stuff. Really what we need is better tooling around detecting problems and fixing problems. Today there’s a focus on data observability, finding data issues as early as possible. What would be nice is a notification chain, a chain of the subscribers of that data. We talk about metadata management and how data is related to data, but what we really need is how consumers are related to the data, and how we manage that when data goes bad. That’s still sort of missing; it’s nascent today. And because it’s missing, you throw people onto the data ops teams and just have them manage fires, and I don’t think that’s very sustainable.
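
To illustrate the missing piece Sid describes, here is a toy sketch of a consumer-lineage map walked breadth-first to find everyone downstream of a bad dataset. The dataset names are hypothetical.

```python
from collections import deque

# Hypothetical consumer lineage: which datasets/jobs read from which.
# This is the "chain of subscribers" that makes incident fan-out tractable.
SUBSCRIBERS = {
    "raw.payments": ["clean.payments"],
    "clean.payments": ["reports.daily_revenue", "ml.fraud_features"],
    "ml.fraud_features": ["ml.fraud_model"],
}

def notify_downstream(bad_dataset: str) -> list:
    """Walk the lineage graph breadth-first and list everyone affected,
    in the order their reruns would need to happen."""
    seen, queue, order = set(), deque([bad_dataset]), []
    while queue:
        ds = queue.popleft()
        for consumer in SUBSCRIBERS.get(ds, []):
            if consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order

print(notify_downstream("raw.payments"))
# ['clean.payments', 'reports.daily_revenue', 'ml.fraud_features', 'ml.fraud_model']
```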

Ryan: Yeah, I know, that’s a tough one. I mean, the main thing you described is what we talk to a lot of people at Databand about, which is how you can detect data issues earlier and resolve them faster. But one of the things I think you’re highlighting, too, is that people are still trying to catch up and figure out how they’re going to solve this observability problem. It’s still a relatively new space. I’ll give you an example. I was talking to some people at Databand the other day, and I said, hey, check out this LinkedIn post. It was from an engineer at Meta, or Facebook, and he was describing a process that’s very common in data incident management, which was exactly what you just said: something breaks, they don’t know where it is, they’ve got to go find the person who owns the task or owns the pipeline, which could take 4 hours or more, and then you try to figure out, hey, how do we resolve this? And you tell me if I’m wrong on this, but it seems like on the software side you had all this focus on DevOps, and that’s pretty mature now, pretty well understood. A lot of software engineers are moving over to become data engineers, but the data side doesn’t have the rigor that the software engineers had on the software side. So all the things about software quality and software reliability, now they’re trying to do the same thing around data reliability and data quality, but it seems like they’re five years behind the DevOps group. I don’t know if you think that or not.

Sid: I completely agree with that. And, you know, I’ve thought about why that’s the case. The way I look at it is, when you’re building a website, let’s say, I don’t know, LinkedIn or something, you control the UI, so you control the input that’s coming to the servers. The servers have a contract with the database and they write to the database. If something fails, it happens right at that time; the database won’t accept garbage data, typically, and you fix it. And that’s measured by uptime and downtime, right? In the other space, data warehousing, there’s data coming in with a loose contract. There’s a loose bunch of engineering that occurs in the moving parts before it’s written to a target, which might also have loose guidelines. Then you find out the data is bad, it’s broken. And first of all, the company may not look at this as a severe uptime-downtime issue, because they think the consumers of this data are internal users, not the subscribers that pay to use the service. So, you know, we can resolve this a little later; it’s not the focus of operations and reliability. But it’s a huge problem. And now you start trying to figure out, how can I improve reliability in the system? How can I improve accountability, reduce time to restore and time to resolve, take it more seriously, treat the internal customers as being as important as the external customers? That change has happened over time. But usually what happens is that data systems are built with no real investment in them, in the sense that here’s a team being told, go take some data from your database and put it over here, some internal customer asked for it. And you may say, well, you know, to do this right, I need this number of people and this amount of time. And they say, no, no, it’s not that important. It’s not an external customer, it’s an internal customer. Here’s a minimal budget; just build this thing out. Over time, that data footprint grows, the pipelines grow, and the customers become more important and more vocal. They say there are problems with this, and now you have to go back and fix it. But data is already flowing. Some contracts are already in place. You can’t just change a schema; you can’t just put things in place. And that’s a problem afflicting most people in data engineering today: the ultimate problem, which is due to a system and process that wasn’t designed with forethought, or with the importance it should have had at the get-go.

Ryan: Yeah, and that’s a good point, right? I have this conversation a lot. I remember you talking about your experience also being an application performance engineer at one point.

Sid: At Siebel Systems, where I started my career.

Ryan: And you having that experience, and me working on the test automation side, the same questions come up, but now it’s in data. When I was at the test automation company, we were always having conversations about, hey, you have internal applications with internal customers and you have systems of engagement that are external, and both are important. And for some reason, the way these things are built, they get to a period where there’s this rat’s nest of issues that have accumulated, and now you have to figure out, okay, how are we going to make these services more reliable while we’re still pushing out software, or still pumping data through pipelines? It’s like you’re fixing the train as it’s moving, or laying track in front of the train at the same time.

Sid: Right. And, you know, one thing that’s interesting about this is that databases, back in the day, didn’t have strict contracts on the data that was put into them. And over time they said, okay, well, if we ensure bad data doesn’t get into the database, we can limit how many consumers of that data we have to work with, right? The database solved that problem. But the data warehouse and data integration world didn’t solve that problem. They said, oh, we want everyone to be able to push their data, because we don’t know how they’ll use it. So there’s less of an opinion. The schemas are loose, they’re problematic, there are all sorts of issues there. Metadata management is a challenge. Is this data fresh or not fresh? All of these challenges exist there that don’t exist in the database world, because a database will reject data that doesn’t fit a typed schema definition. It would never enter the database, and therefore, as you said, the data incident management team doesn’t have to track down all the consumers of the data and explain to them how they have to change things. But in the data integration world, 100%, this is a big cost.
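
A minimal sketch of the "reject at the door" contract a typed database schema enforces automatically, applied to pipeline ingestion. The schema and records are made up for illustration.

```python
# A typed-schema gate: the kind of contract a database enforces automatically
# and loose data pipelines often skip. Validate before the record enters.
SCHEMA = {
    "payment_id": int,
    "amount_usd": float,
    "status": str,
}

def validate(record: dict) -> list:
    errors = []
    for field, ftype in SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"{field} should be {ftype.__name__}")
    return errors

good = {"payment_id": 42, "amount_usd": 9.99, "status": "COMPLETED"}
bad = {"payment_id": "42", "amount_usd": 9.99}  # wrong type, missing status

for rec in (good, bad):
    errs = validate(rec)
    # Rejecting here means no downstream consumer ever sees the bad record.
    print("accepted" if not errs else f"rejected: {errs}")
```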

Ryan: Yeah, you’re speaking our language, the language of what people talk to us about. Well, one last area I want to talk about before we wrap up is your time over at LinkedIn. The reason I want to talk about LinkedIn is that everyone uses LinkedIn, everyone’s on LinkedIn, and it’s cool to hear from somebody who actually worked on some of the core features in LinkedIn. You’ve also got some cool insights into why Microsoft bought LinkedIn. When you were telling me about it, I was like, yep, that’s what I’m doing now that I just got acquired by IBM.

Sid: I mean, to be honest, I don’t know exactly why they bought it, but I can see the value they would have gotten from LinkedIn. When I worked at LinkedIn, I spent about half my time on the search team, and I owned the feature where, as you start typing, for each letter you type it autocompletes and shows you a set of results. That’s called search typeahead. It’s near real time, it’s super efficient, and it’s federated: as you start typing in characters, it will show you matches in people, companies, universities, job posts, matches for all of that. And of course it uses a relevance algorithm that presents them to you in the most relevant fashion. When I first joined the team and took ownership of that product and that infrastructure, it had a lot of challenges. For one, the first version was built on an open source project that someone on the team had built, but the key part about how data was distributed was that it used not hash partitioning but range partitioning. So when I took that project over, I remember ops having to add new machines every week to handle the growth: as new members joined, new ranges of members had to be indexed. That was kind of nuts. And there were outages, and the outages could be kind of funny. If I connected with you, for example, and I searched for your name, you would show up as my first-degree contact, but for you, you might see me as a second-degree contact. That was very strange, and Jeff, you know, used to send emails asking, why is this happening? There were a lot of reliability issues with this piece of software. Also, at its core it had a couple of lines of code that could corrupt data: when it was writing data out, if there was any sort of issue, like some exception thrown, it would essentially corrupt the data file, and we’d have to reindex all of the data, and reindexing all of the data was extremely painful. At that time we had 250 million users, and we supported what’s called up-to-second-degree matches. You can search for someone and find somebody in your first degree, someone you’re directly connected to, but also your second degree. If you think about how networks are, if there are 250 million users and each of them has, say, ten connections, that’s 2.5 billion edges, and if each of those has ten edges, that’s 25 billion edges in this graph, because it’s second degree. And then you’re looking for sub-millisecond matching. So it’s a challenging problem. What we ended up doing: we were very privileged to have this guy named Shriram come from Facebook. He had worked on graph search at Facebook, the Unicorn paper, and had also spent some time at Google. He came to our team and said, okay, let’s try to improve search and also typeahead. One thing that LinkedIn, and this is circa 2012 to 2014, was really good at was search. Some members of our team had extended Lucene. Back then, Lucene, 3.0 I think was the version, didn’t have real-time indexing. You could send data to Lucene, and every some number of minutes Lucene, which is a search engine library, would spit out a segment file, which meant you could then search for what had been indexed.
Real-time indexing and faceting were not really in Lucene, but some members of my team had developed projects called Zoie and Bobo: Zoie was real-time indexing over Lucene, and Bobo was faceted search. So we knew how to make Lucene better, and we decided to build our new version of the search engine on top of Lucene. We added graph search capabilities. We borrowed ideas like PageRank from Google, and early termination of queries made the whole thing super fast. For index distribution, we used BitTorrent, which was super cool: we had a fleet of search engines, and every time a Hadoop build was done to rebuild an index, we used BitTorrent to distribute it to the entire fleet. It was a pretty exciting experience overall.
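
As a toy illustration of the per-keystroke prefix matching at the heart of typeahead: the production system layered federation, graph distance, and relevance ranking on top, but the kernel is a fast prefix lookup. The name list here is made up.

```python
import bisect

# Toy prefix search over a sorted list of names, one lookup per keystroke.
NAMES = sorted([
    "sid anand", "sidney poitier", "siddhartha", "ryan yackel", "ria patel",
])

def typeahead(prefix: str, limit: int = 3) -> list:
    prefix = prefix.lower()
    # Binary-search for the first candidate >= prefix, then scan while it matches.
    i = bisect.bisect_left(NAMES, prefix)
    out = []
    while i < len(NAMES) and NAMES[i].startswith(prefix) and len(out) < limit:
        out.append(NAMES[i])
        i += 1
    return out

for keystrokes in ("s", "si", "sid"):
    print(keystrokes, "->", typeahead(keystrokes))
```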

Ryan: And you were saying that one of the main use cases of LinkedIn is that it’s basically a company directory for everybody at your company, right? Like you were saying, the majority of name searches are actually directory lookups.

Sid: Yeah. So at that time, and things might have changed, there were two types of search. There’s recruiter search, and I forget what the term is, but essentially it’s very long, complex queries, investigational search or something like that. And then the other side is just name searches: you’re looking for a name, you know the person. You’re not looking for someone who’s a software engineer working in Atlanta; it’s not that type of query. It’s a navigational query. You’re just looking for a name and matching the name. So I worked on the name side, and 60% of all name searches were people looking up people at their same company. LinkedIn was used as a de facto directory search. And that’s why it makes sense for a company like Microsoft, which builds software for enterprises, Active Directory and such, to get value out of something like LinkedIn, where 60% of searches are just people looking up people in their company and what they’re doing.

Ryan: Yeah, and like I was saying earlier before the podcast, that’s exactly what I’m doing right now with IBM acquiring Databand. I’m on a call and there are like ten people on there. Who are these people? So I look them up on LinkedIn. I wouldn’t even know how to look up the internal directory at IBM right now; it’s quicker to go to LinkedIn and just type them in, and there they go. I know their whole history, what they do.

Sid: Yeah, LinkedIn’s great. It tells you who’s joined the team. And in your company, like, you were just acquired, so you connect with those people immediately, and LinkedIn will give you recommendations on other people you should connect with. It sort of knows who you should meet in the company. And similarly, the person who connected with you will get a list of names of additional people from Databand.

Ryan: Yeah, that’s awesome. It’s always good to see the inner workings, you being on that search project and seeing how people actually use it and understanding the background of it. I think it’ll be really cool for the people listening to this podcast. Well, we’re coming up on 50 minutes, and I think we’ve done a really good job getting your background and what you’re doing at Datazoom, all the cool stuff they’re doing over there, and seeing how you’ve taken the lessons learned at PayPal and LinkedIn and Netflix and all these other companies you were at. I think people are really going to get a wealth of knowledge out of how you were able to carry that knowledge from company to company to get to where you are today.

Sid: Well, Ryan, thank you for the opportunity. It was great learning from you as well. Thanks for the book recommendations; I’m going to check them out.

Ryan: Definitely do that. And real quick, how can people connect with you? Do you have like a Substack? You obviously have a LinkedIn, but is there anything else we can connect with you on?

Sid: I think LinkedIn would be a great way; I’m Sid Anand on LinkedIn. And I have a blog you can always subscribe to. I’m also on Twitter; I’ve got a funny Twitter handle that people always ask me about: R39132. There’s a funny story about that I’ll share. When I graduated from college, you know, with big dreams, I joined a really big company, Motorola. I was a chip designer at that time, and my email ID was R39132 at Motorola dot com. And I’m like, what is this, right? And then I’d get emails, and the emails would be from things like R350561. It was just a bunch of robotic numbers for everybody. At that time I was reading Dilbert, and I swore that Dilbert worked at Motorola, because so many things matched. I’m like, if you ever wanted to feel like a cog in a big system, join this company and you’ll be given a number. You’re just a number; your name is nowhere around. It was so insane. It wasn’t until like four years later that they introduced names. The first letter was the sector you were part of, a sector being any division making over a billion. I was part of the semiconductor sector, so R was for semiconductors. But someone in, say, the cellular infrastructure group would be a C or something, and if you were in the government and satellite division, you’d get a different letter, a P, I think. So you’d have these funny things like P5531, and you’re supposed to know who that is.

Ryan: That’s funny.

Sid: Oh, yeah. And so at that time, when you joined, you were given a box of cards, and those cards were to hand out to people at the company so that everyone knew your email address.

Ryan: So you had to match the number to the card, and keep a bunch of cards around to look people up. You had this analog system, but then you had to go through that whole process.

Sid: Yeah. You’d meet someone at a company function: what’s your email? Here’s my card. You needed that card. There was no way to figure it out otherwise; there was no lookup service.

Ryan: That’s funny. It’s almost like they were like, we don’t know about this email thing, so we’re just going to give people random numbers and see if it catches on. It’s so funny to me that it’s like the opposite way of going about doing email. Obviously your name is the best way for people to figure out who you are.

Sid: You know, it worked out. My Gmail is R39132@gmail, and I don’t get spam. They just think I’m a bot.

Ryan: Oh yeah, no, that’s good, right? That’s true. Well, actually, it’s funny, I can’t remember if I got an email from you, maybe it was from your phone or something, but you sent me an email.

Sid: Yeah, that was me.

Ryan: That’s so funny. Well, hey, man, thank you again for being on The MAD Data Podcast. Again, you can find Sid on his blog and LinkedIn, obviously. Check out Datazoom and what they do over there. And Sid, hopefully we’ll do this again. Congrats on your success over at Datazoom.

Sid: And congrats on the whole acquisition and best of luck and hope to catch up with you soon.

Ryan: Thanks man. All right. Take care.