> Episode Details

Hello Big Complexity: Is Your Modern Data Stack Ready?

Nick Schrock, Founder & CEO of Elementl, and Scott Breitenother, Founder of Brooklyn Data Co., discuss the evolution of data from Big Data to Big Complexity – what’s next now that the data industry has solved the problem of data storage? While the modern data stack has become embraced as every data team’s “must-have” to address ‘modern data problems,’ Nick and Scott muse on the struggles that continue to plague data teams and the next wave of potential in data infrastructure innovation. With one problem solved, a new era of possibility and complexity is now unleashed.

About Our Guests

Nick Schrock

Founder & CEO Elementl

Nick Schrock is the founder and CEO of Elementl, the company behind Dagster. Previously, Nick worked at Facebook, where he co-created GraphQL. Nick believes deeply in the power of well-designed developer tools to make engineers more productive, accelerate their careers, make their lives more enjoyable, and transform the organizations in which they work.

 

Twitter: @schrockn

LinkedIn

Scott Breitenother

Founder Brooklyn Data Co.

Scott Breitenother is an investor and advisor who specializes in building data-driven organizations. He currently leads Brooklyn Data Co., a consultancy offering full-stack data and analytics team-as-a-service. He was employee #16 at direct-to-consumer mattress startup Casper and founded the company’s industry-leading Data & Analytics team. In a former life, Scott was a Management Consultant at L.E.K. Consulting (which is probably where he developed his love of frameworks and structure). He has a BS in Business Management from Babson College and an MSc in International Management from London School of Economics. When he’s not blogging about analytics trends at LocallyOptimistic, you can find him walking around Brooklyn with his wife and children.

 

Blog: www.locallyoptimistic.com

Website: https://brooklyndata.co/

LinkedIn: https://www.linkedin.com/in/scottbreitenother/

Episode Transcript

Honor: Hey, Harper, how’s it going?

Harper: Going well. How are you doing today?

Honor: Doing pretty good. Excited to welcome our guests today.

Harper: Yeah, I think it’s a great conversation we have planned today. Really interested to hear what they have to share. You want to go ahead and introduce them?

Honor: Yeah, Scott. Why don’t we start with you? Tell us a little about yourself.

Scott: Well, I mean, first of all, I find it extremely unfair that Nick has this professional setup with his microphone and, like, a key light. I feel like I have to turn my lights up more. I mean, you know, I’m not even working with great raw material to start with, so I’ve really got to do whatever I can. So, yeah, I’m Scott. I’m the founder of Brooklyn Data, and we are a small but rapidly growing consultancy that focuses on building out the modern data stack for companies of all shapes and sizes. We’re a fully distributed team. You know, we’ve been using dbt and Snowflake and all those tools since the early days, and I’m just really excited about all that’s going on in the space.

Honor: Nick, we’re really excited to have you join us as well. Can you tell us a little about yourself?

Nick: Well, apparently the most important fact about me is that I have a new mic, which Scott is very jealous of. So thank you, Scott, for pointing that out. No, I will not do ASMR the whole time. So, I’m Nick Schrock, the CEO of Elementl, which is the company behind Dagster. Very quick background on me: my career was at Facebook. While I was there, I started a team called Product Infrastructure. The goal of that team was to make our application developers more productive and efficient, and we built lots of internal systems. But we also ended up building lots of open-source technologies: React and React Native, which are wildly popular JavaScript frameworks, came out of those groups. And then I personally was the original creator of GraphQL, its tech lead for the first couple of years, and the coauthor of the GraphQL spec. So that was my background. I got into data just by kind of surveying the landscape and asking, you know, what are the most important problems? I was talking to companies, and data infrastructure kept coming up over and over again. And what I found was one of the biggest developer experience dumpster fires that I’ve ever seen in my life. And I am drawn to those like a moth to a flame. And the domain is incredibly important to society, actually. I think the data assets that we all build and observe and maintain are the basis of nearly all decision making, both human decision making and automated decision making. So fast forward, I created Dagster. Dagster is an orchestration platform for the development, production, and observation of data assets. Really looking forward to the conversation.

Scott: I feel like his intro beats mine. I want to redo mine now that I’ve seen what a good intro is, and not just because of the audio quality. I mean, it just took us on a journey. That was great.

Harper: I mean, the mic looked great on camera, but when he really leaned in and got that raspiness out of the Blue White tunes that he lives, I was like,

Scott: Whoa, I felt like he was next to me.

Honor: Well, awesome. I love hearing all that. What really started this conversation was, I want to say maybe two or three months ago, we were discussing the idea of the modern data stack. Is it maybe not as complete as it’s been portrayed? And we talked about, why don’t we bring a few perspectives into the room? Obviously, Nick, you coming from the orchestration angle, and then Scott, you working with analytics and also with a variety of use cases at Brooklyn Data, and then Harper, you coming in from the angle of data observability and how that fits into the modern data stack. So I wanted to first talk really quickly about the history, the evolution of the data lifecycle, and really how we got here. Should we start with you, Scott? Give us your take on how the lifecycle of data got to where we are today.

Scott: Well, OK, really starting with the easy ones, right? That’s a really great question. I mean, I think, you know, everybody’s probably talked about the journey from ETL to ELT a bunch, but to summarize it, the big change was when storage and computational power got so cheap that they were no longer a scarce resource. You could kind of just save and do whatever you want. You didn’t have to worry about compressing or anything. You could optimize for flexibility, for “I’ll save the data to use it later,” not for saving pennies on some sort of infrastructure cost. And so you end up getting to a point where you’ve enabled this whole world of ELT, where you have all the data in your data warehouse and everything at your fingertips. And we’re kind of at a point where you really can do anything you want in this unlimited, elastic cloud compute world, as long as you’re willing to pay for it. So we’re in this world where everybody is migrating to cloud data infrastructure. You’ve got great tools like Snowflake, dbt, and Fivetran that really led the wave of this kind of ELT modern data stack. And to be honest, as an early user of the modern data stack, even when we were starting consulting, we’d have clients that would come to us and say, “Hey, we have a problem, we need attribution or sales reporting,” and we would say, “Actually, back up. You need a modern data stack.” Now people just come to us and say, “Hey, Scott, hey, Brooklyn Data, can you implement the modern data stack?” So I feel like we’ve kind of gone mainstream, in the sense that everybody is converted to this vision. But I don’t know, Nick, do you have a hot take? What do you have on the history of how we got here?

Nick: I mean, I think it’s at best lukewarm, because it’s very similar. I think you correctly identified that the move to the cloud, with fully elastic compute, was a huge underlying reality that changed. That was a tectonic shift in the ecosystem. And, you know, there are often people on Twitter or at conferences saying, “Oh, you’re just reinventing the world that Oracle had or that Informatica had 20 years ago.” And they are correct. The modern data stack, in my view, has kind of two definitions. One is that it’s one set of technologies: ingest, with Fivetran or Meltano, plus a cloud data warehouse, a transform layer on top of that, dbt, plus reverse ETL and a BI tool. Right? You plop those four things, those five things, together, and that’s a modern data stack. I have a broader, more expansive view of the modern data stack: that it is rebuilding the data ecosystem with these new underlying realities in place. Those realities are cloud computing. Those realities are that the world of data is complicated and only getting more complicated, meaning that any company of non-trivial size has maybe 100 different systems of record that they’re integrating, and they are doing very sophisticated things with them. It’s a very challenging engineering problem. So they’re collecting tons of data, there’s this move to the cloud where you can incrementally adopt technology, and then there is infinitely elastic compute. Another framework I like using is that the last fifteen years of data, in my view, have been dominated by the term and movement of Big Data, meaning that there was a huge technical push to solve the problem of efficiently doing computations, and making computations possible, at mega scale. That’s Hadoop, that’s Spark, that’s Snowflake and the cloud data warehouses. That era has come to a close. That problem has been solved.

Scott: Remember the four V’s? Was it, like, volume, velocity, veracity? No, you’re right. I remember when Big Data was hot, everyone was like, you’ve got to solve the four V’s, and that was the big fundamental issue. And now I was just

Nick: Scott, I was just getting going here. I had a whole thing going, and the whole thing’s gone now. So, yeah, I know, it was the big thing. It was this, like, can you actually handle the scale of the compute? That problem is solved, meaning that I think the IPO of Snowflake, Databricks coming in, and all these technologies maturing represent the conquering of the Big Data problem. And the next phase of things is what I call Big Complexity, meaning that the world’s too complicated and we need technologies to simplify things. Now, to get to where the modern data stack comes into play: what I view is that it is kind of the rebuilding of data infrastructure from the ground up, starting at the cloud data warehouse and moving up the stack. And I think it’s like a methodology. So it’s the embrace of cloud technologies, the embrace of managed services, and, as importantly in my view, the embrace of engineering practices to tackle and control that complexity. And I think the first beachhead there was dbt. The entire point of dbt is actually taking analysts and moving them to analytics engineers, right? It doesn’t dumb down the analysts. It empowers them with software engineering tools, which I think is super powerful. And I think that notion of integrating software engineering practices throughout the data ecosystem as the way to manage Big Complexity will dominate the next 10 years of development.

Harper: Absolutely. And I agree with you 100% when it comes to the volume aspect of those V’s kind of being done. We’ve answered the storage problem. We’ve answered the ability for us to manage that aspect of data. I think the velocity, veracity, variety, like the data quality aspect of that, those still exist out there, because everyone has data quality issues. There’s not a good answer that can be abstracted across the entire environment. But I think you make a really good point about dbt coming onto the scene. And I would even say that there’s a strong correlation between the rise of dbt and its adoption and this conversation around the modern data stack. And I find myself having this love-hate relationship with the modern data stack term, because it’s absolutely useful for people to understand how they can get started making good analytical decisions using these commoditized tools. But it really does leave out the aspect of the Big Complexity problem that Nick outlines, right? If you’re only using these commoditized tools, you’re ignoring the fact that there are additional items upstream of what you’re looking at in the analytics engineering world that really affect what you can do in analytics when it comes to bringing that data in. I think an article that everyone’s probably really familiar with is “The Rise of the Data Engineer,” written by Maxime Beauchemin, and one of the quotes that comes out of there that people frequently reference to this day is, like, ultimately, code is the best abstraction there is for software, right? And the modern data stack is trying to give us a way to abstract it in these same low code, no code solutions. And when I talk with people about this, I’m always reminded of the quote from George Santayana that’s like, those who cannot remember the past are condemned to repeat it, right? So the whole conversation on the rise of the data engineer is, OK, we’re taking ETL developers and we’re getting away from these big Informatica enterprise systems, shifting from doing this on bare metal servers to doing it in the cloud computing environment. And by doing that, we have to bring in the complexity that requires custom coding, and take software practices and apply them to the discipline of data. And as we move further and further into this modern data stack conversation, I see some of us losing sight of where we were that caused us to move into this data engineering reality, and we’re going back to that low code, low access through these interfaces. So I just

Nick: Harper, I’ve got to ask you, what’s your definition of low code?

Harper: That’s a good one.

Scott: This is being recorded. It is being recorded.

Harper: Yeah, Eitan, when I say cut, just remember to cut that out, and then we’ll stitch together the words that work best. My definition of low code is that you have the ability to import a library and insert two to three lines of code that then do the computational business logic you expect, to give you an output or an artifact that you would traditionally have to architect yourself. So dbt, for example, is a low code solution for me. You can go in there, you can write some information, and it then gives you the model and the artifacts to reuse over time.

Nick: See, I disagree with that. I do not consider dbt a low code solution. dbt is code. Like, it is. Our typical users who integrate Dagster with dbt are definitely kind of farther along in their infrastructure journey. I would say the median is 600 dbt models, so hundreds of models, each of which is, you know, dozens or hundreds of lines of SQL code. You know, dbt is high code, or I don’t know how it goes, but it’s a normal code environment.

Harper: So let me throw it back and put you on the spot. Let me hear your definition. Like, what’s an example that you would use in this modern data stack?

Nick: Yeah. So in my opinion, low code is typically... good question. Way to get back at me, because I don’t have precise definitions.

Harper: Yeah, now you know how I feel.

Harper: I have time now that I think, I think what actually comes back to you?

Nick: To me, one of the properties of low code is that you’re not interacting with a version control system. So typically, what I view as low code is like Webflow, for example, which is a drag-and-drop tool. The whole notion that you’re part of an engineering process has been abstracted away from you. I consider Excel a low code environment, probably the best one, and the most popular reactive functional programming system ever devised, by the way. So, those types of systems where you can do software engineering-esque tasks, but generally outside of a software engineering environment. So you don’t have, you know, version control, and so on and so forth.

Nick: I would add that there is some sort of syntax that allows you to go slightly custom, but it’s a highly humanized, prose-like syntax and not code. You know, any of those low code tools will have something where, if you want to do something a little custom, you can use their own proprietary human-readable syntax to do it.

Nick: Right, like Trifacta or something or

Harper: So like Google Cloud’s Data Fusion, the Cloud Data Fusion that they have out there. Are you familiar with it?

Nick: I have no idea what it is.

Harper: What that is. OK.

Scott: It sounds very cool, though. Yeah, I mean, I wanted to start with the name that I did a great job,

Harper: But it’s essentially GCP’s drag-and-drop UI that allows you to bring in tasks to do data processing. So it comes down to that low code you talk about, where you don’t have version control that you have to manage. You just have the models that are in the pipelines on your Google Cloud. But you also have the ability to get in there and define your own tasks through their SDK and API. So if that is the definition of low code, where do we see the no code situation? Is there really a distinction between those two, or do we just say it because it kind of rhymes and sounds good when it comes off your tongue?

Nick: I think there is a distinction for sure. You know, to me, no code is like a completely commoditized tool. I think Fivetran is an excellent example of a no code tool that replaces what previously was a custom engineering process with out-of-the-box components that require no real customization. They require a little bit of configuration, but I would consider that a no-code tool. I guess the other question is where the dividing line is. If it’s a drag-and-drop graphical tool like Informatica, is that no code because you’re not typing text? Is the barrier between no code and low code a text box where you can type Python or some DSL? You know, I don’t know. I’m sure there’s some taxonomy out there that some I feel like

Nick: we’re probably debating.

Nick: There’s got to be a VC no code, low code landscape. If there is a wireless as a podcast that

Harper: we’re like in the process of brainstorming an O’Reilly book is what it feels like, right? Yeah. Yeah.

Scott: I mean, on the flip side, dbt to me feels like a framework and an abstraction layer. It lets technical folks write code but not have to worry about some of the underlying infrastructure and dependency management. And that almost aligns with this rise of the analytics engineer, which is a person that can write what I describe as engineering-quality, SQL-based transformations, which means DRY code, well documented, great test coverage. And so I feel like the dbt user and that level of code are very aligned with that analytics engineer persona.

Scott: Is that the type of abstraction layer that you think is going to be necessary to tackle that Big Complexity problem that you talked about, Nick?

Nick: Yes, I think it’s the first of many. And, you know, not to make this entire podcast about dbt, but a lot of our users, we’ve heard the line more than once: what dbt does for SQL is what Dagster does for Python. I didn’t come up with it. They told me, and I was like, do you want to come onboard? We’re hiring in marketing. So yeah, the other thing that dbt gets right is that it embodies the values in Maxime’s post about functional data engineering, about having, you know, side effect-free computations that declaratively say, I produce an immutable chunk of data, and then you can recompute over time and whatnot. I actually think that, in the cloud environment, that is the right way to program these systems. So we also believe in that and are doubling down on that as well. And I think that technical viewpoint and philosophy is almost part of the modern data stack: that you should think of producing immutable data artifacts and allow for memoization, recomputation, and all this sort of stuff. It’s the way data has to be produced. So I think the full embrace of these DevOps-style things, functional data engineering, and high-quality abstractions targeted towards the persona that matters to you will be repeated across a few domains in data infrastructure as time goes on.

Harper: Scott, you mentioned that there’s been a shift that you’ve noticed in your client base: instead of them coming to you like, “Hey, we need to solve this problem,” they literally just come to you saying, “Hey, we want you to implement a modern data stack for us.” When they come to you with this ask, do you think they’re talking about this abstraction layer that is handling the Big Complexity problem? Or do you think they’re looking for something that’s more on the low code side of things, that’s easier for them to manage and easier to get started with quickly, and then just get the answers they’re looking for? Where do you think the general perspective is on the modern data stack, versus the three of us who have a fairly technical view and want the ability to interact with a medium code environment, if you will?

Scott: Yeah, I mean, I think when clients come to us and they’re asking for a modern data stack, they’re asking for, you know, three to four key components. That’s some sort of ingestion tool; typically it’s Fivetran, but sometimes we’re seeing Stitch. A data warehouse, and it’s either Snowflake or BigQuery. dbt for transformations. And then some sort of, I would say, visualization and activation layer, so that’s Looker, Tableau, or Census or Hightouch. It’s almost like they’re prescriptively asking for a specific stack. Like I said before, they used to say, “I need to solve a use case.” I remember three or four years ago, when everybody needed to hire data scientists, and you’d say, “Why do you need to hire data scientists?” “I need to hire data scientists because everybody’s hiring data scientists.” People knew, or thought, it was important, but didn’t actually have a specific thing they needed to solve. I now see a lot of folks saying they know the modern data stack is important and they just want to implement it. And what we do is help them really understand why they need to do it. You know, that’s powering marketing campaigns using reverse ETL. It’s making data-driven decisions. But I actually think, surprisingly enough, when people are asking for something, they kind of have this tool ecosystem in mind. With dbt being so successful, with Snowflake’s IPO going so well, it’s reached a level of trust where even large, publicly traded enterprises feel really comfortable saying, “This is the stack of the future. We just want to build our v1 in this stack,” or “We want to shift a generation or two to this latest stack.”

Honor: So are you saying that instead of an approach where you’re actually solving a use case, you’re being given what they see as already a solution of what will address what they need? And do you think that that is accurate? Do they usually understand what their needs are?

Scott: It’s an interesting process, because before, we used to understand their needs and then make a tailored recommendation. And now they’re coming in and saying, “We need a modern data stack. How can this help us?” We don’t have to educate them on the importance of the modern data stack. We have to help them navigate all the solutions out there, the different options, and how it will impact their business. I remember a couple of years ago, if we were saying Snowflake or dbt, I had to actually educate and sell these technologies. You know, like, “Well, dbt, it’s a small company based out of Philadelphia. They do a little bit of consulting, and they’ve got this really great open-source framework that’s getting a lot of momentum. And then Snowflake, it’s a rapidly growing SaaS company, you know, Series D value.” And now they’ve gone mainstream, and people universally accept that these are great tools. So it’s been a huge shift in the last year and a half.

Harper: Nick, do you think there’s any risk or downside to that shift in that conversation where people are coming to Scott’s company and saying, Oh, this is the prescriptive data stack that we need as opposed to coming and saying, like, can you help us solve creating a modern data architecture?

Nick: I don’t know. Scott, feel free to jump in and correct me if I’m wrong, but I feel like the clients that Scott works with have very fundamental problems. It’s like, we have a business, we have tools we integrate with, and we cannot count things. We cannot compute basic business metrics efficiently and accurately. And I have read enough case studies, and know from the ether, that if I want to count things accurately and understand my user engagement, revenue, and those kinds of basic analytical tasks, there are like a thousand other companies that have used this prescribed stack and solved that problem. So for me, when someone goes to Scott and is like, “I need the modern data stack,” they are implicitly communicating that, hey, we are at level one of Maslow’s hierarchy of data. We can’t really count things. We don’t understand our core business metrics, and we want to be able to do that and get on the latest and greatest stuff. And I know lots of other people have been able to successfully do that. Scott, is that a fair representation?

Scott: I think it’s a spectrum. We’ve got the small companies that just want to count things, and then we’ve got the larger companies that are building complex internal efforts to support complex internal data use cases or, you know, complex external data use cases. I think the one difference is that now, no matter where people are on the spectrum, they already have a good idea of the tools they want to use. So it may be the “I want to count things” small, early-stage startups saying, “Hey, we want this modern data stack,” but we also have clients with fairly large, complex use cases coming in and saying, “Yeah, you know, we’ve been seeing a lot about Snowflake and we’ve heard a lot about dbt. We want to kick the tires to see if it works for us.” Even on the more complex side of things, yes, obviously it’s more custom, but they’re already coming in with a perception of the tools they need. They come with the short list already, as opposed to coming with a problem. The “count things” ones might come and say, “We want this exact data stack,” or this short list. The more complex ones would be like, “We know we have to level up our stack, and here’s our short list.” I just think the big shift is that people already come with the short list now.

Nick: Yeah. Speaking of that, how many people come with a short list that includes a reverse ETL or operational analytics tool these days? Out of your next 50 customers, say, how many would you estimate come with, or end up being recommended, an operational analytics tool?

Scott: I would say one out of 10 will come with it. These days we’ll implement reverse ETL probably 20 to 25 percent of the time. And I imagine in about a year we’ll be implementing it about 50 percent of the time.

Nick: Super interesting. Yeah, because that really does change things. You know, I was maybe a little dismissive about the “count things” part, but I think adding a reverse ETL tool really changes the fundamental dynamic, where this is no longer just a place to count things and produce dashboards, but a way to interconnect all your systems of record in a pretty novel way. I think it’s a really interesting space.

Scott: And I think it’s interesting because the first and most straightforward use case for these reverse ETL tools like Census and Hightouch was marketing. You know, it’s always been a pain in the butt to get data into your CRM and use it. And now you’re pushing data into product analytics tools like Amplitude or Mixpanel. Let’s take the data that we’ve got there and push it to a specialized analytical tool. This product analytics tool that was on its own island before is now getting integrated. And so, to your point about complexity from earlier, Nick, I think even the simplest stacks are actually getting more complex. You’ve got this beating heart of the cloud data warehouse at the center of the modern data stack. Before, people just used to put some sort of monolithic BI tool on it. But now people might have a primary BI tool like Looker or Tableau, but they might also have Hex for notebook-style visualizations. You might be pushing the data into your email marketing platform. You might be pushing it into your product analytics. So there are so many more downstream use cases now in this modern data stack. I wouldn’t call it fragmentation, but I guess it’s specialization. And it’s very interesting. I’m no longer just building for Looker or one monolithic BI tool use case anymore. And then, on top of that, when you start to have those use cases, you actually have real SLAs. You know, if I’m slow on the ETL or something breaks, Looker dashboards don’t load, or maybe it’s stale data. Now we’re getting to the point where, if my data pipelines don’t run or my dbt jobs don’t run, my email marketing doesn’t send, or my retargeting segment doesn’t get repopulated in Facebook, or my universal suppression list doesn’t get updated when someone opts out of marketing communications. So for me, it’s great, because I help people build modern data stacks, and people are now getting more value out of modern data stacks. But it’s scary, because I don’t think companies have historically built the teams or invested in the tooling and the infrastructure to run production workloads on a modern data stack. And so you’re getting these scary situations where people are pushing some code into a dbt model and might not realize that it’s actually going to accidentally push 10,000 German subscribers into the American list that have opted in. It’s a little bit of an anxiety-ridden, scary place right now for some of these.

Nick: Right? I mean, you move beyond just analytics and machine learning, you know, it’s no longer just about reporting. It’s about driving things, mission critical business processes and business applications.

Scott: Yeah, 100 percent. And I like that. I mean, that’s where I feel like tools like, you know, yours come in. I’m kind of struggling right now when I’m building these modern data stacks to have the tooling to orchestrate from end to end, to observe what’s going on, and to detect these things before they happen. And I feel like we’re going to see

Nick: If only you knew people at companies who work on those problems, Scott.

Scott: Yes. It’s like, you know, it’s like, these are the interesting problems. This is what you got to solve. Help me.

Honor: Nick, while we have you here, I’ve been wanting to ask you about your most recent talk, your Open Source Data Stack Conference talk on your core beliefs about what’s happening with the future of data. Tying it back to what you mentioned earlier about the Big Complexity, can you recap for us what you shared in that talk?

Nick: Oh God. I mean, I don’t remember exactly what I said, but I can probably approximate it. I guess it was what I was speaking to before: I view the modern data stack more as an approach and a methodology than a prescribed set of technologies or technology categories. So just as an example, two years ago, like Scott said, an ingest tool, a BI tool, and a transform tool on the cloud data warehouse was a modern data stack, and now we’ve added reverse ETL as a new first-class citizen. And I think we’ll be adding more and more first-class citizens to the modern data stack, because what’s going to happen is that someone like Scott, say Brooklyn, solves the first set of data problems, and then those companies will outgrow those sets and have new problems to solve. Scott’s team will receive a request from a client that says, like, I need to solve X, and they’re going to have to find a tool to solve that. They’re going to find a tool that works well with the rest of that stack, both in terms of having integrations built, but also, kind of philosophically, it fits. And so what I really think is that we’re seeing the rebuilding of data infrastructure along new dimensions with new principles and new constraints. And I think at the core, applying software engineering principles to data, where appropriate, is going to be the future, right? Even Tristan from dbt gave a great talk recently about effectively bringing the lessons of DevOps to data: never click a button, only deploy things with code. Right. His argument was explicitly not no-code. It was that you should have code to determine your infrastructure and whatnot. So, you know, I think that the modern data stack is a set of technologies. The way I frame it is you apply that stack to build data platforms at companies.
That data platform is responsible for the management of all the data assets at a company, and it needs to be driven by engineering principles. Making engineers more productive and the entire organization more productive is going to be the only way to manage these things going forward.

Honor: Harper, where do you think observability sits in this rebuilding of infrastructure, or this new infrastructure that Nick is talking about?

Harper: Yeah, I tend to talk about this in the good old sports analogies that I enjoy. I can add to that my love of baseball, right? I don’t really care about the sport on the field, if I’m honest. I don’t care who wins the World Series. I love the numbers that are going on there. I love the outcomes that come out of that and being able to predict what’s going to occur, to understand the fabric that exists. And that’s the way that I see observability looking at the modern data stack and the retooling that Nick and Scott have talked about here. We have these different tools coming into play that are addressing the different use cases. But as we continue to see that increase in downstream use cases that Scott and Nick talked about earlier, you have more and more stakeholders that are invested in the data that’s coming in and what that data is going to be telling them. Because if you have a client coming to you saying, can you solve X, and you say, cool, here’s my solution for X, and then three months later they say, actually, X is solving Y, so what do I do now? If you don’t have a way to understand what’s going on there, it’s difficult to do. And we’ve seen the orchestration layer come into play: we’ve seen cron jobs evolve into Airflow. We have Dagster coming into play here that’s doing for Python what dbt did for SQL, right? And then also giving us the ability to be more data-aware as you move through the process, and connecting those tools that are addressing those use cases. But the next step that I see is what happens in those interfaces between those tools, what happens in the fabric that’s covering those interfaces. Being able to see what’s happening to your data in near real time is going to be extremely valuable when it comes to resolving an issue.
That applies whenever your solution for X is now solving for Y. And, you know, at Databand, what we are focusing on is the actual movement of data in those pipelines. Understanding, when data comes in, how does it match the profile that you expect? How does it actually affect the data that’s already in your lake or in your warehouse? And then understanding, when those changes occur, how you can resolve those, right? There’s lakeFS, which is an interesting part here, where you get into data versioning, and being able to roll that back quickly is a really interesting use case that I think about a lot. But that’s where I see data observability fitting into the modern data stack: understanding why and when something went wrong, and then being able to quickly get to that root cause and roll back the change that caused the breaking change, which was the issue with your dashboard. It’s that same mentality that Tristan talked about, where it’s taking DevOps principles and finding ways to do that in a way that still empowers your software engineer to own that, so it’s not something that’s abstracted away from them that they can’t manage themselves, if that makes sense. Thoughts on this, Scott, Nick? I know you’re not specifically in the observability space, but it’s such a nascent space that it’s fun to hear everyone’s opinion about what’s going on. There are different approaches, proactive and reactive. And so what’s valuable to you all? I’ll go with Scott.
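
Harper’s question, “how does it match the profile that you expect?”, can be made concrete with a small check on each incoming batch. This is a minimal sketch under stated assumptions, not how Databand or any other product works: the `profile` and `check_profile` functions, the choice of row count and null rate as the profile, and the default thresholds are all hypothetical.

```python
def profile(rows, column):
    """Summarize a batch of dict rows: row count and null rate for one column."""
    nulls = sum(1 for r in rows if r.get(column) is None)
    null_rate = nulls / len(rows) if rows else 0.0
    return {"row_count": len(rows), "null_rate": null_rate}


def check_profile(actual, expected_rows, tolerance=0.5, max_null_rate=0.05):
    """Return the names of the profile metrics that fall outside expectations."""
    alerts = []
    # Row count should land within +/- tolerance of the expected volume.
    low = expected_rows * (1 - tolerance)
    high = expected_rows * (1 + tolerance)
    if not low <= actual["row_count"] <= high:
        alerts.append("row_count")
    # A jump in nulls usually points at an upstream schema or logic change.
    if actual["null_rate"] > max_null_rate:
        alerts.append("null_rate")
    return alerts
```

Running a check like this as data lands, rather than after the dashboard breaks, is one way to read the “as far upstream as possible” idea that comes up later in the conversation.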

Scott: I mean, I might not be in the observability space, but I’m in the “my phone rings when something’s broken” space, and so that is the outcome of the... pretty, pretty related and, you know, complementary to the observability space. Yeah. I mean, I think it’s really interesting because a lot of the folks that I see at the clients we work with in the analytics engineering space are not historically back-end engineers that have transitioned to data. They’re kind of analysts that found this part of themselves. And it’s like, oh, wait, there’s this role called analytics engineer that has all the great stuff I like, but not the stuff that I don’t like. And so they don’t have backgrounds in a lot of the observability and CI/CD that we typically find in engineering best practices. And so I see that all this stuff is universally applicable and helpful. When we add observability and CI/CD into the modern data stack, everybody kind of gets it and sees the benefit. But I think the folks that are playing in the modern data stack, the analysts and analytics engineers, don’t know this is an option. They didn’t even know it’s available. And so they’re used to this world of getting emails when things are broken, of finding out the pipeline is broken because the CEO sent an email because the sales numbers are wrong in the morning report. I think we all need to, and I’m just as guilty of this, we need to be educated because this is new to us in the modern data stack. So I’m really excited about it. I just think that there needs to be a lot of education and training of the folks that are working in the modern data stack. Nick, what do you think?

Nick: I don’t know if I have strong thoughts on this. I think the interesting tension is going to be, and I think we’re going to see this too, because we have some integrated data observability capabilities in the orchestration layer in Dagster. It kind of makes sense to provide a minimum layer because, in our view, in the end, the purpose of these systems is to produce data assets, so you should be aware of that. But there’s also going to be a space for cross-cutting tools, because not everyone’s going to move to Dagster immediately and people will have specialized capabilities in the observability layer. So I think there’s a struggle that continues to play out in data: people are building best-of-breed solutions for everything, but then when they do that, they end up having to integrate ten tools in order to get a heartbeat going. And that’s rough. But then people also don’t want to be locked into a monolithic system. So I come at it more from a vendor standpoint. I want to provide some out-of-the-box observability capabilities so that people can have an easy button for that, but then also appropriate pluggability layers so that if someone wants a more specialized solution, or needs one across a bunch of systems, we can plug into that. Figuring out the balance there, I think, is kind of interesting.

Harper: Yeah, I think that that we know

Scott: My easy button. Yeah, I keep telling Nick, I want my easy button.

Nick: It’s in your email.

Scott: Yeah, I know, there’s literally an email in my inbox from Nick that’s like, “Here’s the easy button. Watch this video.” And it’s on my to-do list. But seriously, both of you hit the nail on the head. When we set up these clients, it’s like five tools on a cron: nine o’clock, 9:10, 9:17, 9:30, and it’s just spaced out enough that, hopefully, you know, it’s fine. Or I have some clients where the dbt model will run multiple times just to catch something. It’s funny: everybody suddenly has access to all these great tools, but they’re not linking them together. They are coexisting, like when you have two young kids. They don’t play together. They play next to each other. And when you’re in an entry-level modern data stack, you have five tools. They don’t play with each other. They play next to each other. And so we’re in this world where, if something breaks, it’s hard to replay. It’s hard to know where it broke. We’re looking for smoke signals like dbt test failures or Looker dashboards that smell funny. That’s the space I feel I’m in at the moment, right?
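
The contrast Scott draws, five tools staggered on a cron versus tools that actually know about each other, comes down to declaring dependencies instead of guessing at timing. Below is a minimal sketch using only the Python standard library (`graphlib`, available from Python 3.9); it is not Dagster’s or Airflow’s API, and the `run_pipeline` function and the step names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, dependencies):
    """Run steps in dependency order and stop at the first failure.

    steps: maps a step name to a zero-argument callable.
    dependencies: maps a step name to the set of steps it depends on.
    Returns (completed_step_names, error_message_or_None).
    """
    # static_order() yields every step only after all of its upstream steps.
    order = list(TopologicalSorter(dependencies).static_order())
    completed = []
    for name in order:
        try:
            steps[name]()
        except Exception as exc:
            # Downstream steps never start, and we know exactly where it broke.
            return completed, f"{name} failed: {exc}"
        completed.append(name)
    return completed, None
```

With `dependencies = {"dbt_run": {"ingest"}, "dbt_test": {"dbt_run"}, "reverse_etl": {"dbt_test"}}`, a failing `dbt_test` step means `reverse_etl` never runs, which is the guarantee a staggered cron schedule cannot give.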

Harper: And creating that fabric and mesh that occurs between all of these tools is where there’s this nice interplay between the orchestration layer and observability, because it makes absolute sense that orchestration is going to address observability to a certain point, in exactly the same way that we have all of these tools that we’ve referred to in the modern data stack, because there’s value in having a best-of-breed tool that addresses every type of use case that you’re looking for, every struggle that you’re working with. And for me, that’s what’s fun about specifically focusing on observability: it’s such a nascent field and conversation that we’re still working out the right way to do that. And I think it’s really important that you focus in as far upstream in your data lifecycle as possible, because if you can identify issues early in the process, it eliminates that CEO email that says, “Hey, my dashboard is broken,” right?

Scott: And it’s the worst email in the world to get. I feel like, of all the people who could be telling me my dashboard is broken, this hurts the most.

Harper: Yeah, I feel like that should be the North Star for all of data observability, right? Eliminate the CEO email, eliminate the on-call data rotation, right? Because if you can do that, then you’ve got a trillion-dollar product at that point in time.

Nick: So Harper, I really like the phrase you just used, and just to get it right: that the orchestration should be aware of these concepts. Was that the word you used?

Harper: Yeah, I think that there’s a good interplay: orchestration should be aware of the movement of data that’s going through, but there is still a need for a vendor to address the understanding of why these things have occurred and to present that in a way that makes it actionable. That’s the ultimate outcome of a good data observability tool, in my opinion: not just telling you where it occurred or when it occurred, but making it actionable for you to address that problem.

Nick: Yeah, no, the awareness piece is good, because I think you’ve actually provided me a tool to explain our relationship to the other domains of data in a way that accurately communicates that we’re not trying to eat every other tool, right? For example, we want our tool to be aware of the concept of data quality as a different domain, so that a tool like Great Expectations or something can plug into that and be like, oh, this is where you put your data quality tests in the orchestration layer. But we’re totally agnostic as to how those data quality tests are expressed because, yeah, Great Expectations has a superb DSL, and we don’t want everyone to have to use that. If they have their own thing, that’s cool. In the same way, we want to be aware of data observability so that there’s a place for observability tools to plug in and get our metadata exhaust so they can build interesting and novel things. So yeah, being aware of these other domains of tools above us in the stack, I think, is where we want to be.

Nick: That’s the fun part about all of this, like really being open source and finding ways for us to communicate together, because having those ports available to everyone around you gives you the ability to enable that awareness and then act upon that communication between the different tooling.

Honor: Awesome. Well, this was so fun chatting with you guys. We’re coming up on time, but I do want to ask for one call to action, maybe, from all of you. If we were to educate our space about this rebuilding of infrastructure, what would you offer as a piece of advice on how to best equip data teams? Start with Scott.

Scott: Oh, brutal! Oh, man. Where did you say to start? You said, start with Harper, right? Sorry about that. I thought... oh, I can.

Harper: I can fall on the ceremonial sword here, if you so choose, Scott.

Scott: I welcome the sword falling. That would be very much appreciated. Give me 30 more seconds to think of some knowledge bombs.

Harper: Yeah, absolutely. Let me pull up my six-dollar words here to try and confuse people so they don’t quote me on what I’m saying. I mean, I think the last point that we were talking about, the awareness aspect, where you’re creating the ability for these tools to work together, is going to be the best way for the data industry as a whole to address this Big Complexity problem. So if there’s one call to action that I would give, it’s: don’t let capitalism get in the way of innovation. It’s very important that we find a way to grow our careers and grow our businesses, and that we are successful and fulfilled. But at the same time, it’s very important that we keep in mind that we’re trying to serve a community as a whole. And if we lose sight of that, then we won’t actually help evolve the data management lifecycle.

Honor: That’s noble. I love it.

Nick: Data workers of the world, unite. You heard it here first. Wow. We went there. Now, I think the question was, what advice would you give data teams navigating this new world? Is that a fair paraphrase of it?

Honor: Yeah, pretty much.

Nick: Yeah, I would say that a little upfront investment and thought about how you are going to structure things can pay dividends, not just down the road, but tomorrow. I mean, if you actually think of your system as your own little platform, and you think about, hey, how does a stakeholder interact with it, and just do a little upfront planning while still being agile, it’ll pay enormous dividends. Yeah. Yes, there is an easy button, but easy grab.

Scott: It’s in my inbox.

Nick: Email, right, right, right. But the easy button before the easy button, you know: there is this prescribed stack, but that prescribed stack is the result of a lot of thought and care. So I would just say, think ahead just a little bit before scratching stuff together, because it might save you a lot of pain, not just in the long term, but right away.

Scott: Yeah. So advice... I’ve got a few, actually. I mean, I think one thing that I always kind of challenge my team on is... Now, you gave me plenty of time. I’ve had the most time, so now’s the pressure. You know, we have all these great tools to handle all sorts of instability in data. The data stack is inherently a downstream process. Everything we build, we live or die by what happens upstream of us, and all the observability and the infrastructure and the robustness that we’re building in is to handle all the things that could come down into our data machine. I think, yes, we want to build robust, fault-tolerant data pipelines, but I think sometimes people actually forget that you can go down the hall, knock on the door of the production engineers, the folks building those systems, and actually give them feedback too, and say, hey, could you change this? Say we’re doing some sort of complex logic to infer a relationship, but actually it’s an existing relationship. They just don’t put the foreign key in the table. It’s like, hey, can you do a one-time backfill and then put the foreign key in this table moving forward? And that makes things much more robust. And then I would just abstract that to: generally validate that the problem you’re trying to solve is a problem, and the constraint you’re optimizing and operating around is actually a constraint.

Honor: I love that. That was great. That was so fun!

Scott: Harper and Nick gave me time.

Nick: Scott, you know, there’s a good line about that constraint problem from Elon, actually. I saw him interviewed and he was like, at SpaceX, “We never say that, oh, this team insists on having this constraint or this process in place. It always is a human’s name, so that you know that there’s a person to talk to.” Because then you realize there’s no human. There was actually some intern two years ago who put that in, and then no one can defend it anymore, you know? So always attached.

Scott: He got that from me.

Nick: Oh yeah 

Harper: It’s the inspiration for the git blame command.

Nick: Well played. So well played sir.

Nick: And now I gotta get back to my inbox and check my easy button.

Honor: Thank you both so much for coming on. We had such a great time. And hopefully, I’m sure everyone has learned something from this recording. So thank you again. Have a great week. Take care. Bye bye.

Nick: Yeah, this is so fun.

Nick: Thanks for having me.

Nick: Okay, everybody. Yeah.

 

 

Additional related links:

The Data Supply Chain: First-Mile Reliability

End-To-End Observability Goes Beyond Your Warehouse
