Episode Transcript
Honor Hey, Harper, what’s up, how’s it going?
Harper Hey, I'm just enjoying the last few rays of sun here before a Texas winter settles in, trying to get the day-to-day grind rolling. How's everything going with you? Are you excited for the guest we have today?
Honor Yes, I am super excited. We have Sudhir from Google. Sudhir can you tell us a little about yourself?
Sudhir Hello, everyone, I'm so happy to be here. I'm the senior director of product management for all of our data analytics services at Google Cloud. So everything from BigQuery, which is our cloud-scale data warehousing solution, to all the data processing services and messaging services, cataloging, data quality, governance, management, all of that area. That's what I focus on. I've been with Google for almost four years now.
Honor Awesome. Well, I'm really happy to have us in the room together today. A favorite topic of ours in this space, as always, is the modern data stack, and I'm really glad to have you here to give us a little more perspective on how Google sees this. But maybe we can start with: how do you personally define the modern data stack?
Sudhir I think that's a great question, and there's a standard definition that the industry uses. I have great relationships with George at Fivetran, and normally when you talk about the modern data stack, it is Fivetran, or some kind of extract-load tooling like Fivetran, something that can load the data into a cloud-native data warehouse like BigQuery. Then you use something to transform the data, whether that's dbt or the company that we bought a year back called Dataform, but some kind of transformation engine that runs on top of the cloud data warehouse. And then you have the visualization layer, something like Looker, for getting value and building applications. I actually think about it a little more broadly than that. I started coining this term, the cloud-native modern data stack. What I mean by that is more and more organizations are moving towards being real-time enterprises. If you look at most of the large customers that we work with or partner with, they're focused on providing real-time, personalized experiences for their customers. And when you start looking at that kind of use case, extract and load is great when you are working with SaaS applications, and this is where a lot of these organizations started. But when you look at the real-time events coming from somebody on the website trying to click through things, or some person trying to buy stuff in the store, and you want to go out and provide them with a better experience, you do get these real-time events, and you have to process them in real time and then make the decisions, whether it's a fraud alert or a personalized offer or something like that. When you're looking at that, the modern data stack in my world actually shifts to focus more on how you collect these events, process them in real time, and make real-time decisions. And then of course they have to land into a cloud data warehouse like BigQuery for further analysis and all. But I look at it more broadly than just limiting it to the few use cases that may be common knowledge in the industry. I do think real time is important, and how you make the real-time decision becomes more interesting in the future.
Honor That's a really great point that we are moving in that direction. What kind of impact are you seeing in terms of what Google customers are going to want in product design as we move towards more of a real-time need for alerts, identifying data issues, and that entire host of needs?
Sudhir Yeah, I think that's a great question. So if you take a step back and go back to what I was saying, you get real-time events coming in. You collect them in one of these systems, whether it's Kafka or some comparable service, to process these large-scale events, and then you process them in some technology, which could be something like Dataflow, which is our cloud-native service for doing streaming analytics, or Spark Streaming if you want to use one of our partners like Databricks. Then you're making real-time decisions, so you would need a serving layer, something like HBase or Bigtable. You may want to then move that into BigQuery for creating machine learning models, and then you are going to go ahead and leverage those models for giving personalized experiences. Now imagine that environment and the complexity of the environment, right? The most important thing is, when data moves across systems, you want to make sure, one, the quality of data is high, and second, any kind of data failure in the system is caught really early, because you don't want bad data propagating into all the different systems. So I do believe one of the key design principles we have to figure out, and where the industry is actually going through a lot of innovation, is the quality of data moving across systems. For data in motion, how do we make sure that you have high quality? I think that is one of the biggest challenges for organizations, because one wrong or one misaligned dataset going into one of these systems can actually cause bad experiences for customers, and nobody wants that.
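To make the streaming path Sudhir just sketched more concrete, here is a minimal Apache Beam sketch of that shape: events come off a message bus, get a quality check in flight, clean records land in BigQuery, and bad records are routed aside instead of propagating downstream. The project, topic, table, and field names are hypothetical, not from the episode, and the BigQuery table is assumed to already exist.

```python
# Minimal sketch of the streaming path described above. All names (project,
# topic, table, required fields) are hypothetical; the BigQuery table is
# assumed to already exist with a matching schema.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

VALID, INVALID = "valid", "invalid"


def validate(raw: bytes):
    """Tag each incoming event as valid or invalid instead of letting bad data flow on."""
    try:
        event = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        yield beam.pvalue.TaggedOutput(INVALID, raw)
        return
    if isinstance(event, dict) and "user_id" in event and "event_ts" in event:
        yield event  # goes to the main (valid) output
    else:
        yield beam.pvalue.TaggedOutput(INVALID, raw)


def run():
    opts = PipelineOptions(streaming=True)
    with beam.Pipeline(options=opts) as p:
        events = (
            p
            | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-proj/topics/clicks")
            | "Validate" >> beam.FlatMap(validate).with_outputs(INVALID, main=VALID)
        )
        # Clean events land in the warehouse for further analysis.
        events[VALID] | "ToBigQuery" >> beam.io.WriteToBigQuery(
            "my-proj:analytics.click_events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
        # Bad events go to a dead-letter topic so failures surface early.
        events[INVALID] | "DeadLetter" >> beam.io.WriteToPubSub(
            topic="projects/my-proj/topics/clicks-dead-letter"
        )


if __name__ == "__main__":
    run()
```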
Honor Yeah, for sure. And I feel like, Josh, this is precisely the area that you frequently speak about: data quality for data in motion, and how we prevent these issues from cascading by catching them at the source. So in the context of what Sudhir is talking about, this real-time need across all the different processes we're seeing, what are the implications for where data observability sits and how it will pan out?
Josh Yeah, well, first of all, just on the definition of the data stack, I really agree with how Sudhir reframed this, which is that there really isn't a one-size-fits-all solution for every organization out there. There are multiple different modern data stacks that are going to emerge in the market depending on different segmentations of companies. Sudhir pointed out one interesting dichotomy, which is organizations that require real-time use cases, which clearly are not well covered within that standard stack of tools talked about within the modern stack today, and require heavier firepower from solutions like Kafka and other technologies. And that creates a level of complexity which will take you in some ways out of the structured data warehouse and include some level of data lake or staging environment work, where a lot of raw data can be pushed and then transformed in some way before it gets loaded into the warehouse in a more structured form. The way that we often talk about this dichotomy of different flavors of the modern stack is between teams that are more analytics-weighted versus teams that are more engineering-weighted. Now, of course, every data team is going to be doing some balance of these activities, but in certain organizations there are just a lot more requirements for tools with larger amounts of functionality that help structured, unstructured, and semi-structured data get into the environment in the first place, which will pull the team towards a heavier engineering posture using tools like Python or Databricks up front. What we often see within our user base as well, for example, are teams that work with lots and lots of data sources. The more data sources you have, the more likely you're going to be self-managing the ingestion for those data sources, and all of your data providers aren't going to be covered by tools like Fivetran or HVR. This also pulls data teams into more engineering-oriented activity and more Python-oriented activity. So this is another way that we look at that dichotomy. This also creates a difference in posture to observability. If you imagine the flow of data looking like a triangle, coming from a single source or just a few sources, with all the activity burgeoning out towards the analytics end, that's going to tend to be where your complexity is, and you're going to focus a lot of your observability attention over there, on how people are using the simpler data or the fewer sources of data that you're pulling into your stack. If you have more complexity at the initial stages, so there's more thought into how you get real-time data and/or how you get lots and lots of data sources, and your data flow looks a little more like an hourglass as opposed to a triangle, the requirements for observability really get shifted to the left. They get shifted forward in the data flow, and it becomes more important to understand where issues are coming from at the source. The reason is that if you're only looking at the tables or the dashboards downstream, your team will still be spending a lot of time trying to find where those issues are coming from. You might be able to catch a problem, but tracing it back to the source so you can quickly debug or solve the issue becomes really, really complicated. So that's one of the ways that we look at this difference in observability, depending on the different stacks that a team might be using.
Sudhir Just to add to that, Josh, I love the simplicity of the modern data stack around extract, load, and then transform, which is great especially when, as you were mentioning, you have a bunch of SaaS applications and some standard databases. I think it's easier to go ahead and run with that. But when I work with large enterprises, their environments are super complicated. You have mainframes; we work with one of the largest retailers, and some of their most critical data was actually still in mainframes, and we had to integrate the mainframe data, where the formats are not clear and all of that. And then there are other applications which are enterprise applications, and it's great to see the acquisition by Fivetran, wherein now they are entering into the enterprise space. But those kinds of enterprise applications, not just SAP but Oracle and various other applications, have a very different posture. You need to go to APIs to call them and you have to process them and all of that. And then, as I said, there is the real-time aspect, where a lot of custom software is being developed for e-commerce sites and various other systems that are being built. How do you handle that data? And the sooner you need to make decisions and power real-time applications, the more interesting it becomes that, as you said, processing is moving towards the left, to the data engineering side instead of analytics engineering. It's amazing that we have coined those two terms now. But yeah, you're right, I think there will be organizations that will be heavier on data engineering, running things through all of these complex environments, doing processing, making the data ready, and then finally going into the analytics layer. And there will be organizations that will be moving data directly into the analytics layer and doing analytics engineering on it. Either way, the way we think about it is that it doesn't matter whether you're doing Spark processing versus SQL-based processing in BigQuery. We want to have a single unified storage tier that you can just move around. Some of our largest customers, like Twitter, move all of their datasets, hundreds of petabytes, into BigQuery storage, and then they run Spark processing on top of that, they run Dataflow with stream processing on top of the same data assets, and they also run SQL scheduled queries and all of that. So you see different engines sitting on top of the common unified storage tier. But then the complexity of how you manage the environment, and making sure, as you were talking about with observability, that these pipelines are running seamlessly across these technologies, becomes a big challenge for customers.
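As a rough illustration of the "different engines on one unified storage tier" pattern Sudhir describes, the sketch below reads the same BigQuery table once with Spark (via the spark-bigquery connector, for example on Dataproc) and once with plain SQL through the BigQuery client. Project, dataset, table, and column names are hypothetical.

```python
# Two engines over the same BigQuery table: Spark for the data-engineering
# persona, SQL for the analyst persona. Table and column names are hypothetical;
# assumes the spark-bigquery connector is on the classpath (e.g. Dataproc) and
# google-cloud-bigquery is installed with default credentials available.
from pyspark.sql import SparkSession
from google.cloud import bigquery

TABLE = "my-proj.analytics.click_events"

# Engine 1: Spark, reading BigQuery storage through the connector.
spark = SparkSession.builder.appName("unified-storage-demo").getOrCreate()
clicks = spark.read.format("bigquery").option("table", TABLE).load()
clicks.groupBy("event_date").count().show()

# Engine 2: BigQuery SQL over the very same storage, no copy in between.
bq = bigquery.Client()
rows = bq.query(
    f"SELECT event_date, COUNT(*) AS events FROM `{TABLE}` GROUP BY event_date"
).result()
for row in rows:
    print(row.event_date, row.events)
```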
Harper I think one of the things that's really exciting about having the conversation shift back toward the left-hand side of the pipeline is that we're seeing the effects of the problems that existed in the data space for a long time. The modern data stack, in my opinion, came about from the biggest pain point that data teams had when they were on prem: it was always the velocity to deliver. And so you see these products coming out, and they were commoditizing the ability to provide analytics engineering. It's great that we now have these categories of extract-load tools and these categories of transformation tools, and it really has done great work for getting more companies in front of their data, allowing them to analyze that data, make better decisions, and overall just increasing the data literacy of the industry as a whole. However, as an engineer, I want to get into those APIs, right? Like, I want to get into the complexity that exists when you shift left up into the source data that's coming in, and it's hard to really abstract that into a commodity that's going to work for every single enterprise. And it's fun for me to sit here and wax poetic about the evolution of this, right? You saw something similar occur when the data warehouse came out and people were finding ways to create these ETL tools, whether it be Informatica or SSIS, and then as cloud native came out, you had GCP provide these commoditized services that allowed software engineering to occur in the cloud and made it easier. I think that's what has given the template for this modern data stack to come out. But now that we've reached this point where it's matured a little bit, and we understand the space a little bit more, we're recognizing that we aren't addressing all of the use cases that exist out there. And the other thing that gets left behind when you focus on speed of delivery is quality, like we've touched on. If a report is updated on a daily basis and the team is able to react quickly to the questions that are asked, people tend to trust that report and they won't ask you any questions. But as soon as it breaks, they say, oh well, why did it break? And only then do you start thinking about quality. If we can remove that question and concern of "why did it break, why are these numbers wrong" by addressing data quality as soon as it comes into your system, that's going to make everybody's life easier, and we're going to continue to see that acceleration in velocity. So that real-time use case coming in, and having that cloud-native data stack, is going to be a really interesting evolution over the next five years, in my opinion, because I don't think you're going to be able to commoditize it the same way that we've seen in the past. But there will be ways to make it easier, right? We're going to have abstraction layers that exist. So just on that idea of evolution: how is the modern data stack not only a side effect of the way Google thinks about the cloud-native data management lifecycle, but how are you all trying to influence the way that it works, too, to address these streaming use cases?
Sudhir I think on my side, making it more seamless and easier to have those end-to-end pipelines run is one of the most critical things, right? There is no one-size-fits-all solution. So, having said that, for example, you are getting real-time events from various different sources, whether it's clickstream data or in-store purchases and all. If I can take that and seamlessly say, hey, anything that comes into Pub/Sub is directly available in BigQuery, I remove all the code that you need to write in between just to move bytes from one place to another. If you're not doing any transforms, assuming it's extract, load, and then transform, I want to simplify that. But on the other side, that doesn't solve all the use cases where you need things like streaming analytics on the fly, where you're making real-time decisions. So when people are doing that, having the ability to go out and seamlessly connect with Dataflow and Bigtable for those use cases where hyper-personalization is involved matters, because that's more of an application that people are building in real time. So enabling these things and removing barriers to building them is a big thing for me. The second thing is, as I talk to more and more customers who have done large-scale deployments, and this is what you were saying, Josh and Harper, the quality issues are becoming one of the bigger problems. We basically have customers telling us, hey, our pipelines fail and we don't know when they're failing, or some of the things don't work. I think that is becoming more and more of a challenge for everybody to figure out.
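A hedged sketch of the "Pub/Sub straight into BigQuery with no glue code" idea Sudhir mentions: creating a subscription that delivers messages directly into a table. This assumes a google-cloud-pubsub release recent enough to support BigQuery subscriptions; the project, topic, subscription, and table names are made up for illustration.

```python
# Hypothetical sketch: a Pub/Sub subscription that writes messages straight
# into a BigQuery table, removing the extract-and-load code in between.
# Requires a google-cloud-pubsub version with BigQuery subscription support;
# project, topic, subscription, and table names are made up for illustration.
from google.cloud import pubsub_v1

project = "my-proj"
subscriber = pubsub_v1.SubscriberClient()

subscription = subscriber.create_subscription(
    request={
        "name": f"projects/{project}/subscriptions/clicks-to-bq",
        "topic": f"projects/{project}/topics/clicks",
        # With a BigQuery config attached, Pub/Sub delivers each message into
        # the table directly; there is no pipeline job to write or operate.
        "bigquery_config": {
            "table": f"{project}.analytics.click_events",
            "write_metadata": True,  # also store publish time, message id, etc.
        },
    }
)
print("Created:", subscription.name)
```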
Josh Yeah, I think it's interesting, what Harper was alluding to. The modern data stack and these new tools have really democratized a more sophisticated level of analytics and data processing. And with that democratization, now more people have the right to vote and participate in data governance, essentially. If you move too quickly, you can start to break things, and if you're not thoughtful about the processes you have in place and the tools you have in place with that democratization, you risk getting bad data out to more stakeholders. I think part of this trend is the emphasis coming on data quality and making sure, as we distribute more of these tools to the organization and break down silos of skill, that it's not just the Python developers anymore that can build pipelines, it's the SQL developers. How do we make sure that these datasets are certified for use if we're increasing the velocity of the delivery of those datasets? What's interesting also about where we sit as a data vendor is that, as I mentioned before, we focus a lot on teams that are heavier in engineering, but they are also very heavy in analytics engineering now. Both of these trends are happening together, right? It's not like they're staffing up only on data engineers and have no analytics engineers; they're pushing forward in both of these categories really quickly. Our focus on the data engineering side is about how we make sure that there's good data coming into the environment in the first place, but on the analytics side we also see real critical importance in making sure that the datasets that are ultimately looked at by humans, the results that go into dashboards, the predictions that are run by machine learning models, are also certified and well monitored and observable. These are additional areas where we see really critical attention needs to be placed. What's also interesting, going back to the way that we see these data structures being built out: I don't know if Google is necessarily adopting the lakehouse terminology, that might be more of a Databricks term at this point. But what's interesting about GCP is that we do see that architecture really consistently being built out within the GCP environments that we work with. A lot of organizations are centralizing the delivery of data and the processing of data into BigQuery, into that centralized environment, and using it to store and process everything from highly unstructured streaming data up to the analytical workloads within BigQuery tables. And I think that paradigm of ELT going into the lake and then transformation happening again at the warehouse layer just shows the different points in the process where observability needs to be focused: what are the critical pain points along that path where things tend to break, and how can you make that process smoother so that the deliveries of data are more reliable going out to end users?
Sudhir Yeah, I think that makes sense. We do use the terminology of lakehouse; it's just one of the patterns that our customers want, and we think about how we enable it. Similarly data mesh, data lakes, there are different patterns, and we assume all of those are going to be possible to build on top of our overall GCP platform, with BigQuery being the center of the universe for us. But if you think about it, when you talk about extract, load, and then transform, and if you assume your unified storage is actually the base of it: in the old world, there was this thing where you moved data into HDFS, ran a bunch of processing in Hadoop or Spark workloads, then you moved it into a warehouse where you basically used SQL or some kind of processing for dashboards and stuff like that. But where the cloud platforms are moving, and especially on the Google Cloud side, is that the storage tier is a single unified one. Whether the data is stored in GCS as Parquet files or in BigQuery's native format, that's just a file format issue; for us, all the capabilities on storage are going to be consistent. So you put data into it, and one of these customers now has almost an exabyte of data in BigQuery, so we are talking about one of the largest data lakes you could ever build. And then you're running Spark or Beam or AI with TensorFlow or BigQuery SQL on the same data. So I think that modern data stack definition will evolve over a period of time, where your storage is unified across all of these different paths, whether it's S3, whether it's ADLS, whether it is BigQuery storage or attached disks, and then you do different kinds of processing on top of that, with common capabilities underneath. That is where I think the evolution will happen, where storage will just be a price-performance discussion for customers rather than a capability differentiation. And then the question is based on the persona of the user. If I am a data engineer with prolific skills in Java or Python, I will be working in that, because I like doing that, and you could use Beam or Spark as a framework for doing it. If I'm a data scientist, I'm going to use notebooks and TensorFlow or some other kind of machine learning platform to build models on top of the same storage tier. And you may be an analyst who's actually doing it with SQL on top of the same data, processing the data, analyzing it, and also making it ready for dashboards and all. Or, in many cases where we are going, we are empowering business users with things like Connected Sheets, where you can take massive amounts of data and do analysis and stuff like that on top of it. So the personas will change and different people will do different kinds of processing on the same data, but governance gets centralized in that case. The big challenge in that environment is going to be, how do we guarantee, or how do people trust, the data? We have this concept internally that we've talked about in data cataloging: trust is the most important thing. As you look at any asset, you should be able to go ahead and get some kind of trust score and know that it's a really trusted dataset. That is a place where we have to do a lot more innovation and figure things out, but I think that will be one of the big challenges for everybody.
Harper As a self-professed data geek, I'm totally excited when you talk about storage becoming almost irrelevant, where it doesn't matter where it sits, because anybody can come in with whatever their specific tool is, and you have these ports and adapters that allow you to interact with that data storage in the way that makes the most sense for you and allows you to continue that velocity I talked about a little earlier. But you make a good point: whenever you have that access, how do you establish and maintain the trust that exists in that data? And it comes back to the data quality stuff we've been talking about. I can tell you, there's no more frustrating conversation than the one that starts with, well, we've got this data quality library, but the data is not right. OK, cool. Well, why is that? Where's the problem? Where does it exist? I tend to equate data quality problems to gremlins in your system. I had a Volvo back in the day, and if you turned on the blinker, all of a sudden the windshield wipers would go, right? There was just this electrical gremlin that I could never figure out, and no matter how often I looked at the schematics and pulled out the different wires and reconnected them, something wasn't right. That's kind of how a data quality issue feels to me at times. Unless you have the ability to really see how the data moved from one source to one destination to one storage location, and how it was transformed between the steps, if you don't understand what your data's state was before, what your data's state was after, and what happened to it in between those two states, you're never going to find that gremlin, right? So that's what's interesting in the conversations we've had about where data observability fits into the modern data stack, and being able to really provide that insight into why your data quality isn't meeting the standards that you think you've set for yourself.
Josh Yeah, I would agree with that. I'm curious also, Sudhir, when you draw out this picture of almost like a data app store, where you have the single storage operating system with all the datasets, that exabyte level of information at the company you mentioned, and then all these different apps that different skill levels or expertise levels or just attention areas may come in and build out of. Do you have a nice fancy marketing term at Google yet to describe how you reference that architecture, or does it kind of call back to the other kinds of descriptions?
Sudhir Yeah, we are terrible at marketing, so we just call it a unified data platform or something. So OK, I need to, like Ali, come up with something like lakehouse that actually is catchy. But as we were talking about, that is the end state we want to be in, and we are making a lot of progress in that space. Just one more thing around quality I do want to highlight: managing data quality is hard, right? Before joining Google, I ran a data engineering organization for three and a half, four years, and I built the data platform. Most of it was Hadoop and Spark stuff running on GCP, with BigQuery as the warehouse, and we were building all the data pipelines. Quality was hard because it was manual; everything had to be manually set up. I had to define what the quality rules looked like for every table. Then we made it easy for any analyst to define their own rules, and we connected it with JIRA, and we used to get so many tickets every day. It was crazy, hundreds of tickets, because of false alarms and all. So the biggest problem in the space, as we were talking about, Josh, is the complexity of the environment: the data types are different, you are running these pipelines with different personas, and if your quality checks and your ability to monitor these things become too manual, nobody can do it perfectly well. I think that's the problem we all have. So the innovation that needs to happen is, one, how do we automate the collection of these metrics while the pipelines are running? Second, how do you automate alerting without having somebody define the rules? The old world of rules-based data quality systems, which I still see predominantly being used in different companies, just doesn't work, because it's impossible to scale with people as the types of data grow and you're adding more data products in the company every day. How are you going to go around and actually keep up with it? It's really hard to do, right? That's what's causing the biggest problem in organizations. One of our customers has more than fifteen or sixteen hundred projects in their BigQuery environment. So think about it: thousands of people trying to create data products out of it, and experimentation, and all of that. How do you manage that with manual processes? I think that's the bigger challenge in the industry.
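As a minimal sketch of the direction Sudhir is pointing at, automating metric collection and alerting instead of hand-writing rules per table, the snippet below profiles a hypothetical BigQuery table with one query and flags today's row count statistically against recent history rather than against a manually defined threshold.

```python
# Sketch of "automate the metric collection, automate the alerting": profile a
# table with one query and compare today's value to recent history instead of
# maintaining a hand-written rule per table. Table, column, and the in-memory
# history below are all hypothetical placeholders.
import statistics
from dataclasses import dataclass

from google.cloud import bigquery


@dataclass
class Profile:
    row_count: int
    null_user_ids: int


def collect_profile(client: bigquery.Client, table: str) -> Profile:
    """Collect a couple of freshness/quality metrics with a single query."""
    row = list(
        client.query(
            f"""
            SELECT
              COUNT(*) AS row_count,
              COUNTIF(user_id IS NULL) AS null_user_ids
            FROM `{table}`
            WHERE event_date = CURRENT_DATE()
            """
        ).result()
    )[0]
    return Profile(row_count=row.row_count, null_user_ids=row.null_user_ids)


def looks_anomalous(history, today, z_threshold=3.0):
    """Flag today's value if it sits far outside the recent distribution."""
    if len(history) < 7:
        return False  # not enough history to say anything useful yet
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history) or 1.0
    return abs(today - mean) / stdev > z_threshold


if __name__ == "__main__":
    client = bigquery.Client()
    profile = collect_profile(client, "my-proj.analytics.click_events")
    recent_row_counts = [120_400, 118_900, 121_300, 119_750, 120_050, 122_100, 119_300]
    if looks_anomalous(recent_row_counts, profile.row_count):
        print("ALERT: today's row count is far outside the recent range")
```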
Josh It's no question, it's a novel kind of issue that we as technologists face. We compare ourselves a lot to other kinds of observability tools that people know about in the application, cloud, and web world, like the Datadogs of the world or the New Relics of the world. The flavor of challenges you'll find as an SRE, which causes you to bring on a tool like New Relic, is that your applications are going up and down and it's hard to figure out when something crashes. It's a whole other level of complexity to map when data is not healthy, because it's so particular to the domain that you're going into, and there's just so much of it flowing in. So we really need to rethink the ways that we're monitoring what makes data reliable, what makes it unreliable, and how to generate trust in these systems. One of the novel ways that we're trying to think about it is just starting from the perspective of: what can our data sources tell us about the data itself? How are we making sure that we're collecting the right kind of information from the get-go, before we start layering on the really required and necessary layers of machine learning, anomaly detection, and more statistically driven techniques to understand where issues are coming from? Depending on where you source the information, that job becomes more or less challenging, right? If you're trying to detect data quality issues from just looking at a pie chart in Looker, for example, you may be able to detect that something is egregiously wrong, but heaven help you trying to figure out where that came from, especially if you're in one of these environments that are sourcing from tons and tons of data sources. If you're just looking at the warehouse layer, just querying some data at rest in the warehouse, you're getting closer, and for some teams that might be enough to call out the right issues. When you start layering in metadata or information that you collect from sources, when you start layering in information you get from how your pipelines are running, from the tools that are actually running your pipelines, and you start layering in information from your streaming platforms and all these different sources of insight, you just get closer and closer to being able to call out when a problem has occurred. And the machine learning that you develop in a way needs to do less of the work. It's still critical, still important, but it's sitting on a richer body of information to be able to call out issues. So that's some of the way that we think about it. It also just points to the importance of interoperability in this world, and hopefully GCP and other folks making sure that you have great APIs for vendors like us to pull from. But yeah, that's one of the vantage points we come at it from, I guess.
Harper You know, all those layers that you just talked about, it speaks to the context that's necessary to be able to understand the problem space that we're working in, right? That context that exists inside of data engineering is, I think, one of the leading reasons you have that manual effort you were talking about earlier, Sudhir. You can't just sit here and say there's one size fits all, this is the way you always handle time zone data; that context is important, and so is understanding where the data is coming from and why. But then the double-edged sword of that is, OK, we have this manual effort, we need these definitions from our analysts. I absolutely felt a shiver run down my spine when you said you enabled analysts to create their own data quality rules, because, yes, I get it, but I know why you suddenly had all those JIRA tickets popping up. Speaking to that, though, it's that idea of a source of truth, because data quality at the end of the day is testing, right? And in software testing, you can identify what your source of truth is, you can identify what that source object is going to be, and you can say what the state is going to be at the end of the day when that function runs. It's not quite as simple when it comes to data, because not only do you have the state of the data itself, you also have the code that's running on that data, you have the storage that's holding the data, and there are different facets and characteristics here that really require you to capture all those layers and that context. That way, you can then start thinking about how we can apply machine learning, how we can apply AI, how we can find a way to abstract the idea of data quality so that it's no longer a manual effort, and only once you have that context will you be able to do that. And even if you have that context, it's only going to work for company XYZ, whose context it captures; you can't then just hand it to company ABC. But I'm curious from your perspective: have you seen any interesting uses of machine learning in the data quality space? How do you see machine learning being used in data quality on the Google platform?
Sudhir I think it's really early stages on that one, right? Machine learning is new to a lot of organizations, and most of the use cases that I've seen for machine learning have been to improve business metrics on the business side of the house. We have customers doing recommendations and segmentation and all of those kinds of predictions, with packaged models and stuff like that. That's been the dominant use. I have seen only really early-stage thinking and usage where people are collecting some level of usage metrics to see how things are trending and whether something is breaking. Internally, we are doing a bunch of machine learning stuff, even in BigQuery, to figure out when things are failing, because at the scale that we run our fleets it's really impossible to detect issues without using some level of machine learning models. We look at what the success rates were, what the performance of these queries was, whether they are degrading over time, based on anomaly detection and all of that. We have started using that internally, and we have some ideas for how our customers could leverage some of that, based on the data they collect about their own environment and the pipelines that are running, but it's really early stage. I haven't seen that many organizations using machine learning models on top of collected data for data quality kinds of use cases. I don't talk to every one of our customers, but I haven't seen mass adoption of that yet in this space.
Honor Sudhir, if we were to ask you to make some predictions, where do you see this trend taking us, this evolution of the space, as far as the growing importance of real-time processing, the growing complexity of different use cases, as well as this ever-growing heterogeneity of external data sources that Josh was talking about?
Sudhir Like I always say, there are three trends that are constant: there is going to be more data that our customers are going to generate, there are going to be more users accessing that data and building on top of that data, and there are going to be more use cases that are going to be deployed, whether they go from simple analysis and dashboards to machine learning models to various other kinds of use cases built on top of it. That's the only constant. So what fundamentally needs to happen over the next two, three, four years is, I believe, first, automatic cataloging of all of the information that is there in your environments. Historically that's been a pain point, where users have had to go and import catalogs and metadata and all of that, so a centralized metadata catalog is going to be critical. The second is lineage tracking across these systems, and automated lineage tracking, not manually trying to cover everything everywhere; wherever we can automate that, that will be a critical area of innovation, and we are investing a lot in that space. The third is, in general, data quality on top of that metadata and lineage, and an understanding of all the processing that is happening on the data. Those are the three areas I think will be super critical. We recently launched a service called Dataplex, and the whole idea with Dataplex was to go in and have this common management and governance framework, but more importantly automatic data discovery, lineage tracking, as well as managing data quality. Those are the areas where I think a lot of innovation needs to happen, so that our customers can trust the data and the processes that they're running.
Josh So what do you think it means in terms of the startup ecosystem and the proliferation of tools that we’re seeing today?
Sudhir I think choice is always great, and a lot more of these services are coming together to go ahead and enable these use cases for different personas. I mean, a lot of our customers use dbt for transforms, and it's a great tool to go in and run on top of BigQuery and stuff like that. There is Fivetran, which we partner with, and a bunch of different companies coming into this space that we work with. I think there will be a proliferation of tools. My hope is that there is a common definition of how we expose some of these lineage kinds of events, and hopefully a standard is defined and developed so that we can interoperate on the catalog side as well as on the lineage side. That will help the whole ecosystem. Today everyone is defining their own, and I think that is one area where, as you mentioned, interoperability is critical. If you go ahead and build events to collect and run pipelines and observe those, and they are different on different services, even within GCP, or you go to another cloud and it's completely different, then it's going to be challenging to stitch all of this together into a single ecosystem. I'm not sure how we'll solve that, but I do believe some common standards would be really helpful on the API side.
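To illustrate the kind of lineage event Sudhir would like to see standardized, here is a hypothetical event shape in plain Python. The field names are illustrative only, not any real spec; efforts like OpenLineage are the kind of common standard being discussed.

```python
# Hypothetical shape for a cross-system lineage event. Field names are
# illustrative only, not a real spec; standards like OpenLineage define the
# actual interoperable format being discussed here.
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEvent:
    job_name: str                              # the pipeline or query that ran
    run_id: str                                # one execution of that job
    inputs: list = field(default_factory=list)     # upstream datasets
    outputs: list = field(default_factory=list)    # downstream datasets
    state: str = "COMPLETE"                    # e.g. START, COMPLETE, FAIL
    event_time: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


# If every engine emitted events with a common shape like this, a catalog could
# stitch lineage together across tools and clouds, and a quality tool could
# trace bad data back to its source.
event = LineageEvent(
    job_name="daily_click_rollup",
    run_id="2021-11-30-run-1",
    inputs=["bigquery://my-proj/analytics/click_events"],
    outputs=["bigquery://my-proj/analytics/daily_clicks"],
)
print(json.dumps(asdict(event), indent=2))
```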
Harper Don't worry, Josh, I took really good notes there. I'll have them ready for the roadmap meeting we have next quarter. But thank you. Before we close out, any thoughts you want to add on the predictions, Josh, when it comes to the space and where things might be going, either in the startup ecosystem or in being able to address the next wave of modern data stack evolution?
Josh I'm excited to get Sudhir back on the podcast, where I'm going to drill into open source versus closed source with him and get more perspective on that dimension of tooling. But I share his perspective about the increasing need for standards and how that fits into interoperability, so that we're not all recreating the wheel as we try to get this metadata out and make use of it.
Harper I think it's going to be really interesting to see how future products mesh with the way that dbt and transformation have evolved, whether they go to the same level of depth, or whether there's a more general layer of abstraction that allows for greater flexibility and greater interoperability with the different engineering practices that you see across different clients. It's definitely going to be an interesting space for the next, you know, probably 40 years, right? But we'll see how it goes over the next four to five. Sudhir, thank you so much, I really loved talking to you, a lot of really great ideas. Honor, Josh, always a pleasure. Thanks, everybody, for listening, and until next time.
Honor Thank you.
Josh Thank you, everyone.
Josh I guess that's it. Thanks, everyone, thanks for being here.
Honor Hey, Harper, what’s up, how’s it going?
Harper Hey, I’m just enjoying the last few rays of Sun here before a Texas winter settles in, trying to get the day to day grind rolling out. How’s everything going with you? You’re excited for our guests that we have today.
Honor Yes, I am super excited. We have Sudhir from Google. Sudhir can you tell us a little about yourself?
Sudhir Hello, everyone, I’m so happy to be here. I’m the senior director of product management for all of our data analytics services at Google Cloud. So everything from BigQuery, which is our cloud scale data warehousing solution to all the data processing services and messaging services, cataloging data quality, governance, management, all of that area. That’s what I focus on. I’ve been with Google for almost 40 years now.
Honor Awesome. Well, I’m so really happy to have us in the room together today, and a favorite topic of ours in this space, as always, is the modern data stack and really glad to have you here to give us a little bit more perspective on how Google sees this. But maybe we can start with how do you personally define the modern data stack?
Sudhir I think that’s a great question, and there’s a standard definition that the industry uses and I have great relationships with with George at Fivetran. And normally when when you talk about modern data stacks, it is Fivetran, some kind of an extract load tooling something like Fivetran, something that can transform the data or load the data into a cloud native animals like big query. Then you use something to transform the data that’s dbt or the company that we bought a year back. It’s called Dataform, but some kind of a transformation engine that runs on top of the cloud data warehouse. And then you have the visualization layer, something like looker for or getting value and building applications. I actually think about it a little more wider than that. I started coining this term called cloud native modern data stack. And what I mean by that is more and more organizations are moving towards real time enterprises. If you look at most of our large customers that we work with our partner with, they’re more focused on providing a real time, personalized experiences for their customers. And when you start looking at that kind of a use case, you know, extract and load is great when you are working with SaaS applications, and this is where a lot of these organizations started with. But when you look at the real time events coming from somebody on the website trying to click through things or some person trying to buy stuff in the store and you want to go out and provide them with a better experience, you do get these Real-Time events and you have to process them in real time and then make the decisions, whether it’s fraud alert or personalized offers or something like that. And when you’re looking at that, the modern data stack in my world actually shifts here to focus more on how do you collect these events, process them in real time, make real time decisions. And then of course, they have to land into a cloud data warehouse like BigQuery for further analysis and all. But I think I look at it more broader than just limiting to few use cases that that may be common knowledge in the industry. I do think we are famous important. And then how do you make the real time decision becomes more interesting in the future?
Honor That’s a really great point that we are moving in that direction. What kind of impact are you seeing from an angle of what? Google customers are going to want in product design as we move towards more of a real time need of alerts and identifying data issues and the entire host of needs.
Sudhir Yeah, I know, I think that’s a great question. So if you take a step back and go back to what I was saying, you get real time events coming in. You collect them in one of these systems, like whether it’s Kafka, whether it is some like some kind of a service from variable process, these large scale events and in some technology, which could be something like data flow, which is that cloud native service for doing streaming analytics or spark streaming. If you want to use one of our partners like Databricks, you can do that and then you’re making real time decisions. You would need a serving layer, something like age based or big table. You may want to then move that into BigQuery for creating machine learning models. And then you are going to go ahead and leverage those models for giving personalized experiences. Now imagine that environment and the complexity of the environment, right? The most important thing is when data moves across systems, you want to make sure one the quality of data is high. Second, any kind of failures of data in the system is caught really early because you don’t want bad data plans putting into all the different systems. All. So I think I do believe one of the key things are design principles we have to figure out, and the industry is actually going through. A lot of innovation in this space is around this quality of data moving across the system. So data in motion, how do we make sure that you have high quality? And all, I think, is one of the biggest challenges for organizations because one wrong or one misaligned dataset into one of these systems can actually cause bad experiences for customers. And that’s nobody. Nobody wants that.
Honor Yeah, for sure, and I feel like, Josh, this is precisely the area that you frequently speak about is data quality of data and motion, and how do we prevent these issues from cascading by catching them at the source? So in the context of what Sudhir is talking about, this real time need of all the different processes where we’re seeing, what are the implications for where data observability sits and how it will pan out?
Josh Yeah, well, first of all, I think just on the definition of the the data stack, I think I really agree with how Sudhir reframe this, which is for there really isn’t a one size fits all solution for every organization out there. There’s multiple different modern data stacks that are going to emerge in the market depending on different segmentations of companies. So we’re pointed out one interesting dichotomy, which is organizations that require real time use cases, which clearly are not well covered within that standard stack of tools talked about within the modern stack today and require heavier firepower from solutions like Kafka and other technologies. And that creates a level of complexity which will take you in some ways out of the structured data warehouse and include some level of data. Lake or the staging environment work where a lot of raw data can be pushed and then transformed in some way before it gets loaded into the warehouse in a more of a structured form. IE, the way that we will often talk about this dichotomy of different flavors of the modern stack is between teams that are more analytics weighted versus teams that are more engineering weighted. Now, of course, every data team is going to be doing some balance of these activities, but in certain organizations, there’s just a lot more requirements for tools that have more. Larger amounts of functionality that help structured, unstructured, semi-structured data get into the environment in the first place, which will pull the team towards a heavier engineering posture using tools like pythons or Databricks data up front. What we often see with in our user base as well, for example, are teams that work with lots and lots of data sources. The more data sources you have, the more likely you’re going to be self-managing. The ingestion for those data sources and all of your your data providers aren’t going to be covered by tools like five, 10 or HVR. And this also pulls data teams into more engineering oriented activity and more Python oriented activity. So this is another way that we look at that dichotomy. This also creates a difference in posture to observability. The more if you imagine the flow of data looking like a triangle coming from single sources or just a few sources, and then all the activity kind of burgeons out towards the analytics end, that’s going to tend to be where your complexity is, and you’re going to focus a lot of your observability attention over there, like how people are using this simple data or fewer sources of data that you’re pulling into your stack. If you have more complexity at the initial stages, so there’s more thought into how you get real time data and or how you get lots and lots of data sources and and your your data flow looks a little more like an hourglass as opposed to a triangle. The requirements for observability really get shifted to the left. They get shifted forward in the data flow, and it becomes more important to understand where issues are coming from, from the source. The reason being is because if you’re only looking at the tables or the dashboards downstream, your team will still be spending a lot of time trying to find where those issues are coming from. You might be able to catch a problem, but by tracing them back to the source and being able to quickly debug or solve the issue becomes really, really complicated. So that’s one of the ways that we look at this difference in observability, depending on the different stacks that a team might be using.
Sudhir I think just to add to that, Josh, if you think about I love the simplicity of the modern data stack around extract load and then transform and and perform things and which is great, especially as you were mentioning, you have a bunch of SaaS application, some standard databases. I think it’s easier to go ahead and run with that. But when I work with large enterprises and their environments are super complicated, wherein you have mainframes and we have one of the largest retailers that we work with and they’re one of their most critical data was actually still in mainframes, and we have to integrate the mainframe data in which the formats are not clear and all of that. And then then there are other applications which are enterprise applications, and it’s great to see an acquisition by by Fivetran wherein now they are entering into the enterprise space. But I think those kind of enterprise applications, not just SMP, but Oracle and various other applications, they have very different posture. You need to go to APIs to call them and you have to process them and all of that. And then then, as I said, the real time aspect of like, we’re in there, a lot of custom software being developed for e-commerce sites and various other systems that are being built. And how do you handle that data? And the sooner you need to make decisions and power real time applications, it becomes more and more interesting that, as you said, that processing is moving then towards the left on the data engineering side in stuff analytics, engineering. And so it’s amazing that we have coined those two terms now. But yeah, you’re right. Like, I think there’ll be organizations that will be heavier on data engineering and running things through all of these complex environments, doing processing, making the data ready and then finally going into the analytics layer. And then there’ll be organizations that will be moving data directly into the analytics layer and doing an index engineering on audit, saying the way we think about it is it doesn’t matter whether you’re doing spot processing or anything versus doing a sequel based processing in BigQuery. We want to have a single unified storage idea that you can just move around like some of our largest customers like Twitter and all move all of their data sets hundreds of petabytes into BigQuery storage, and then they run Spark processing. On top of that, when they run data flow with stream processing on top of the same data assets. And then they also run sequel schedule queries and all of that. So you see different engines fitting on top of the common unified storage tier. But then the complexity of how do you manage? The environment exam and making sure that you have there as you are talking about observability, like how do you make sure these pipelines are running seamlessly across these technologies becomes a big challenge for consumers.
Harper I think one of the things that’s really exciting to have the conversation shift back toward the left hand side of the pipeline is that we’re seeing the effects of kind of the problems that existed in the data space for a long time. The modern data stack, in my opinion, has come about from the biggest pain point that data teams had when they were on prem. It was always the velocity to deliver. And so you see these products coming out and they were coming out in their commoditizing the ability to provide analytics, engineering. And so it’s great that we now have these categories of extract load tools. We have these categories of transformation tools, and it really has done great work for getting more companies in front of their data, allowing them to analyze that data, make better decisions and overall just increasing the data literacy of the industry as a whole. However, as you know, as an engineer, I want to get into those APIs, right? Like, I want to get into the complexity that exist when you shift left up into the source data that’s coming in, and it’s hard to really abstract that into a commodity that’s going to work for every single enterprise. And it’s fun for me to like, sit here and like wax poetic about the evolution of this right? Like, you saw kind of something similar that occurred when, like data warehouse came out and people were finding ways to create these ETL tools, whether it be Informatica as ISIS and then as cloud native came out, you had a GCP kind of provide ways to have this commoditized services that allowed software engineering to occur in the cloud to make it easier. I think that’s what has kind of given the template for this modern data side to come out. But now that we’ve reached this point where it’s kind of matured a little bit, we kind of understand the space a little bit more. We’re recognizing that we aren’t addressing all of the use cases that exist there. And the other thing that gets left behind when you focus on speed delivery is the quality like we’ve touched on, like a if. If a report is updated on a daily basis and if they’re able to react quickly to the questions that are asked, people tend to trust that report and they’ll ask you any questions. But as soon as it breaks, they say, Oh well, why did it break? And only then do you start thinking about quality. And if we can remove that question and concern of why did it, that’s why. Why are these numbers wrong by addressing data quality as soon as it comes into your system? And that’s going to make everybody’s life easier. And we’re going to continue to see that acceleration in velocity. So that real time use case that comes in and having that cloud data or cloud native data stack is going to be a really interesting evolution over the next five years. And in my opinion, because I don’t think you’re going to be able to commoditized it the same way that we’ve seen in the past. But there will be ways to make it easier, right? We’re going to have abstract layers that exist. So just kind of on that idea of like evolution, how is not only the modern day, the side effect of the way Google thinks about the cloud native data management lifecycle? But how are you all trying to influence the way that that works, too, to address these streaming use cases?
Sudhir On my side, making it more seamless and easier to run those end-to-end pipelines is one of the most critical things, right? There is no one-size-fits-all solution. Having said that, for example, if you are getting real-time events from various sources, whether it's clickstream data or in-store purchases, and I can say, hey, anything that comes into Pub/Sub is directly available in BigQuery, then I remove all the code you would otherwise need to write in between just to move bytes from one place to another. If you're not doing any transforms, assuming it's extract, load, and then transform, I want to simplify that. But on the other side, that doesn't solve all the use cases where you need streaming analytics on the fly and you're making real-time decisions. For those, having the ability to seamlessly connect with Dataflow and Bigtable matters, especially where hyper-personalization is involved, because that's more of an application people are building in real time. So enabling these things and removing the barriers to building them is a big thing for me. The second thing, as I talk to more and more customers who have done large-scale deployments, and this is what you were saying, Josh and Harper, is that quality issues are becoming one of the bigger problems. We basically have customers telling us, hey, our pipelines fail and we don't know when they're failing, or only some of the things work. I think that is becoming more and more of a challenge for everybody to figure out.
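A minimal Apache Beam sketch of the streaming path described above, under the assumption that events arrive as JSON on a Pub/Sub topic and land in a BigQuery table. The topic, table, and schema are made up; a production Dataflow job would add windowing, dead-lettering, and schema management.

```python
# Streaming sketch: Pub/Sub -> light parsing -> BigQuery, runnable on Dataflow.
# All resource names and fields are illustrative only.

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/clicks")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:analytics.click_events",
            schema="user_id:STRING,url:STRING,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```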
Josh Yeah, I think it's interesting, what Harper was alluding to. The modern data stack and these new tools have really democratized a more sophisticated level of analytics and data processing. And with that democratization, more people now have a vote and participate in data governance, essentially. If you move too quickly, you can start to break things, and if you're not thoughtful about the processes and tools you put in place alongside that democratization, you risk getting bad data out to more stakeholders. I think part of this trend is the emphasis on data quality and making sure, as we distribute more of these tools across the organization and break down silos of skill, that it's not just the Python developers who can build pipelines anymore, it's the SQL developers too. How do we make sure these data sets are certified for use if we're increasing the velocity of delivering them? What's also interesting about where we sit as data vendors is that, as I mentioned before, we focus a lot on teams that are heavier in data engineering, but those teams are also very heavy in analytics engineering now. Both of these trends are happening together, right? It's not like they're staffing up only on data engineers and have no analytics engineers; they're pushing forward in both categories really quickly. Our focus on the data engineering side is about making sure there's good data coming into the environment in the first place. But on the analytics side, we also see real, critical importance in making sure that the data sets ultimately looked at by humans, the results that go into dashboards, the predictions produced by machine learning models, are also certified, well monitored, and observable. These are additional areas where we see really critical attention needing to be placed. What's also interesting, going back to the way we see these data architectures being built out: I don't know if Google is necessarily adopting the lakehouse terminology, that might be more of a Databricks term at this point, but what's interesting about GCP is that we do see that architecture built out really consistently in the GCP environments we work with. A lot of organizations are centralizing the delivery and processing of data into BigQuery, into that centralized environment, and using it to store and process everything from highly unstructured streaming data up to the analytical workloads in BigQuery tables. And I think that paradigm of ELT going into the lake and then ELT happening again at the warehouse layer just shows the different points in the process where observability needs to be focused: what are the critical pain points along that path where things tend to break, and how can you make that process smoother so that the deliveries of data going out to end users are more reliable?
Sudhir Yeah, I think that makes sense. We do use the lakehouse terminology; it's just one of the patterns our customers want, and we ask how we enable it. Similarly with data mesh and data lakes: there are different patterns, and we assume all of them should be possible to build on top of the overall GCP platform, with BigQuery being the center of the universe for us. But think about it: when you talk about extract, load, and then transform, and you assume a unified storage tier is the base of it. In the old world, you moved data into HDFS, ran a bunch of processing with Hadoop or Spark workloads, then moved it into a warehouse where you used SQL or some other kind of processing for dashboards and the like. Where the cloud platforms are moving, and especially Google Cloud, is toward treating that storage tier as a single unified one. Whether the data sits in GCS as Parquet files or in BigQuery's native storage format is just a file format question; for us, all the capabilities on storage are going to be consistent. You put data into it, and a lot of these customers now have close to an exabyte in BigQuery, so we are talking about the largest data lake you could ever build. And then you're running Spark or Beam or AI with TensorFlow or BigQuery SQL on the same data. So I think that modern data stack definition will evolve over time to where your storage is unified across all of these different platforms, whether it's S3, whether it's ADLS, whether it's BigQuery storage or attached disks, and then you do different kinds of processing on top of that, with common capabilities underneath. That is where I think the evolution will happen: storage will just be a price-performance discussion for customers rather than a capability differentiation. And then the question becomes the persona of the user. If I am a data engineer with strong skills in Java or Python, I'll be working in that, because I like doing that, and I could use Beam or Spark as a framework. If I'm a data scientist, I'm going to use notebooks and TensorFlow or some other machine learning platform to build models on top of the same storage tier. And you may be an analyst who works with SQL on top of the same data, processing it, analyzing it, and making it ready for dashboards. Or, in many cases where we are going, we're empowering business users with things like Connected Sheets, so you can take massive amounts of data and do analysis on top of it. So the personas will change, and different people will do different kinds of processing on the same data, but the governance gets centralized in that case. I think the big challenge in that environment is going to be: how do people trust the data? We have this concept internally, we've talked about it in data cataloging, that trust is the most important thing. As you look at any asset, you should be able to get some kind of trust score and know that it's a really trusted data set. That is a place where we have to do a lot more innovation and figure things out, but I think it will be one of the big challenges for everybody.
Harper As a self-professed data geek, I'm totally excited when you talk about storage becoming irrelevant, in the sense that it no longer matters where the data sits, because anybody can come to it with whatever specific tool they want, and you have these ports and adapters that let you interact with that storage in the way that makes the most sense for you, which lets you keep up the velocity I talked about a little bit earlier. But you make a good point: once you have that access, how do you establish and maintain trust in that data? And it comes back to the data quality stuff we've been talking about. I can tell you, there's no more frustrating conversation than the one that starts with, well, we've got this data quality library, but the data is not right. OK, cool. Well, why is that? Where's the problem? Where does it exist? I tend to equate data quality problems to gremlins in your system. I had a Volvo back in the day, and if you turned on the blinker, all of a sudden the windshield wipers would go, and there was just this electrical gremlin I could never figure out. No matter how often I looked at the schematics, pulled out the different wires and reconnected them, something wasn't right. That's how a data quality issue feels to me at times. Unless you can really see how the data moved from one source to one destination to one storage layer, and how it was transformed between the steps, if you don't understand what your data's state was before, what its state was after, and what happened to it in between, you're never going to find that gremlin, right? And that's what's interesting in the conversations we've had about where data observability fits into the modern data stack: being able to really provide that insight into why your data quality isn't meeting the standards you think you've set for yourself.
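As a rough illustration of the "state before versus state after" point, a pipeline could profile a dataset on both sides of a transformation step and diff the profiles so a quality gremlin can be traced to the step that introduced it. The file names, columns, and thresholds below are purely illustrative.

```python
# Profile a dataset before and after a transform, then flag suspicious changes.

import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Cheap per-column profile: row count, null rate, and distinct count."""
    return {
        "rows": len(df),
        "null_rate": df.isna().mean().round(4).to_dict(),
        "distinct": df.nunique().to_dict(),
    }

def diff_profiles(before: dict, after: dict) -> list:
    """Flag changes that often indicate a broken step (thresholds are arbitrary)."""
    findings = []
    if after["rows"] < 0.9 * before["rows"]:
        findings.append(f"row count dropped: {before['rows']} -> {after['rows']}")
    for col, rate in after["null_rate"].items():
        if rate > before["null_rate"].get(col, 0) + 0.05:
            findings.append(f"null rate jumped in '{col}': {rate}")
    return findings

raw = pd.read_parquet("raw_orders.parquet")             # state before the transform
transformed = pd.read_parquet("clean_orders.parquet")   # state after the transform
for finding in diff_profiles(profile(raw), profile(transformed)):
    print("POSSIBLE GREMLIN:", finding)
```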
Josh Yeah, I would agree with that. I'm curious, Sudhir, when you draw out this picture of almost a data app store, where you have the single storage operating system holding all the data sets, that exabyte level of information in the company you mentioned, and then all these different apps that different skill levels, expertise levels, or attention areas come in and build on top of it: do you have a nice, fancy marketing term at Google yet to describe that architecture, or is it a callback to one of the other descriptions out there?
Sudhir Yeah, we are terrible at marketing, so we just call it a unified data platform or something. I need to do what Ali did and come up with something like "lakehouse" that's actually catchy. But as we were talking about, that is the end state we want to be in, and we're making a lot of progress in that space. Just one more thing around quality that I do want to highlight: managing data quality is hard, right? Before joining Google, I ran an ads engineering organization for about three and a half, four years, and I built the data platform. Most of it was Hadoop and Spark stuff running on GCP, with BigQuery as the warehouse, and we were building all the pipelines. Quality was hard because it was manual: everything had to be manually set up, and I had to define what the quality rules looked like for every table. Then we made it easy for any analyst to define their own rules, we connected it to JIRA, and we used to get so many tickets every day. It was crazy, hundreds of tickets, because of false alarms and the like. The biggest problem in the space, as we were talking about, Josh, is the complexity of the environment: the data types are different, you are running these pipelines with different personas, and if your quality checks and your ability to monitor these things are too manual, nobody can do it perfectly well. That's the problem we all have. So I think the innovation that needs to happen is, one, how do we automate the collection of these metrics while the pipelines are running; and second, how do you automate alerting without having somebody define the rules? The old world of rules-based data quality systems, which I still see predominantly used in different companies, just doesn't work, because it's impossible to scale with people as the types of data grow and you add more data products in the company every day. How are you going to keep up with it? It's really hard, right? That's what's causing the biggest problems in organizations. One of our customers has more than fifteen hundred, sixteen hundred projects in their BigQuery environment. Think about that, with thousands of people trying to create data products out of it, plus experimentation and all of that. How do you manage that with manual processes? I think that's the bigger challenge in the industry.
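One hedged sketch of "collect the metrics automatically instead of hand-writing rules per table": walk every table in a hypothetical BigQuery dataset and record the same small set of volume and freshness metrics with no per-table configuration. Where the metrics are stored and how alerts fire is left out for brevity; the dataset name is made up.

```python
# Zero-configuration metric collection over every table in one dataset.

import datetime as dt
from google.cloud import bigquery

client = bigquery.Client()
DATASET = "my-project.analytics"  # illustrative dataset id

for table in client.list_tables(DATASET):
    t = client.get_table(table.reference)
    # In practice these records would land in a metrics store; printing for brevity.
    print({
        "table": f"{t.project}.{t.dataset_id}.{t.table_id}",
        "row_count": t.num_rows,
        "size_mb": round(t.num_bytes / 1e6, 1),
        "last_modified": t.modified.isoformat(),
        "collected_at": dt.datetime.utcnow().isoformat(),
    })
```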
Josh It's no question that it's a novel kind of issue we as technologists face. We compare ourselves a lot to the other observability tools people know from the application, cloud, and web world, the Datadogs and New Relics of the world. The flavor of challenge you find as an SRE, the thing that causes you to bring on a tool like New Relic, is that your applications are going up and down and it's hard to figure out when something crashes. It's a whole other level of complexity to map when data is not healthy, because it's so particular to the domain you're in, and there's just so much of it flowing. So we really need to rethink the ways we monitor what makes data reliable, what makes it unreliable, and how to generate trust in these systems. One of the novel ways we try to think about it is to start from the perspective of what our data sources can tell us about the data itself: how are we making sure we're collecting the right kind of information from the get-go, before we start layering on the really necessary machine learning, anomaly detection, and more statistically driven techniques to understand where issues are coming from. Depending on where you source the information, that job becomes more or less challenging, right? If you're trying to detect data quality issues just by looking at a pie chart in Looker, for example, you may be able to tell something is egregiously wrong, but heaven help you trying to figure out where it came from, especially if you're in one of these environments sourcing from tons and tons of data sources. If you're just looking at the warehouse layer, querying some data at rest in the warehouse, you're getting closer, and for some teams that might be enough to call out the right issues. But when you start layering in the metadata you collect from sources, the information you get about how your pipelines are running, from the tools that actually run your pipelines, and the information from your streaming platforms, all these different sources of insight, you get closer and closer to being able to call out when a problem has occurred, and the machine learning you develop needs to do less of the work. Still critical, still important, but it's sitting on a richer body of information to call out issues with. That's some of the way we think about it. It also points to the importance of interoperability in this world, and hopefully GCP and other folks keep making sure there are great APIs for vendors like us to pull from. But yeah, that's one of the vantage points we come at it from.
Harper You know, all those layers you just talked about speak to the context that's necessary to understand the problem space we're working in, right? That context that exists inside of data engineering is, I think, one of the leading reasons you end up with the manual effort you were talking about earlier, Sudhir. You can't just say there's one size fits all, this is the way you always handle time zone data; that context is important, along with understanding where the data is coming from and why. But then the double-edged sword is, OK, we have this manual effort, we need these definitions from our analysts. I absolutely had a chill run down my spine when you said you enabled analysts to create their own data quality rules, because, yes, I get it, but I know why you suddenly had all those JIRA tickets popping up. Speaking to that, though, it's that idea of a source of truth, because data quality at the end of the day is testing, right? In software testing you can identify your source of truth, you can identify what the source object is going to be, and you can say what its state will be when the function runs. It's not quite as simple with data, because not only do you have the state of the data itself, you also have the code that runs on that data and the storage it lives in, and those different facets and characteristics require you to capture all those layers and that context. Only once you have that context can you start thinking about how to apply machine learning, how to apply AI, how to abstract the idea of data quality so it's no longer a manual effort. And even when you have that context, it's only going to work for company XYZ, whose context has been captured; you can't then just hand it to company ABC. But I'm curious, from your perspective: have you seen any interesting uses of machine learning in the data quality space? How do you see machine learning being used for data quality on the Google platform?
Sudhir I think it's really early stages on that one. Machine learning is still new to a lot of organizations, and most of the use cases I've seen have been about improving business metrics on the business side of the house: customers doing recommendations, segmentation, those kinds of predictions. That's been the dominant use. I've seen only really early-stage thinking and usage where people collect some level of usage metrics to see how things are trending and whether something is breaking internally. We are doing a bunch of machine learning even in BigQuery to figure out when things are failing. At the scale we run our fleets, it's really impossible to detect issues without some level of machine learning models, so we look at what the success rates were, what the performance of these queries was, whether they are taking more time, based on anomaly detection and all of that. We have started using that internally, and we have some ideas for how customers could leverage some of it based on the data they collect about their own environments and the pipelines they're running, but it's really early stage. I haven't seen that many organizations using machine learning models on top of collected data for data quality use cases. I don't talk to every one of our customers, but I haven't seen mass adoption in that space yet.
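As one possible shape for the anomaly-detection idea described here, and not a description of what BigQuery does internally, a small sketch that flags a pipeline run whose runtime deviates strongly from a rolling median. The runtime series is fabricated; in practice it would come from collected pipeline or query metrics like the ones above.

```python
# Flag anomalous runs with a rolling median and MAD instead of a hand-set threshold.

import pandas as pd

runtimes = pd.Series(
    [42, 40, 44, 41, 43, 39, 120, 42],  # seconds per daily run (made-up values)
    index=pd.date_range("2021-11-01", periods=8, freq="D"),
)

window = 5
median = runtimes.rolling(window, min_periods=3).median()
mad = (runtimes - median).abs().rolling(window, min_periods=3).median()
score = (runtimes - median).abs() / (1.4826 * mad)

print(runtimes[score > 3])  # only the 120-second run is flagged as anomalous
```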
Honor Sudhir, if we were to ask you to make some predictions, where do you see this trend taking us, this evolution of the space, as far as the growing importance of real-time processing, the growing complexity of use cases, and the ever-growing heterogeneity of external data sources that Josh was talking about?
Sudhir As I always say, there are three trends that are constant: there is going to be more data that our customers and their users generate, there are going to be more users accessing that data and building on top of it, and there are going to be more use cases deployed on top of it, everything from simple analysis and dashboards to machine learning models and other kinds of applications. That's the only constant. So what I think fundamentally needs to happen over the next two, three, four years is, first, automatic cataloging of all the information in your environments. Historically that's been a pain point, where users have had to go and import catalogs and metadata manually, so a centralized metadata catalog is going to be critical. The second is lineage tracking across these systems, automated lineage tracking, not manually trying to cover everything everywhere; wherever we can automate that, it will be a critical area of innovation, and we are investing a lot in that space. The third is data quality in general, sitting on top of that metadata and lineage and an understanding of all the processing happening on the data. Those are the three areas I think will be super critical. We recently launched a service called Dataplex, and the whole idea with Dataplex was to provide this common management and governance framework, but more importantly, automatic data discovery, lineage tracking, and automated data quality management. Those are the areas where I think a lot of innovation needs to happen so that our customers can trust the data and the processes they're running.
Josh So what do you think it means in terms of the startup ecosystem and the proliferation of tools that we’re seeing today?
Sudhir I think choice is always great, and a lot more of these services are coming together to enable these use cases for different personas. Almost all of our customers use dbt for transforms, and it's a great tool to run on top of BigQuery. There's Fivetran, which we partner with, and a bunch of different companies coming into this space that we work with as well. So there will be a proliferation of tools. My hope is that there's a common definition of how we expose things like lineage events, and hopefully a standard gets defined and developed so that we can interoperate on the catalog side as well as the lineage side. That will help the whole ecosystem. Today everyone is defining their own, and I think that is one area where, as you mentioned, interoperability is critical. If you build events to collect, run, and observe pipelines, and they are different on different services, even within GCP, or completely different on another cloud, then it's going to be challenging to stitch all of this together into a single ecosystem. I'm not sure how we'll solve that, but I do believe some common standards around these APIs would be helpful.
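Purely as an illustration of what a standardized lineage event might carry, and not any particular standard's schema, a pipeline step could emit something like the following. Every name and identifier here is made up.

```python
# A hand-rolled, illustrative lineage event: which job ran, what it read,
# and what it wrote, in a shape that a catalog could ingest.

import json
import datetime as dt

lineage_event = {
    "event_time": dt.datetime.utcnow().isoformat() + "Z",
    "job": {"namespace": "orders-pipeline", "name": "load_daily_orders"},
    "run_id": "2021-11-30T06:00:00-run-001",  # illustrative identifier
    "inputs": [{"namespace": "gcs", "name": "raw/orders/2021-11-30/*.parquet"}],
    "outputs": [{"namespace": "bigquery", "name": "my-project.analytics.orders"}],
    "state": "COMPLETE",
}

print(json.dumps(lineage_event, indent=2))
```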
Harper Don't worry, Josh, I took really good notes there. I'll have them ready for the next roadmap meeting we have next quarter. But thank you. Before we close out, any thoughts you want to add on predictions, Josh, when it comes to the space and where things might be going, either in the startup ecosystem or in addressing the next wave of modern data stack evolution?
Josh I'm excited to get Sudhir back on the podcast, where I'm going to drill into open source versus closed source with him and get more of his perspective on that dimension of tooling. But I think he shared his perspective on the increasing need for standards and how that fits into interoperability, so that we're not all reinventing the wheel as we try to get this metadata out and make use of it.
Harper I think it's going to be really interesting to see how future products mesh with the way dbt and transformation have evolved, whether they go to the same level of depth or offer a more general layer of abstraction that allows for greater flexibility and interoperability with the different engineering practices you see at different clients. It's definitely going to be an interesting space for the next, you know, probably 40 years, right? But we'll see how it goes over the next four to five. Sudhir, thank you so much, I really loved talking to you, a lot of really great ideas. Honor, Josh, always a pleasure. Thanks, everybody, for listening, and until next time.
Honor Thank you.
Josh Thank you, everyone.
Sudhir Thanks, everyone. Thanks for having me here.