How To Treat Data As A Product

Colleen Tartow is the Director of Engineering for Starburst Data, a product suite that gives users the ability to self-manage their data infrastructure. We had a great time in the MAD Data studio as she shared insights on how data-driven companies should treat their data as a product and dove into the category of data mesh. Colleen also holds a Ph.D. in astrophysics which makes her truly out of this world!


About Our Guests

Colleen Tartow

Director of Engineering, Starburst Data

Colleen Tartow has spent more than 15 years as a leader in the data, engineering, and analytics space. Her passions include the advancement of women in technology, mentorship, growing teams, and of course data. She has a Ph.D. in astrophysics and is the Director of Engineering for Starburst Data.

Episode Transcript

Ryan: Hey everyone, my name is Ryan Yackel and I am the host of the MAD Data Podcast, and today we have a very special guest on our podcast, Colleen Tartow, who’s the Director of Engineering over at Starburst. How are you doing today, Colleen?

Colleen: Hey, I’m great. I’m excited to be here. Thanks for having me.

Ryan: I guess I should ask you, how are you doing today, Josh? How you doing?

Josh: Doing well. Things are all right. It’s another sunny day in Philadelphia over here and excited to talk with Colleen today on our podcast.

Ryan: I like how you did that little double entendre there with It’s Always Sunny in Philadelphia. That’s one of my favorite shows, actually.

Josh: It’s always sunny here. So.

Ryan: Okay. So today we’re going to be talking about how to treat data as a product. But first, I actually want to get Colleen to give some of her career background and how she got going, especially now that she’s Director of Engineering over at Starburst. So go ahead, Colleen, let us know: how did you get to where you’re at today? What are some fun things the audience would like to know about how you got into this career, and maybe something that nobody knows about you? That would be great to understand.

Colleen: Thanks. Yeah, I’m Colleen Tartow. I run enterprise engineering at Starburst here in Boston, and I’ve been at this data game for a long time now. I actually got my start in astrophysics. I have a Ph.D. in astrophysics, where I studied starburst galaxies, and I basically was taking giant datasets from telescopes and then slicing and dicing the data to come up with stories to tell about the universe. And so at its heart, it was really data and analytics, which is kind of cool. And then I kind of switched over to the software world, and my background since then has been more in the data engineering and analytics space, where I was first a data engineer and an analyst, and then I was building and running data and analytics organizations across different technologies, in different industries, prior to coming to Starburst. And, let’s see, I came here almost two years ago, and I ended up coming here because I deeply believe that Starburst, which is built on the open-source Trino, can actually address and fix a lot of the common pain points that data engineers and analytics people feel every day, problems I had personally felt as well.

Ryan: Awesome. What was a really fun astrophysics project that you worked on?

Colleen: I got to use some really amazing telescopes, including Keck in Hawaii, Hubble, the Very Large Array in New Mexico, and Arecibo in Puerto Rico. So I got to travel a lot, which was fun. Let’s see, my thesis project was around, you know how if you look at the Hubble images that show how many galaxies there are in the universe, in between the galaxies it’s black? Those regions are not actually empty. There’s a lot of stuff between the galaxies, in what they call the intergalactic medium. And so the question is, how did all that stuff get there? And part of my theory was that it came from these smaller starburst galaxies, meaning they were going through a burst of star formation which would push matter out into the intergalactic medium. And so we had a few hypotheses that we proved and disproved from that theory. That’s what I did for my Ph.D., which was really fun.

Josh: There’s a really famous photo of, I think it’s the Hubble, maybe, zooming in on a black spot in between the stars or galaxies that are in view, and then blowing that up and seeing what look to be more stars and galaxies, as, I think, a statement on how vast the universe is. Is that what you’re talking about?

Colleen: Yeah, I studied mostly the nearby universe, but yeah, the Hubble Deep Field is just that. It’s like, you know, just looking deeper and deeper, and the further away you look in space, the farther back in time you’re looking, because the light has taken that long to reach you. So you’re actually looking closer and closer to the beginning of the universe, which is kind of insane.

Josh: Yeah.

Ryan: I feel like we’re talking to Neil deGrasse Tyson right now and just getting a download of the universe.

Colleen: I’m a little less famous than him.

Ryan: Yeah, well, so one of the things that we had talked about, too, is getting your perspective on some trends that you’re seeing in the space, especially at Starburst. I see you doing a lot of thought leadership stuff. You’ve got these really cool Hot Mesh hats, like the one I’m wearing to represent Starburst here. But what are some of the trends that you’re seeing in the data analytics space that should be on our radar? And what are some roles and responsibilities that, over the next five years, are going to be core, you know, pretty vital in our space?

Colleen: Yeah, I think the number one trend is obviously the Hot Mesh trucker hat from Starburst. That’s the hottest trend in 2022. You know, I generally do think this is a really exciting time and there’s so much going on and it can be really hard to keep track. I feel like that’s a full-time job, just keeping track of what’s going on in this industry. So I do love this question, because I love hearing other people’s answers, too. I’d say there are several trends that I think about a lot. The number one trend is probably data mesh, which is a framework around a decentralized data management paradigm, and then things like data observability or data quality automation frameworks. There’s just a lot going on around, you know, the separation of storage and compute and data access generally. We’ve got the cloud, we’ve got managed services, we’ve got SaaS and PaaS and IaaS, and, you know, it’s really just opened up all sorts of possibilities around data and analytics. And so I think that simply the vast number of combinations of these technologies and frameworks that a company can have is leading to this, like, optionality and maturity that we haven’t seen before in data, because people no longer have to think about hardware. They can think about software, and they no longer have to think about managing software because they have a managed service. They can think about, you know, the business value of the data, which is really what we’re trying to get at here. And so I think, you know, people are having fun trying to wrap their heads around it and trying to figure out how they fit into that space and, you know, what the platonic ideal of that space looks like. As for roles and responsibilities, the role that I think about a lot is the data product owner, and that comes out of treating data as a product.
And when we start applying product thinking to data, you have to treat it like any other product, and you end up with these, you know, architectural and functional groups of closely related data sets that are called data products. And because there are consumers depending on the data products to extract business value from data, you actually do need a data product owner in that paradigm. And so I think also, you know, having been in that world for so long, finding solid leadership in any organizational construct is challenging. So, like, data engineering leadership is always a challenge. It was always hard for me to find good data engineering leaders, so I’m always biased towards that as a career path, because that’s what I’ve chosen. But I do think that, having looked for folks to fill those roles in the past, data product owners, data architects, data engineering leaders, there’s a real dearth of people in those roles. So I would hope we see more people pursuing those career paths in the future.

Josh: Is the data product owner a persona that Starburst addresses directly with your product, or somehow indirectly? How do you see that in your usage map?

Colleen: Yeah, I mean, we definitely do. We have the idea of data products built really solidly into our software now. We have a product called data products. And so the data product owner would, you know, own an experience within that and really drive the idea of data as a product across their function within a company. And so a data product owner is, you know, someone who is fairly technical but not necessarily a data engineer, but they understand the analysts and they understand the data engineering at a high level. And so they can sort of be the translator between those two worlds.

Josh: I think one of the interesting trends that we really track here is the parallels between what’s happening now in DataOps and what happened ten, 15 years ago in DevOps, and how the roles separated and specialized within computer science organizations: software engineers that focus on building software products and, underneath them or next to them, DevOps folks and SREs that focus on managing the infrastructure. We definitely see those parallels within data teams today, a separation of roles between analytics engineers, data engineers, and data platform, just helping each link in the chain focus more on what they do so that they do it well. So that comment about the data product owner emerging as another role within the data team follows along, I think, those tracks and similar patterns that we saw in the DevOps world. Another interesting thing is that, just like in DevOps, we’re starting to see folks use measurement more in how they evaluate the performance of the data org. So I’m curious how Starburst is catering to that data product persona, or even generally to these different roles that you’re seeing. Are you suggesting any ways that data teams measure their efficacy more as they adopt DataOps practices?

Colleen: Yeah. And to address that, I would start by talking about data products. At Starburst, we’ve been thinking a lot about how we can enable our customers to build data products. And it’s at the heart of data mesh, but regardless of whether you’re doing data mesh, treating data as a product is paramount these days, right? Like, every business wants to be a data-driven business or a data-enabled business. And so specifically in a Starburst-focused ecosystem, there’s a definition of data product that I like to think about. Starburst means that you’re not building pipelines to move data around, but you still care about things like data quality, and you’re using SQL to access your data wherever it is with this, you know, insanely awesome and powerful and fast query engine. But when thinking about combining the idea of Starburst with the idea of data products, we actually came up with an interface that serves both data product creators and data product consumers. From the creator side, data engineers can use SQL to define their data products as views or materialized views across data sources. And the key to that, though, is that it comes with all the relevant metadata that distinguishes it as a data product. It’s not just a data set or group of data sets; it’s curated and it comes with information, and that includes quality and that includes owners and descriptions and definitions and sample queries and discussion screens and previews and metrics and everything. So in addition to that, we have this really simple user interface that allows people to create those data products, and then also allows people to consume those data products and easily get from that data product into just writing SQL against it or using a BI tool.
And so, you know, we unsurprisingly called this Starburst data products, and it was launched in February and it’s being incredibly well received, because it does answer a question that people are finding challenging to solve: how do I produce and consume data products within a platform? Because it’s not just a data catalog, it’s more than that. It’s using the Starburst engine plus the data catalog.
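To make the idea above concrete, here is a minimal sketch of a data product defined as a SQL view that travels with its own metadata. It is an illustration only and not Starburst's interface: the `DataProduct` class, the `publish` helper, the `orders` table, and every field name are hypothetical, and SQLite from the Python standard library stands in for the query engine.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical container: a SQL view plus the metadata that curates it."""
    name: str
    view_sql: str
    owner: str
    description: str
    sample_queries: list[str] = field(default_factory=list)


def publish(conn: sqlite3.Connection, product: DataProduct) -> None:
    # The product is defined as a view, so no data is copied or moved.
    conn.execute(f"CREATE VIEW {product.name} AS {product.view_sql}")


# Stand-in source data (assumed for the example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "east", 5.0), (3, "west", 7.5)],
)

revenue = DataProduct(
    name="revenue_by_region",
    view_sql="SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region",
    owner="sales-data-team",
    description="Total order revenue per sales region.",
    sample_queries=["SELECT * FROM revenue_by_region ORDER BY region"],
)
publish(conn, revenue)

# A consumer starts from the product's own sample query.
rows = conn.execute(revenue.sample_queries[0]).fetchall()
print(rows)  # [('east', 15.0), ('west', 7.5)]
```

The point of the sketch is that the owner, description, and sample queries live next to the view definition, so a consumer can go from the product to writing SQL against it without asking the creator what it means.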

Josh: So what’s interesting about that, Colleen, is that it sounds like you’re talking about what could actually be the same data asset. In one version of a team and how they handle it, that might be what you’re calling a data set. And after you do some steps, it sounds like that can then become a data product, and you might actually be describing the same CSV file, right? In one version of the world that’s a data set, and in another version of the world that’s a data product. So what exactly is the delta there? When you talk about things like metadata, for example, what’s the formula to take a data set and turn it into a data product?

Colleen: I mean, that’s a really good question, because I think the definition of data products has been a little fluid over the last couple of years. People are starting to understand what the definition is, because it could be as simple as a view, a table, or a flat file in an S3 bucket. But it’s really the addition of that metadata that I think of as the bare minimum to make a data product, right? Like, it has to have all of the information to contextualize the data for the consumer, so that the consumer doesn’t need to go back to the data creator and say, hey, what does this mean? Right? They have all of the information along with the data itself. And so on top of that, you can also say that it should include the infrastructure and the code that goes along with it to create the data product. You can come up with a fully fledged definition of what a data product is that is completely inclusive. But I do think at the very least it has to be the table itself or the data itself, and the metadata, and also an access method. Because, you know, without people being able to simply access it, it’s kind of useless. And so quality and access are really important in my mind. I like to say that, you know, SQL is the lingua franca of data, so for me it’s always SQL.
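The bare-minimum definition above (the data itself, the metadata that contextualizes it, and an access method) can be sketched as a simple completeness check. This is only an illustration under assumed names: the field list, the `missing_fields` helper, and the example bucket path are hypothetical and do not come from any particular catalog or product.

```python
# Hypothetical minimum for calling something a data product:
# where the data lives, context for the consumer, and a way to access it.
REQUIRED_FIELDS = ("data_location", "description", "owner", "access_method")


def missing_fields(candidate: dict) -> list[str]:
    """Return the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not candidate.get(name)]


# The same flat file, before and after it is curated (path is made up).
flat_file = {"data_location": "s3://example-bucket/orders.csv"}
data_product = {
    **flat_file,
    "description": "One row per customer order, refreshed nightly.",
    "owner": "sales-data-team",
    "access_method": "sql",
}

print(missing_fields(flat_file))     # ['description', 'owner', 'access_method']
print(missing_fields(data_product))  # []
```

The two dictionaries point at the same file; what changes is whether a consumer has enough context and a way in without going back to the creator.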

Josh: Is there any distinction that you see with data products between a data set that’s going to be used inside of a business versus a data set that’s a product, or, better word, monetized in some way outside the company? Are both of those data products? Is one of them more a data product than the other? How do you separate those?

Colleen: Yeah, that’s a good question. I think they’re both data products in my mind. It’s just that the consumer is different. So, you know, when you’re curating it, do you have PII in it? That’s something you need to consider. You know, there’s governance that you need to put in and around it, but if you’re monetizing it, then you presumably have a separate consumer than you would internally. And so you need to think about who the consumer is. And I’m a big fan of consumer-aligned as opposed to source-aligned data products. If you’re really thinking of it as a product, think about your end user. You know, this is why you have a product owner, and why you might have a data product manager, right? Someone who’s thinking about what the customer needs and then creating requirements, and treating it like product management in a way.

Ryan: What are some cultural aspects that you run into when you go into organizations and say, hey, we’re going to start treating data as a product, and they’re like, what does that mean? What are you talking about? What are some cultural or political things that you have to navigate when you’re speaking in this type of language and trying to transform the way things are done now?

Colleen: Yeah, I think that’s a really key question, because this is probably the bigger challenge. It’s not the technology. We have all these great technologies that can help you with that, obviously Starburst being the best. But you know.

Ryan: Of course.

Colleen: Obviously. But I do think there’s a cultural and organizational challenge as well, because there’s an ownership aspect to the data, which means that, you know, you need to think about the ownership as well as things like speed and governance and access and all that. And so from a cultural and organizational perspective, you’re now telling folks who are the data producers, who have some other product they’re producing that’s part of their roadmap, that now they also have a data roadmap. So you’re really expanding their scope and you’re giving them a second major initiative. You’re saying you are also responsible for this downstream data. And so organizationally, you need to think about, you know, what that does to your roadmaps. And there’s obviously a management question there, but you need to have people understand that they are responsible for this and that they are producing something that is a product. And so the easier you can make that for people, the better. There’s an organizational component to that, and there’s a cultural component to that, including a cultural component on the consumption side. So yeah, there are different pieces of that. And, you know, people don’t exactly love change, so you’ve got to manage all that.

Ryan: Yeah, I think we’re very change averse. And to your point, too, there are tons of tools out there. I was talking to one of my friends the other day about how it seems like everybody is working for a company that’s exploding or in a space that’s exploding, and data obviously is one of the biggest spaces to be in right now. So you’re not short on tools. And process is a huge part of that. I want to talk about speed, because I had seen some notes here from when we were prepping for this podcast, where we talked about how to get access to data products quickly. What does that really mean in the grand scheme of things when it comes to the speed of how you’re delivering your data as a product? What are the advantages, in terms of speed, that treating data as a product has over the other methodologies that are out there?

Colleen: Yeah, I think there are two things that go together here: one is data as a product and one is the idea of a decentralized data management system. Right? Because you want the people who create the data to own that data and treat it as a product. And previously, what folks would do before they were tasked with treating data as a secondary product was just ship their data off somewhere and say, not my problem anymore. Right? And you would end up with a centralized data team required to take data from all around the organization, try to manage it together, and create data that analysts can consume. And that’s an absolutely huge bottleneck that you’re creating. I’ve worked in many organizations where you try to tease that apart. You try to pull pieces of that out. But ultimately, you’ve got this central team, and you’re trying to get to that holy grail of a single point of truth or a single source of truth for your data. And that doesn’t really work. I’ve never seen it work particularly well, especially as an organization scales, just because it’s too much context for one team, and the people who really understand the data are back on the data-producing teams. And so the idea of data as a product really dovetails into that. The data gets consumed easier and faster because there’s no more central team creating that bottleneck. The data-producing teams are getting their data directly to the consumers without that middle step of having a centralized data team.

Ryan: I always see so many similarities. Every time I talk to experts in the field, like you, Colleen, I hear these same problems: hey, we’re just pushing data to the endpoint and then they have to figure it out, or it’s not my problem anymore after I do this. What’s going on on a data platform team or a data engineering team, or even the whole team together, just seems like it’s running into the same problems that developers had with operations, and the meshing of dev and ops, or DevOps, to fix a lot of these problems. Like, you talk about decentralizing things. That’s exactly what DevOps is trying to do, right? Trying to remove the one bottleneck to releasing software: well, we’ll just do it together, and we don’t need to go through that. So I just find it fascinating that there are so many similarities between the data space and the software space, and that they seem to be very, very close together.

Josh: Piggybacking on that, what do you see as the biggest challenge in data products versus software products? Why can’t you just use the same kinds of techniques and tooling and software on the data side of the world? Or can you? And where do you see the biggest differences there?

Colleen: Yeah, I think you actually can. And it’s funny, because over the last few years we’ve made a big distinction between the two. Data engineer and software engineer have, you know, been driven further and further apart from, like, a hiring perspective. They’re treated as incredibly different people, and they go to different schools and programs and things like that, and they have different skill sets, but at their heart they’re really the same thing, because if you think about it, you know, you’re still using a lot of the same tools. And if you’re thinking of data as a product, they should be very similar, right? You need to operationalize, you need to write production-quality code, you need to have multiple environments. You have to have a team, you have to have all of that. I’ve seen them drift apart over the last few years, and I don’t love that I’m seeing that, right? I think they are very similar, and I do think there is an incredible amount of overlap. In some ways I think they’re coming back together, and I’d love to see them come back together.

Ryan: You know, when Josh and I talk to people at Databand, a lot of times some of them won’t even have data engineer in their title. They’ll have software engineer on the data team. And when you talk to them, it’s exactly what you just said, Colleen. They were very much focused on software development and engineering, and then the business started to do more with data. So they moved into this data engineering role, where they’re taking a lot of the skills that they have and just kind of applying them to the data side.

Colleen: And when I was first hiring data engineers, we called it the data team, but we were putting out job listings for software engineer, comma, data, right? It was a data-focused software engineer, but it’s still Java or Python, you know, Git, DevOps, CI/CD, all that good stuff still comes into play there.

Josh: Similar question, but on the product management side, because I was a product manager before data, and so that’s a place close to my heart. What do you see as the differences between taking, like, a software PM, an application PM, and dropping that person into data products? How should they be thinking about the world differently? I assume there are important differences somewhere. What do you think are the things that would make that person successful versus unsuccessful?

Colleen: I mean, in my mind, a lot of product management is customer research, understanding and prioritizing customer asks, and then coming in and creating requirements and doing testing and all that, you know, working with an engineering team. All of that, I think, is exactly the same. The difference is that unless you’re monetizing your data with an external customer, your clients are often internal folks, right? They’ll be the C-suite or the execs or the analysts or the data scientists, that kind of thing. And so I do think there’s a huge overlap. I think that, you know, it depends how technical a PM you are. Some PMs used to be engineers, and so, you know, I think a PM who used to be a data engineer would be amazing on the data side, right? Somebody who understands pipelines and understands all the different data technologies. But I think it’s an interesting enough space right now that I’d love to see a lot more people going into that field.

Josh: Do you feel like there’s a certain size of organization where this becomes important? Do you think every company, no matter how big or small, as soon as anybody is doing anything with data, needs data product managers and needs to think about this as data products? Or is there a scale at which this becomes relevant?

Colleen: Yeah, I mean, I think if you have a 30-person company, maybe you don’t need a data product manager yet. But I do think that as soon as data becomes part of your strategy, and that should be fairly early on, you need to start thinking about data as a product. And that could be incredibly early on. But as for whether or not you have a data product manager, you know, I’ve worked at startups long enough to know that you often start with the bare bones and the absolute minimum number of people, and you’re all wearing 15 hats. And so, you know, you might not have a specific data product manager, but as companies scale, the amount of data and the volume of data and the scope of data just, you know, outruns you, and you can’t possibly keep track of it and do all of those other 14 jobs. And so you need to have someone who is really focused on that and understands the data that is going to help grow the business.

Ryan: So as we’re wrapping up here, we’ve got about maybe five to eight minutes left. I did want to talk about one other thing, because I’m wearing this hat that I got at Data Council, at the Starburst booth, that says Hot Mesh. And you talk about data mesh a lot. I’ve seen some of your webinars, and you’ve written some articles on the Starburst blog, and one of the things tied to treating data as a product is the idea of the data mesh. Could you explain the pillars of the data mesh, or at a high level what that means in our space today?

Colleen: I am a nerd for data mesh. Data mesh has been around for a few years now, and it’s really just the idea of, like I said, the decentralization of the data organization, you know, no more centralized bottleneck data team, and then data as a product. There are four pillars, and the first one is domain-oriented ownership and architecture, the domain being a group of people who all work together on the same business function. So it could be a team, it could be a department, it could be a business unit. But really it’s just the people who own that data, and that domain owns and architects its own data environment rather than relying on a centralized entity. And then the second pillar is data as a product. I think we’ve probably covered that one pretty well. The third pillar is self-service infrastructure, and the fourth pillar is federated computational governance. So the third pillar, self-service infrastructure, is really all about letting the domains focus on data and not making them have to focus on data infrastructure. So having some sort of centralized I.T. function that’s providing the infrastructure that they can then use in whatever way they see fit. And so it’s a federation of sorts. And just figuring out what the domain should own versus what should be centralized, that’s a key component of this. And then the last thing, federated computational governance: the computational part means everything should be automated as much as possible. And the federated part means, you know, there’s a ton of different aspects to governance: quality, obviously a very important one, entity standardization, data access control, security and compliance, governance risk, all those things. So there are so many different aspects to that. And some are going to be set at the federated level and some are going to be set at the domain level.
And so really understanding where on the spectrum you are for each of those things is important. So if you put all of that together, it’s a paradigm that says, okay, if you follow all these things, you’ll get to this point of having a data mesh. A common misunderstanding is that a data mesh is a thing. There’s no data mesh in a box. You can’t just install a data mesh. I think its creator calls it a sociotechnical paradigm.
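As a rough sketch of the fourth pillar described above, the "computational" part can be read as policies written as automated checks, and the "federated" part as some checks being set for everyone while others are set per domain. Everything here is assumed for illustration: the policy names, the `sales` domain, the metadata fields, and the `violations` helper are hypothetical and are not taken from any data mesh tooling.

```python
from typing import Callable

Policy = Callable[[dict], bool]

# Set at the federated level: every domain's data products must pass these.
GLOBAL_POLICIES: dict[str, Policy] = {
    "has_owner": lambda p: bool(p.get("owner")),
    "pii_is_flagged": lambda p: "contains_pii" in p,
}

# Set at the domain level: each domain can add its own checks.
DOMAIN_POLICIES: dict[str, dict[str, Policy]] = {
    "sales": {
        "fresh_within_24h": lambda p: p.get("hours_since_refresh", float("inf")) <= 24,
    },
}


def violations(product: dict) -> list[str]:
    """Run the global checks plus the checks for the product's own domain."""
    domain_checks = DOMAIN_POLICIES.get(product.get("domain", ""), {})
    policies = {**GLOBAL_POLICIES, **domain_checks}
    return [name for name, check in policies.items() if not check(product)]


# Example metadata for one product (values are made up).
product = {
    "domain": "sales",
    "owner": "sales-data-team",
    "hours_since_refresh": 30,
}
print(violations(product))  # ['pii_is_flagged', 'fresh_within_24h']
```

Deciding which dictionary a given check belongs in is the "where on the spectrum are you" question from the conversation; the code only shows that both levels can be evaluated automatically.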

Ryan: You had a really cool term. It was like socio technical. Yeah. What is that?

Colleen: So the creator of data mesh calls it a sociotechnical paradigm, meaning there’s a social component as well as a technical component. And the idea is that this domain-oriented architecture and organizational ownership means that you might have to move your data engineers from a central team to the domains. You might have to split up that team. You might have to turn some software engineers into data engineers and make them responsible for that. Your product owners and product managers might have to bring data into their purview as part of their scope of responsibility. And so there is a social component to it as well, which is interesting to me. And so data mesh is less of a thing. Like I said, it’s not something you can install and it’s not an end state. It’s really a journey that you’re going on to try to become more decentralized and to essentially reduce the time to value for data, because it’s all about providing better data to the end users.

Ryan: As we close here and come up on time, I want to ask you: what’s the one thing you want the audience to take away from this podcast? If they can only remember one thing, what do you want them to remember?

Colleen: Treat data like a first class product. Yeah. And all the things that it entails.

Ryan: Awesome. And last thing: tell us a little bit about Starburst. I know that you’ve been there for a little while. We like to highlight where people work and show off all the great things you guys are doing over there. What’s it like to work over at Starburst?

Colleen: Yeah, it’s been fun. We are, of course, growing like gangbusters. We’re a rocket ship. All those fun things people say about all these companies these days. Lightspeed. Yeah, but, you know, we just got our D round a couple of months ago, at a $3.5 billion valuation, but we like to say that we’re not unicorns, because unicorns aren’t real, and we’re workhorses instead, like Clydesdales. In fact, our mascot is Gritty the Clydesdale. And she is a unicorn. She has a horn, but she’s still a Clydesdale. But, you know, workhorses work hard alone, but when they work together, their forces multiply. The sum is greater than the parts. And so that says a lot about our ethos. We work hard, we believe in what we’re doing. We care a lot about ownership and character and grit. And I’m making it sound really dry and boring, and I swear it’s not. It’s also a really fun place to work and it’s really motivating. And I deeply believe in what we’re doing, and we’re hiring. So if you’re looking, please contact me.

Ryan: Josh, what’s our mascot? Killer whale? I’m just trying to think of, like, the most.

Colleen: Is it you, Ryan?

Josh: Yeah, narwhal is awesome.

Ryan: I love it. Killer whale is, like, the king of the ocean.

Josh: Yeah, I love the Clydesdale unicorn. I mean, yeah, we’ll think about stealing that.

Colleen: Orcas are apex predators, too. That’s pretty cool.

Colleen: Killer whale. Yeah, well, orcas are like dolphins as well. I don’t know. Let’s riff afterwards, Ryan. We’ll update it in the next podcast.

Ryan: I really like that. I’ve never heard that before, that spin of, hey, we don’t want to be called unicorns because they don’t exist, and we’re workhorses. I like that a lot. That’s really cool. And also, Clydesdales are the ones that carry the Budweiser beer during Christmas, right?

Colleen: So that was one of my first thoughts, the commercial with the puppy. Yeah.

Ryan: All right. Well, cool. Thank you, Colleen, for joining us. Real quick, tell us how people can get connected with you or follow you, and maybe some upcoming events you have coming up.

Colleen: Yeah. I’m just Colleen C. Tartow, I think, on Twitter, and you can find me on LinkedIn. And I also have a data newsletter that my friend and I write on Substack, which is The Sequel, SEQUEL.Substack.com. And I’ve got a lot of writing on the Starburst blog, and, you know, that’s where my writing tends to happen. But yeah, I mean, I’d love for people to reach out. I love nerding out over data.

Ryan: Awesome. Well again, thank you so much for joining the MAD Data Podcast and I hope to talk to you again. Hope to see you at some of the conferences coming up as well.

Colleen: Yeah, absolutely. Thanks for having me. It’s been fun.

Josh: Thanks Colleen.