Episode Details

Data Quality: What's Your Plan?

Data quality is the data industry’s holy grail – desired by all but mysteriously elusive. Sam Bail, who speaks frequently on the topic of data quality, joins Sarah Krasnik, Data Engineer at Perpay, to discuss the building blocks, organizational mindset, and strategic planning that are needed to take data quality from theory to practice.

About Our Guests

Sam Bail

Data Consultant

Sam Bail is a data professional in New York City with a passion for turning high quality data into valuable insights. Sam holds a PhD in Computer Science and has worked for several data-centric startups in the healthcare and data infrastructure space.

Sarah Krasnik

Lead Data Engineer, Perpay

Sarah is the Lead Data Engineer at Perpay and an avid tech blogger. Her passion lies in building and scaling modern data infrastructure to enable data-driven decision making, and of course, writing about the process along the way.

Episode Transcript

Honor Hey, Harper, how’s it going?

Harper Hey, I’m just living the dream in Austin, Texas. How’s everything going for you?

Honor Everything’s going good. So today we actually have two amazing guests joining us, Sam Bail and Sarah Krasnik. I’m really excited to have both of you on to talk to us about data quality and actually want to hand it over to you really quickly. Maybe we can start with you, Sam, to tell us a little bit about yourself.

Sam Hey, yeah, I’m Sam Bail. In case you hear an accent, people are always wondering, I’m actually originally from Germany. I’m not Canadian or South African. As some people guess my background is in, I like to call just data things. I’ve done everything from data engineering, data product management, data analytics. I spent five years a company called Flatiron Health, working with third party health care data, which is notoriously messy and chaotic and patchy and really challenging to work with. And that made me so, you know, focused on data quality that actually joined a company called Super Conductive. For a while, working on a tool called Great Expectations, which is an open source data quality framework. So I did that for a while. I did a lot of data quality stuff. And as of last month, I am with a company called Collectors Universe, where I am tasked with building a data platform from scratch. And I’m really excited about doing doing it the right way and actually starting adding data tests and data quality checks and everything in there. So I am very passionate about data quality.

Harper Yeah, congrats. That's a really cool job.

Sam Thank you.

Harper How about you, Sarah, how did you find yourself in the space?

Sarah Yeah, I’m going to jump off of Sam’s kind of background and data things intro. I definitely relate to that. I have been kind of in the data space with a math background originally, but in the data space for about five years now, I’m kind of started more in the general solutions engineering than analytics and data science doubled in sales for a little bit, if anyone can imagine that. But now the data engineering at Purpoee, which is a startup in Philadelphia, which is where I’m based. And as for kind of tooling and background, very focused on infrastructure as well as data quality. And then just like using communication between analytics and other teams to kind of understand raw data and how it relates to the business and the business context around it, so that as an analytics team, we can build things that kind of make sense for the stakeholders. So very excited to talk about quality because I think it’s really important.

Honor Definitely. Well, thank you for sharing that background. Clearly, both of you are passionate about data quality, and I know you've both done and presented talks around, correct me if I'm wrong, a stack of dbt, Airflow, and Great Expectations. So tell us a little bit more about the real-life use cases and stories that inspired this topic for you personally.

Sarah Sure, I can start. For me, if you think about testing: all of the software engineers that I interact with test their product, they write unit tests. It's usually, you know, an OKR to write more tests for something or another. I have always, for the last several years, worked very closely with software engineers and really piggybacked off of that mentality. And data quality to me is even larger than just testing things: it could relate to testing analytics code, Python code, SQL code, kind of everything across the board. But when it comes to data quality, it's understanding how the testing framework fits in and enables trust in the data and trust in a team, so we know that this is what we expect of the data and these are the things that are actually happening. And this definitely emerged for me from silent failures. Silent failures to me are just the worst kind of failures, because they happen, you don't know about it, someone else comes to you about it, and it's something that you should have found. So to me, data quality and data testing enable faith that when there is silence, it's actually a good thing.

Sam It’s always been really interesting for me that in software engineering, test driven development is such a big thing. And you know, at least like at the companies I worked at, you couldn’t really ship a new feature or even get a PR accepted without adding the right tests or without modifying. The tests and tests are just sort of bundled with the code that you write, and it’s pretty like, you know, common sense and pretty just natural for any software engineer to be also worrying about the tests. But with data, it’s sort of, you know, you kind of there’s always something that goes wrong, right? There’s always failures or, you know, your stakeholders at some point find out like, Oh, here’s some, you know, the numbers look kind of weird, but I don’t think also because data engineering has been such a new discipline, you know, and all the tools have just started to sort of gain traction. There is no, let’s call it culture of, you know, data testing and actually paying attention to data quality as part of your workflow when you start building a stack. And that’s a really interesting to me that it’s sort of almost usually it’s like an afterthought, right? Like, Oh, something happens, Oh man, we need tests. No software engineer would be doing that right is like, Oh, our product broke or the feature broke. Oh, I guess we should start writing tests right? Like now you just do it from scratch. You do it right from the start. So I think what’s working for me is to think of your data pipeline as you know, to think of tests as a component of every single data pipeline. Obviously, you can go as deep or broad as you want, but testing just in the same way as you do with with software testing, you know, you could write some high level tests and be done with it, or you could test every single possible thing that happens. I think that always depends on the use case, but it’s super important for me to kind of just make that, you know, part of the pipeline, which is why I talk about dbt, Airflow, Great Expectations. It doesn’t really matter like that. It’s dbt, Airflow and Great Expectations, but it’s sort of it’s your transformation steps, it’s your orchestration and it’s your testing. And that’s the whole package. And that’s how you should think about building data pipelines. Yeah, Sam, I want to kind of hone in on something that you said that I think is really important, which is having this differentiation between software and data teams that I would hope is decreasing, but previously has been that like testing is an afterthought. And I think that’s also very tied to culture. So for example, if if there is a an extremely high priority ask or report or something that needs to be built, it’s it’s almost counterintuitive initially to say, Oh well, I need to build the rapport, but then I also need to spend X amount of hours or X amount of time building the test. But my counterargument to that is always, well, if it’s so high priority, then we should be building those tests to ensure that the data that, you know, every however many people or however many executives are going to be looking at is actually accurate. So it’s almost, you know, it’s it’s like a double edged sword of whether you prioritize initially and delay the ask or if you, you know, get it out as fast as possible. But then you iterate because there are issues. And that’s personally, I never like to be in that position.

Harper First off, I really love the framing Sam had around "data things." That's something that really resonates with a lot of people I talk to in the data community. We tend to have people that come from a lot of different industries; Sarah, you mentioned your math background, which is fairly common as well among practitioners coming into the science and analytics side. And myself, my first engineering role was actually as a quality engineer at a large company. They brought me in and I thought, OK, this is going to be fun, I'm going to work with a big organization and a big repo and really build my skills as an engineer. And within the first month of being there, I realized that, oh no, they need me to come in and talk about data quality. Not just quality engineering; they know how to do that for their software, but they really don't have good guidance on how this should work within the data organization, and how you apply those best practices from software to data in the same manner. And as Honor mentioned, something I always talk about is that data quality is kind of the unspoken secret in the data community: everyone knows they have this problem, but no one really has a good solution to it. So it's exciting to see tools like Great Expectations and even Deequ come out that are creating standardization and frameworks for us to apply the testing Sarah talked about. But I'm curious to get your opinion on why that's such a difficult topic to standardize and get agreement on across the entire community.

Sam From my perspective, I sort of touched on this earlier when I said data engineering is a fairly new discipline. Maybe five or six years ago, I don't think data engineer was necessarily a job description. It existed, but people weren't paying that much attention to it. Data science was the big thing, right? Five, six years ago, everyone wanted to be a data scientist; it was all about machine learning and models. And I think we're starting to shift towards thinking about the data as the key asset, the thing that's actually important, and then when you have good, high-quality data, you can build your models on top. People are becoming more and more aware of that. At the same time, the whole data engineering tooling ecosystem has just been exploding over the past few years. dbt wasn't a thing; tests with dbt out of the box weren't a thing a few years ago. When I started doing stuff at Flatiron in 2014, we were writing everything in-house. We built our own version of dbt, we built our own version of Great Expectations, basically because there wasn't a go-to thing. Back then, any sort of data engineering or ETL framework was big and clunky, and there wasn't a lot of mature, sophisticated open source tooling. And because of that, there wasn't a go-to answer of "if you want to do data testing, this is what you use," the way that, if you write Python code, pytest is the go-to package; there are a bunch of others, but every programming language has its go-to thing for writing tests. There wasn't anything like that for data. And I think that's just now happening: there are a lot of robust, mature tools out there that make it a lot easier, where you don't have to write stuff from scratch and build it in-house. The other thing that's really interesting, and I think Sarah mentioned this a little bit at the beginning: with data testing, you're kind of testing two different things. You're testing the data that comes into your system, and you're testing your transformation code to make sure your transformations actually make sense. That's a big difference from traditional software engineering testing, where you're really just testing the code; everything else is a fixture, and you know your fixtures are correct. With data testing, it adds another layer of complexity: OK, something failed, my column is all nulls. Is it because we wrote buggy code, or is it because wherever we're getting the data from, they changed the schema or something went wrong there? Having those two different aspects makes it a lot harder to know where to get started. So I think there are multiple reasons why this is just now becoming a thing, and why people are still figuring out best practices and where to start, because it's pretty complex.
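A small sketch of the two kinds of tests Sam distinguishes, with hypothetical column names and a toy transformation: the first function tests the data arriving from upstream, the second tests our own transformation code against a fixed input we control.

```python
import pandas as pd


def check_incoming_data(df: pd.DataFrame) -> None:
    """Data test: validate what the upstream source sent, before we touch it.
    A failure here means a conversation with the data producer."""
    assert {"user_id", "signup_date"} <= set(df.columns), "schema changed upstream"
    assert df["user_id"].notnull().any(), "user_id is all nulls upstream"


def daily_signups(df: pd.DataFrame) -> pd.DataFrame:
    """The transformation whose logic we own."""
    return df.groupby("signup_date", as_index=False).agg(
        signups=("user_id", "nunique")
    )


def test_daily_signups_logic() -> None:
    """Code test: run the transformation on a fixed input, so a failure
    can only mean buggy code, never bad data."""
    fixture = pd.DataFrame(
        {"user_id": [1, 1, 2], "signup_date": ["2022-01-01"] * 3}
    )
    out = daily_signups(fixture)
    assert out.loc[0, "signups"] == 2
```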

Sarah Yeah, and just to pivot off of that: with that complexity, there are so many more unknowns in the data space; you don't know what you don't know. And there's just more of that with data, because you don't control what's coming in. The raw data is generated elsewhere. You rely on communication and documentation, but those are points of failure. I also think these unknowns speak to what makes great team members on an analytics team, whether they're analytics engineers, data analysts, or data engineers; it doesn't really matter. It's the concept of taking business context, understanding the business and what you're being asked, and then putting a different hat on, the engineering hat, understanding implementation and the raw data, and marrying those two together to actually deliver a specific data product. If you jump to the software engineering side with pytest, it's only the latter: only the engineering side that tests very specific code, and as long as your code works, as Sam said, you don't really have to worry about anything else. With data, there are so many other things you have to worry about, and that's where the business context comes in. That's a really challenging thing to do, and it's why I think the people who excel in the data space on analytics teams, and the teams that excel, are the ones that find a group of people that all together can marry those two things.

Honor What have you found to be a good jumping-off point for getting the team on the same page, to really have a constructive conversation around this?

Sarah Yeah. I’ll jump in here. I think I’m a very big proponent of targeting it, kind of. Shaping your message to depending on your audience, so, for example, as I’m trying to, let’s say I’m not going with open source and I find a closed source data quality tool that I want to go with if I’m trying to pitch to some executive why we should pay for this tool, I’d structure my request in terms of, well, this is, you know, if we wouldn’t have caught this error if this error occurs, this is how much revenue we would lose. And that’s like a very specific number of very specific things some can tie to. If I’m talking to an analytics team, it would be. This is how many hours without this is how many hours you would spend digging into something that this could have just pointed you directly to that in one minute if we had this framework. And then if it’s for engineering, it’s about, well, this is how the framework works. Here is how it how it’s like something like Pytest, and here’s how it expands on it. And here it’s like something. It’s something like pie test, and here’s how it expands on it. That applies to data, but doesn’t. And this is why engineers haven’t used it, because there are these kind of unknowns.

Sam It is a little bit of a tough place to get started, honestly, because it's like with security, for example, where you kind of don't really have to worry about security if nothing bad ever happens. Everyone knows you should do penetration testing and whatever it is, right? Or maybe even just make sure you don't give write access to every single engineer on your database, and give them read-only instead. But if nothing ever happens, it goes completely unnoticed, and no one really appreciates any of the precautions you're taking. So it's a tough spot: obviously you don't want any quality issues to happen or any dashboards to blow up, but it's hard to justify investing in data quality and spending time writing tests if there's no obvious reason. It's very hypothetical. Someone compared it to health insurance, or to preventative medicine: it's hypothetical because you don't know what's actually going to happen. One thing I noticed working with Great Expectations back in the day, something that's really effective and always fun for getting the aha effect, is to use something like an automated profiler that runs over your dataset, shows you the profile, and automatically generates tests based on it. A lot of times, at implementation hackathons or kickoffs, we just ran Great Expectations over people's datasets, and they looked and went, "Wait, I do have nulls in this column, is that supposed to be there? Or wait, there's a negative value in here, this shouldn't be here." Because a lot of times people don't actually know their data super well, in terms of every single column, every single row. So just getting started, throwing some tests at it, and seeing what happens produces a lot of aha moments, a little bit of a light bulb moment of, wow, there's actually something in our data that I didn't even know. So getting started doesn't have to mean "here are all the errors, and we're going to catch them and do alerting." It's a way for you to get to know your data and document your data, which is also how software tests are occasionally used, right? Your tests contribute to your documentation of what you expect your function to do. Using tests a little bit as documentation, and just to get insights into the data, is actually super interesting. So even in the absence of big dramatic errors and bugs, that usually seems like a very good way to make people understand: we need to know this, because otherwise we won't be able to catch it, and that might cause errors downstream.
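The automated-profiler workflow Sam describes can be approximated in a few lines. To be clear, this is not Great Expectations' actual profiler API, just a toy illustration of the idea: scan a dataset, propose candidate expectations from what is actually in it, and let a human react to the surprises. The `orders.csv` file is a hypothetical input.

```python
import pandas as pd


def propose_expectations(df: pd.DataFrame) -> list:
    """Scan a dataframe and suggest candidate data quality expectations."""
    suggestions = []
    for col in df.columns:
        null_pct = df[col].isnull().mean()
        if null_pct == 0:
            suggestions.append(f"expect `{col}` to never be null")
        else:
            suggestions.append(f"`{col}` is {null_pct:.0%} null. Expected?")
        if pd.api.types.is_numeric_dtype(df[col]):
            lo, hi = df[col].min(), df[col].max()
            suggestions.append(f"expect `{col}` between {lo} and {hi}")
            if lo < 0:
                suggestions.append(f"`{col}` has negative values. Expected?")
    return suggestions


# Running this over a real table is often where the light-bulb moment hits:
# "wait, why are there nulls in this column?"
for line in propose_expectations(pd.read_csv("orders.csv")):
    print(line)
```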

Sarah Yeah, and for anyone listening who's assessing whether you should spend all this time bringing on a tool and building out a framework: it doesn't even have to be that complicated. It could be not only profilers, but just time boxing; I think time boxing is so important. Time box trying to bring something on. It doesn't even have to be that long, it could be a day. And if you find a bug within that time box of a day, then to me that already speaks volumes about what you don't know.

Sam And I can’t give you any data like whatever you do with your data, you will find something in there that’s probably unexpected, right? Yeah. So even if you just spend an hour looking at your data and coming up with hypotheses of like, Oh yeah, this should be, you know, these values in this column should be in this range from like zero through five. And then, oh what? There is a negative number in there. What’s going on? I can guarantee you there will be something wrong that people find, which is always kind of fun.

Harper One of the mantras I tend to bring to the projects I work on relates to what you mentioned about the frameworks: they don't have to be super complex or perfectly created when you start. The mantra is that the best test we can write is the first test. Sitting there trying to figure out the best way to write a test, or what that test should do, creates the analysis paralysis that's often been the downfall of a lot of big data projects in the past. So just getting out there and starting to implement this testing is really the best first step I've come across when it comes to data quality frameworks. But to highlight the theme of complexity that you brought up, Sam, when we were talking about why it's difficult to get these data quality frameworks set up: I love the way you described how with data testing, you have not only the software piece, the transformation side, but also the data testing as well. I'm curious to hear from both of you: when it comes to building out a mature data quality framework, what are the types of characteristics that you're looking to build there? And to start: how important is it to separate your data tests from your software tests, your logical tests? Or do you actually advocate for combining those? I'll go to you first, Sarah.

Sarah Sure. I think it’s more important to kind of to think about it in a way of the action item from an alert. So whether it be infrastructure code or the data itself, if you’re at a test and this is sort of like an “FYI something is changing about the business” and we kind of need to dig into it a little bit more to me, that’s not an alert. That’s more of a, you know, trying to understand something a little broader than the specific thing that you’re testing. However, in any of those three cases, it into me, an alert is there is a bug and we need to go find a down right now, find it. Find the root cause right now, because if we don’t, then that has a cascading effect. So whether that be my infrastructure is down or there’s a bug in my business logic code that is flipping a yes and no value. So we’re showing something is yes when it’s no that has downstream business impacts or whether it’s data quality in terms of some upstream product is letting users do something in a particular application that they just fundamentally shouldn’t be doing. And we’re catching that. We work with another team, with another team to fix that ASAP. But to me, this level of criticality is really important to think about.

Sam Yeah, I think that’s a really good point. I actually wrote a blog post about that at some point that’s called “Your Data Test Failed. And Now What?” So testing is like kind of not very helpful if you don’t have an action plan. So what are you going to do based on that? And I think that’s actually a really good way to separate out the different types of tests, right? So one might be, oh, I actually made a bug in my code, you know, I’m going to go back and fix that and rerun the pipelines and make sure it’s fine. Another one is, Oh, the data is actually incorrect. We’re getting third party data. I don’t know. See, as we file on some as a FTP server and we actually have to go back and talk to the people who produce that data, right? So depending on what the action item is, so you take based on the test success or failure. I think it’s a different, you know, different type of test that you’re writing in a different point in the pipeline where you’re running that. I’m a big fan of separating out data tests and code tests and by, for example, just using like a golden data sets actually using test fixtures, especially during the development process. When you’re writing a data pipeline, you need a fixture writing if your data updates, let’s say, even even just every day, right? I’m writing some code, I’m writing some pipeline code, I’m writing a new transformation or whatever. I’m running it. I look at the results, you know, they make sense and then I run it again tomorrow and I don’t actually like and the the output changes that makes it really hard for me to develop against, like moving a moving target and know that I’m doing the right thing. So actually having a like test dataset like a fixture that I can develop against. And then also, if you do on continuous integration, if you’re running tests, I don’t know, like as part of like your your code review process, for example, you know, you want you want to know whether any kind of issue or any kind of failure came from the data itself or came from the code that you’re writing. So I’m actually a big fan of like golden datasets and stale data sets. The only caveat there is that you might not like you might not have the perfect data set the tests every single thing. Right. So just because your tests pass against your fixture doesn’t necessarily mean that, you know, every single edge case is actually going to go to work out. So that’s the one thing. I think that’s really important to point out there that don’t rely a hundred percent on your fixtures, either. I do also think that fixtures can go beyond testing, where fixtures also allow you to, I think, release analytics on new products faster because you can build a fixture for a case that hasn’t happened yet in production, but you know your engineering team is building that in production. And so you can not just write SQL or Python that are just theoretical on a case, but write it against a fixture. Test it and then release it with at least some sort of some sort of dataset. So that, to me, is also an argument for spending time having a testing framework because it goes beyond data quality and beyond testing. I think it makes your analytics team just generally more productive.

Honor What do you do in situations where it's nearly impossible to gain that level of familiarity with your data, because the sources are ever changing and you're talking about a massive volume of external sources? What is a reliable method for managing data quality when you have zero control over external sources?

Sam I’ll leave that to Sara to answer, because I like her, I love her.

Sarah To me, I don’t think that an analytics team would ever be positioned to have full control because then they wouldn’t just be an Analytics team, they’d be a product team, they’d be an engineering team. There’d be, you know, an advertising marketing team. There’d be all sorts of things. And so I think this I’m just saying that I think this is a great question because I don’t think any analytics, like every analytics team, is going to lack control in some respect or another. To me. There is a fundamental difference in in terms of within data quality, in terms of accuracy of the data and what you expect or don’t expect to happen. And what I mean by that is accuracy is there is no bug in code. This is actually happening somewhere. So all of my fields are null and upstream. Some flag was turned off, right? That actually happened. But if all of my fields are null because one of my aggregations, because I botched one of my aggregations and someone didn’t notice it in the PR, that is just fundamentally not accurate. And so to me, it’s getting to a point where you accurately understand what’s going on and accepting that things are going to happen that you don’t have control over. But as long as you have visibility into it and develop an alerting system where you’re actually notified and can act on it. To me, that’s kind of the best place of this time. The analytics team can can get to where the team is communicating outwards and any issues and asking those questions, as as opposed to those questions coming in like three weeks later, this dashboard looks wrong. I happen to not look at it for three weeks. Why didn’t you catch it?

Harper I feel like I should just sit back and stay out of the way, because everything they're saying is gold as far as I'm concerned. These are all the conversations I've had in my previous organizations, in the various Slack channels, when it comes to how we really answer this question. The complexity is the thing I come back to a lot. I often say that data engineering is a high-context field, and it's hard for any one person to define exactly the right answer. Going into the key elements of the data quality framework: we have the different tests we want between software and data. We talked about the golden dataset, which I'm a big advocate for as well; hopefully you can find some way to scrub production data so you can get it as close to what you're actually going to see as possible, though I know that's a challenge for a lot of people. But then there's the actionable aspect of it. That's the one key part that can't get left out. People know that it needs to occur, but it's kind of an afterthought: let's get these tests running, let's make sure our MRs are going through, let's make sure everything works the way we expect. But OK, once we get that running, how do we make it actionable? What are the right metrics to alert on, who needs to receive those alerts, and who needs to own that? I like what Sarah said about communicating outward with those alerts and making sure you're not just internalizing that information, because it's not useful if your team gets the alert, looks at the issue, and finds a way to resolve it, whether by fixing your code or documenting how to resolve it, like a rollback for data that has gone bad, but it's never communicated outward. Then your stakeholders, your data consumers, don't know that their data was actually out of sync or giving bad information, and you can end up with poor business decisions made off of data they expected to be true but that in the end was not. And that brings the conversation, for me, from the data quality conversation toward the data governance and, you know, DataOps conversations that are going on. So I'm curious what you both think in terms of who needs to be involved in these conversations when you're building out this data quality framework. It's great if the engineering team knows how to implement it; it's great if the data ops team knows how to build out the platform for it. But are there other stakeholders that should really be involved in the process of building it out, and that should be part of the feedback loop, to make sure you're constantly improving and meeting the goals of the business?

Sam Yeah, I think that’s a really important conversation to have, because a lot of times, especially when, you know, coming from a technical background and engineer, you sort of think of like the technology solution as like, OK, this solves a problem. I implement my tests and then everything is great, right? But that’s not how it works. And it’s actually again, like, I think it’s a pretty substantial difference from testing software, where there’s usually just two places where your software is being tested. You run your test during development and then, you know, the test run as part of your continuous integration. And it’s pretty clear what happens if the test fails. Like the person who made that change goes and fixes the code, whereas with like data test failures, it’s sort of, you know who who owns this, what is the what is the you know, what are the actions that you’re taking? What’s the sequence of action? Who needs to be informed? In a lot of times, that gets extremely cross-functional and spans multiple roles, right? There is probably someone on call, maybe, or someone who gets alerted to begin with who sees that issue. And then what happens next? You communicate down to your stakeholders or do you stakeholders already get those notifications? Are you stakeholders, the ones who say, Hey, I got this alert that something’s wrong with the data and then tell you. And then the question is, OK, who’s responsible for then digging into it and figuring out what happened? And a lot of times that means you have to communicate, you know, to the data producers, whoever that is, whether that’s internal or external. There is. There’s a long chain of things in communications that have to happen. And I think you’re, you know, the best data tests, right, are kind of useless if you don’t already have a plan like a strategy for who is actually responsible for what. If you want to go one step further, you can even have, like I said, you can have people on call who are responsible for that. You can have SLAs around that right in terms of this is how we interact across teams. So your stakeholders know, OK, you know, any data quality issues with my dashboard will be addressed within an hour or two hours or a day, depending on criticality. I think Sarah mentioned sort of the level of criticality and that’s part of your like data quality strategy is really having having this action plan, not just implementing testing and, you know, sending them out into the wild would be like, cool, let’s see what happens, but it’s really, really understanding who are the important people that need to be involved in that. And yeah it does get more complex because, like I said, a lot of times it spans multiple teams, right? Just thinking back to my old job, we we had it started really with the engineering team and a DBA team that was working on the system that we were getting data from. Then we have the data platform team. Then we had the analytics team that was writing transformations. And then at some point it got to us where we’re like, Oh man, something’s wrong there. And we had to go all the way up to the engineers and the the database admins right to say, like, did you just change the schema recently? 
And actually, that’s just to pile that on and mention very briefly, I do think having an internal process and collaboration across teams across like data producers and the people who do the analytics to make sure there is communication as to what changes happen, how changes would impact the downstream data, things like that. It’s actually extremely important to that to me, is also part of the data quality strategy is that it’s not just someone produces data, throws it over the fence. And then, you know, the best alerting downstream is sort of you might as well just try and not make the error to begin with right, rather than sort of like catching it downstream because you have very long sort of feedback loops of that. So that’s also one thing I think too to consider is like, how can we actually prevent errors rather than just detecting them?

Sarah Yeah, I completely agree with that, and I actually think that having a process is a great example of slowing down to speed up. If you don't have a process and you don't think about strategy, then, Harper, I think this speaks to something you said earlier: you get into this analysis paralysis where you build tests for absolutely everything under the sun, you spend all of this time doing that, and then someone asks, OK, now what? We've built tests, but who's going to look at them? Who's going to respond to them? What is the actual strategy around this? Establishing a strategy and a process for moving forward is honestly a great place to start, once you understand what you can test and the realm of possibility. Building out a whole testing framework, even for a mid-sized organization, even for a team of five, would take a long time. So onboarding a data quality tool doesn't just start and end with building out tests for everything. To me, the success is having a strategy, knowing where you're going, and then growing. There are going to be new products, there are going to be new test cases; it's not a static thing. Data quality is a living, ever-growing thing, and so the strategy can evolve with it, and the strategy will dictate how people respond and how successfully you can actually resolve any failing tests.

Harper I really agree with what you said in terms of having that strategy and involving different stakeholders, making it a cross-functional effort. For me, the analogy is that having that strategy in place, and getting feedback not only from the engineering teams but from the stakeholders, the consumers, and the business executives who may be making decisions off the dashboards you're ultimately providing, creates the sandbox that you're going to build within. Once you have that sandbox set up, you can create a data quality framework that meets those goals and addresses the SLAs you've set up. And the sandcastle you build within that sandbox is going to look completely different depending on the organization you're in. So I love the idea of starting first with a conversation around strategy with all the stakeholders, and then ensuring that you build a data quality framework that addresses the needs decided upon within that strategy.

Harper What kinds of metrics do you think are relevant in this conversation? Is there a universality to them, or does it go back to very specific business context to measure how we're doing on data quality?

Sarah Sure, I can take this one. In terms of metrics, there are two things I'm thinking about. The first is, when you implement testing, how do you actually implement it and what do you test? The second is the metrics around how successful your testing framework is. To start with the former: I split tests into two categories. The first is formatting tests: this column is a percent and is supposed to be between zero and 100, not between zero and one; this column is supposed to be a string and shouldn't change to an integer; stuff like that, which, if changed, will just break things downstream. It's not really about the data; there's no business logic, it's just the way the world is assumed to be. The second is business logic tests. For example, we know the daily number of signups is supposed to be between X and Y: on average it's roughly there, it's not zero, it's not one, but it's also not 10 trillion. Understanding that is going to be different for each business; a super early stage startup versus a gigantic enterprise will just vary. But that's where understanding your business comes in. As for measuring the testing framework, I think that comes down to what Sam mentioned earlier in terms of SLAs: how quickly can you respond to issues? Are you communicating outwards? And understanding what your stakeholders expect of your data. Do they expect the data to never be more than six hours old? And if they do expect that, how does the testing framework help you achieve it?
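A sketch of Sarah's two categories, with made-up columns and thresholds: formatting tests need no business knowledge, while business logic tests encode what this particular business considers plausible.

```python
import pandas as pd


def formatting_tests(df: pd.DataFrame) -> None:
    # No business knowledge needed; a breakage here breaks things downstream
    # regardless of what the numbers mean.
    assert df["conversion_pct"].between(0, 100).all(), "percent outside 0-100"
    assert pd.api.types.is_integer_dtype(df["signup_count"]), "type drifted from integer"


def business_logic_tests(df: pd.DataFrame) -> None:
    # Thresholds come from knowing this particular business; an early-stage
    # startup and a gigantic enterprise would pick very different numbers.
    daily = df["signup_count"]
    assert daily.mean() > 50, "average daily signups implausibly low"
    assert daily.max() < 100_000, "daily signups implausibly high"
    assert (daily > 0).all(), "a day with zero signups slipped through"
```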

Harper It’s great. Sam, what do you think about that? Do you have other metrics or how you would think about metrics in those terms?

Sarah Yeah, it’s a it’s a tough one because like I said earlier, a lot of times like if you’re, you know, let’s assume your data testing framework is either actually perfect or there’s no issues ever in the data like you just won’t find anything right. And so, so like number of bugs detected or whatever is sort of like a pretty weird metric. I do think, SLAs with the data stakeholders in terms of, yeah, like the data needs to be up to data up-to-date every day or here are like our expectations of what the data should look like and it always looks like this. And then basically just saying, well, either if we accomplish this, if we meet those SLAs, it’s either because our data testing is great and because we find errors and fix them or because or we, let’s say, you know, quarantine rows or quarantine records that don’t meet those criteria or because maybe the data is just really good, but it it’s a little bit of a tough one to say like, you know, we’re doing the right thing because you don’t really know if nothing happens. But I do think you know, your job as a data team is to give your stakeholders good quality data. And whichever way that happens, if it happens, you’re doing a good job. That’s sort of a maybe not super satisfying answer, I guess.

Harper Yeah, I think there’s never going to be like a good, satisfying answer that’s going to address the entire market or all the different use cases when it comes to the various data teams that exist out there, especially when it comes to metrics, it’s hard to really pinpoint exactly what’s going to be relevant for your team or for your business or for your use case without having that domain area knowledge. And again, that’s where the strategy is important, engaging the stakeholders. One tried and true a conversational piece that I use when we talk about metrics is just that. The five years of big data that came out ages ago that has evolved over time where like you look at volume, how much? How big is your data set? The records are coming in, the expected you look at velocity. Is it coming in in the time from the expected to variety comes up in terms of like schema and data. Types of each of those are matching veracity speaks to like the different business or business level I wanted can speak to the business level KPIs that Sarah spoke to in terms of the business value between zero one ownership between zero and 100. And then the last one being value, where you really bring in those custom business metrics that are important to your team. Is the sales data actually reasonable for this particular market of a certain amount of time? But there’s just no there’s no silver bullet here, right? There’s no perfect way to really quantify those without taking the time to engage the entire organization and understanding what you want to achieve with a data quality framework. And then that’s really what can help you to find the metrics that you want to find in the other day.

Harper No, I mean, that’s that definitely makes a lot of sense, and so too. So I wanted to get each of you to maybe offer up like your pro tip on if we were to, if any, did a team listening to this decides. All right. We’re going to want to take action on this and we want to start being really intentional about data quality. What would be the first steps that you would each recommend, regardless of their stack? Like, what? What’s what do we what’s immediately actionable?

Sarah I think immediately actionable is, first, thinking about where we started this conversation: what has happened in the past? You can learn from those mistakes and very quickly understand how you can fix very specific things. And then next, thinking about strategy and process earlier, as opposed to later, is the key to success. What are you trying to achieve? Are you trying to ensure freshness? Are you trying to instill trust? Start from there, then jump to how you're actually going to do this: what tools are you going to use, what frameworks are you going to use? And with that, jumping back to where we started: software engineers have their go-to tools, and although that doesn't exist everywhere super explicitly for the data field yet, I mean, in the entirety of this recording, Sam and I didn't really disagree on anything, right? And we didn't conspire ahead of time or anything. I think that speaks to the fact that the analytics field is evolving, it's moving, and there will be some sort of standard that people work towards. That standard is ever growing, and there are certainly blogs and articles and books on it now. So that would be a great place to start.

Sam I want to address yet another elephant in the room. I think we've sort of said this implicitly, but I haven't made it explicit. The whole thing about having a strategy, engaging stakeholders, coming up with processes: that's not necessarily an engineering type of role, right? Expecting data engineers, analysts, and data scientists to do this whole cross-functional communication, thinking about alerting, engaging stakeholders and processes, that is something actually really well-suited for a data product owner. One of the things I'm very excited about for my team: I'm a data engineering team of one, but as I said, I'm building a data platform, and I want a product manager, someone whose job it is to do just that, to engage stakeholders, to think through processes, the strategy, metrics and KPIs, and that kind of stuff. Because a lot of times that just ends up being something the data engineers or the data team have to do, and it's not necessarily a hundred percent our job, in addition to actually writing the code and doing everything else. So my first step, and I know a lot of data teams won't be able to just say, "Hey, can we have a product manager?", obviously staffing isn't that straightforward, but I would definitely suggest to every data team that's thinking, hey, we want a data quality strategy, or any kind of data strategy, to consider that there might be room for a specialized role, for someone who actually owns that product, just in the same way people own features of software or any other code being built. I think that's something that's often neglected and just dumped onto the data team, and then people are scrambling to make it work. But having someone who's really focused on that can be extremely valuable and extremely beneficial.

Harper There’s not a whole lot I can add here, because I think that Sarah and Sam really nailed what I would, what I would also recommend at the same time. I think that the only thing that I would highlight is it’s important to recognize that data quality doesn’t exist in a vacuum, that it’s something that needs to be addressed throughout the entire organization and it takes communication for that to occur. That outward alerting that Sarah referred to earlier and then really highlighted by Sam, the point that she just made around the fact that if you’re a data engineer, you’re listening to this podcast. Don’t feel like you have to go out and design the whole data strategy for your entire organization, and you’re going to have to do that as well as write the code as well as create this CI/CD pipeline, as well as find the tools. Yeah, so just please hear the fact that what you can do is start the conversation, you know, and so that’s really that’s my actionable step is just start, you know, the first the best as you can right is the first is make sure that that first test is going to meet the goals that you have. Make sure the goals that you have are understood by engaging the organization and understanding what you want to achieve with it. And if you’re an organizational leader, make sure that your data team is set up for success by having the right people in the right roles to do the right tasks. Because at the end of the day, if you have data engineers focusing on what they’re good at, writing code, managing data, owning that and being the keepers of trust in your data and then having product managers that are good at aligning those goals with the rest of the organization. You’re going to have a much more effective data organization at the end of the day and by really by setting that framework up, a setting that structure up within from like a human organization, you’re going to inherently get data quality out of it. Because if you set up a quality human organization, we’re going to work together to create a quality product at the end of the day.

Sam So that was really great, thank you everyone for participating in this conversation. And I'm only just noticing, Harper, that you literally have an elephant in your room. I didn't notice in all our Zoom calls that you have a poster of an elephant in your room, and I feel like that is the conversation starter.

Harper I love it. I love that all the podcast people who listen to this are going to be like, "I can't see the elephant. What does it look like?" We can post a picture of that as well.

Honor Thank you, Sam and Sarah, for joining us today for this conversation. I do want to repeat a quote, Sarah, that you said earlier that I really loved: "slow down to speed up." That is such an important thing to consider as any team tries to implement data quality in a really meaningful way. We look forward to seeing you again very soon. Take care. Goodbye.

Sarah Thank you. Thank you.
