The term “data lineage” has been thrown around a lot over the last few years.
What started as an idea of connecting between datasets quickly became a very confusing term that now gets misused often.
It’s time to put order to the chaos and dig deep into what it really is. Because the answer matters quite a lot. And getting it right matters even more to data organizations.
This article will unpack everything you need to know about data lineage, including:
- What is it?
- What’s the difference between data lineage and data provenance?
- Why is it important?
- What are common data lineage techniques?
- What are data lineage best practices?
- What is end-to-end lineage vs. data at rest lineage?
- What are the benefits of end-to-end data lineage?
- What should you look for in a data lineage tool?
What is data lineage?
Its purpose is to track data throughout its complete lifecycle. It looks at the data source to its end location and notes any changes (including what changed, why it changed, and how it changed) along the way. And it does all of this visually.
Usually, it provides value in two key areas:
- Development process: Knowing what affects what and what could be the impact of making changes.
- Debugging process: Understanding the severity, impact, and root cause of issues.
In general, it makes it possible to identify errors in data, reduce the risk associated with system and process changes, and increase trust in data. All of these are essential at a time when data plays such an integral role in business outcomes and decision-making.
Data lineage in action: A simplified example
When data engineers talk about it, they often imagine a data observability platform that allows them to understand the logical relationship between datasets that are affecting each other in a specific business flow.
In this very simplified example, we can see an ELT:
- Some pipeline tasks, probably running by Airflow, are scraping external data sources and collecting data from there.
- Those tasks are saving the extracted data in the data lake (or warehouse or lakehouse).
- Other tasks, probably SQL jobs orchestrated with DBT, are running transformation on the loaded data. They are querying raw data tables, enriching them, joining between tables, and creating business data – all ready to be used.
- Dashboarding tools such as Tableau, Looker, or Power BI are being used on top of the business data and providing visibility to multiple stakeholders.
What’s the difference between data lineage and data provenance?
Data lineage and data provenance are often viewed as one and the same. While the two are closely related, there is a difference.
Whereas data lineage tracks data throughout the complete lifecycle, data provenance zooms in on the data origin. It provides insight into where data comes from and how it gets created by looking at important details like inputs, entities, systems, and processes for the data.
Data provenance can help with error tracking when understanding data lineage and can also help validate data quality.
Why is it important?
As businesses use more big data in more ways, having confidence in that data becomes increasingly important – just look at Elon Musk’s deal to buy Twitter for an example of trust in data gone wrong. Consumers of that data need to be able to trust in its completeness and accuracy and receive insights in a timely manner. This is where data lineage comes into play.
Data lineage instills this confidence by providing clear information about data origin and how data has moved and changed since then. In particular, it is important to key activities like:
- Data governance: Understanding the details of who has viewed or touched data and how and when it was changed throughout its lifecycle is essential to good data governance. Data lineage provides that understanding to support everything from regulatory compliance to risk management around data breaches. This visibility also helps ensure data is handled in accordance with company policies.
- Data science and data analytics: Data science and data analytics are critical functions for organizations that are using data within their business models, and powering strong data science and analytics programs requires a deep understanding of data. Once again, data lineage offers the necessary transparency into the data lifecycle to allow data scientists and analysts to work with the data and identify its evolutions over time. For instance, data lineage can help train (or re-train) data science models based on new data patterns.
- IT operations: If teams need to introduce new software development processes, update business processes, or adjust data integrations, understanding any impact to data along the way – as well as where data might need to come from to support those processes – is essential. Data lineage not only delivers this visibility, but it can also reduce manual processes associated with teams tracking down this information or working through data silos.
- Strategic decision making: Any organization that relies on data to power strategic business decisions must have complete trust that the data they’re using is accurate, complete, and timely. Data lineage can help instill that confidence by showing a clear picture of where data has come from and what happened to it as it moved from one point to another.
- Diagnosing issues: Should issues arise with data in any way, teams need to be able to identify the cause of the problem quickly so that they can fix it. The visibility provided by data lineage can help make this possible by allowing teams to visualize the path data has taken, including who has touched it and how and when it changed.
What are common techniques?
There are several commonly used techniques for data lineage that collect and store information about data throughout its lifecycle to allow for a visual representation. These techniques include:
- Pattern-based lineage: Evaluates metadata for patterns in tables, columns, and reports rather than relying on any code to perform data lineage. This technique focuses directly on the data (vs. algorithms), making it technology-agnostic; however, it is not always the most accurate technique.
- Self-contained lineage: Tracks data movement and changes in a centralized system, like a data lake that contains data throughout its entire lifecycle. While this technique eliminates the need for any additional tools, it does have a major blind spot to anything that occurs outside of the environment at hand.
- Lineage by data tagging: A transformation engine that tags every movement or change in data allows for lineage by data tagging. The system can then read those tags to visualize the data lineage. Similar to self-contained lineage, this technique only works for contained systems, as the tool used to create the tags will only be able to look within a single environment.
- Lineage by parsing: An advanced form of data lineage that reads logic used to process data. Specifically, it provides end-to-end tracing by reverse engineering data transformation logic. This technique can get complicated quickly, as it requires an understanding of all the programming logic used throughout the data lifecycle (e.g. SQL, ETL, JAVA, XML, etc.).
What are data lineage best practices?
When it comes to introducing and managing data lineage, there are several best practices to keep in mind:
- Automate data lineage extraction: Manual data lineage centered around spreadsheets is no longer an option. Capturing the dynamic nature of data in today’s business environments requires an automated solution that can keep up with the pace of data and reduce the errors associated with manual processes.
- Bring metadata source into data lineage: Systems that handle data, like ETL software and database management tools, all create metadata – or data about the data they handle (meta, right?). Bringing this metadata source into data lineage is critical to gaining visibility into how data was used or changed and where it’s been throughout its lifecycle.
- Communicate with metadata source owners: Staying in close communication with the teams that own metadata management tools is critical. This communication allows for verification of metadata (including its timeliness and accuracy) with the teams that know it best.
- Progressively extract metadata and lineage: Progressive extraction – or extracting metadata and lineage in the same order as it moves through systems – makes it easier to do activities like mapping relationships, connections, and dependencies across the data and systems involved.
- Progressively validate end-to-end data lineage: Validating data lineage is important to make sure everything is running as it should. Doing this validation progressively by starting with high-level system connections, moving to connected datasets, then elements, and finishing off with transformation documentation simplifies the process and allows it to flow more logically.
- Introduce a data catalog: Data catalog software makes it possible to collect data lineage across sources and extract metadata, allowing for end-to-end data lineage.
What is end-to-end lineage vs. data at rest lineage?
When talking about lineage, most conversations usually tackle the scenario of data “in-the-warehouse,” which presumes everything is occurring in a contained data warehouse or data lake. In these cases, it monitors data executions that are performed on specific or multiple tables to extract the relationship within or among them.
At Databand, we refer to this as “data at rest lineage,” since it observes the data after it was already loaded into the warehouse.
This data at rest lineage can be troublesome for modern data organizations, which typically have a variety of stakeholders (think: data scientist, analyst, end customer), each of which has very specific outcomes they’re optimizing toward. As a result, they each have different technologies, processes, and priorities and are usually siloed from one another. Data at rest lineage that looks at data within a specific data warehouse or data lake typically doesn’t work across these silos or data integrations.
Instead, what organizations need is end-to-end data lineage, which looks at how data moves across data warehouses and data lakes to show the true, complete picture.
Consider the case of a data engineer who owns end-to-end processes within dozens of dags in different technologies. If that engineer encounters corrupted data, they want to know the root cause. They want to be able to proactively catch issues before they land on business dashboards and to track the health of the different sources on which they rely. Essentially, they want to be able to monitor the real flow of the data.
With this type of end-to-end lineage, they could see that a SQL query has introduced corrupted data to a column in a different table or that a DBT test failure has affected other analysts’ dashboards. In doing so, end-to-end lineage captures data in motion, resulting in a visual similar to the following:
What are the benefits of end-to-end data lineage?
Modern organizations need true end-to-end lineage because it’s no longer enough just to monitor a small part of the pipeline. While data at rest lineage is easy to integrate, it provides very low observability across the entire system.
Additionally, data at rest lineage is limited across development languages and technologies. If everything is SQL-based, that’s one thing. But the reality is, modern data teams will use a variety of languages and technologies for different needs that don’t get covered with the more siloed approach.
As if that wasn’t enough, most of the issues with data happen before it ever reaches the data warehouse, but data at rest lineage won’t capture those issues. If teams did have that visibility though, they could catch issues sooner and proactively protect business data from corruption.
End-to-end data lineage solves these challenges and delivers several notable benefits, including:
- Clear visibility on impact: If there’s a schema change in the external API from which Python fetches data, teams need true end-to-end visibility to know which business dashboard will be affected. Gaining that visibility requires understanding the path of data in motion across environments and systems – something only end-to-end data lineage that tracks data in motion can provide.
- Understanding of root cause: By the time an issue hits a table used by analysts, the problem is already well underway, stemming from further back in the data lifecycle. With data at rest lineage, it’s only possible to see what’s happening in that particular table, though – which isn’t helpful for identifying the cause of the issue. End-to-end lineage, on the other hand, can look across the complete lifecycle to provide clarity into the root cause of issues, wherever they turn up.
- Ability to connect between pipelines and datasets: In a very complex environment where thousands of pipelines (or more!) are writing and reading data from thousands of datasets, the ability to identify which pipeline is working on a weekly, daily, or hourly bases and with which tables (or even specific columns within tables) is a true game-changer.
What should you look for in a data lineage tool?
As it becomes increasingly important, what should you look for in a data lineage tool?
Above all else, you need a tool that can power end-to-end data lineage (vs. data at rest lineage). You also need a solution that can automate the process, as manual data lineage simply won’t cut it anymore.
With those prerequisites in mind, other capabilities to consider when evaluating a data lineage tool include:
- Alerts: Automated alerts should allow you to not just identify that an incident has occurred, but gain context on that incident before jumping into the details. This context might include high-level details like the data pipeline experiencing an issue and the severity of the issue.
- View of affected datasets: The ability to see all of the datasets impacted by a particular issue in a single, birds-eye view is helpful to understanding the effect on operations and the severity of the issue.
- Visual of data lineage: Visualizing data lineage by seeing a graph of relationships between the data pipeline experiencing the issue and its dependencies allows you to gain a deeper understanding of what’s happening and what’s affected as a result. The ability to click into tasks and see the dependencies and impact to each one for a given task provides even more clarity when it comes to issue resolution.
- Debugging within tasks: Finally, the ability to see specific errors within specific tasks allows for quick debugging of issues for faster resolution.
Getting it right
Data lineage isn’t a new concept, but it is one that’s often misunderstood. However, as data becomes more critical to more areas of business, getting it right is increasingly important.
It requires an understanding of exactly what data lineage is and why it’s so important. Additionally, it requires a thoughtful approach to addressing data lineage that matches the needs of a modern data organization – which means true end-to-end data lineage. And finally, it requires the right tool to support this end-to-end lineage in an automated way.