How Databand Achieves Automated Data Lineage
Data lineage seems to be the hot topic for data platform teams. In fact, we’re hosting an upcoming webinar on how data lineage is viewed in the industry and how a more end-to-end approach solves many of its common issues.
In this blog, we’re going to walk through how Databand provides automated data lineage so you can easily diagnose pipeline failures and analyze downstream impacts.
Watch the video to see it in action or continue reading below.
Using automated data lineage typically starts with an alert. You can jump right into a lineage graph, but it’s important to first know why the graph is relevant.
For example, on the Databand alert screen, you can see all the data incidents and their alerts in one view.
This particular example shows that a critical alert fired on our “daily_sales_ingestion” pipeline, a business pipeline that processes our daily sales from SAP, does some transformations for different regions, and then sends the results to a BI layer.
Needless to say, this pipeline is critical for our business since it processes sales across regions and eventually shows the results to the business.
To diagnose the alert, select “view details” to open the alert overview screen.
Understand impacted datasets
Before seeing the lineage graph, you can see the impact analysis across your affected datasets, pipelines, and operations.
View data lineage
Once you’ve seen what has been impacted, you can visualize these impacts by selecting the data lineage tab. This graph shows the dependency relationships between the pipeline that initially failed and everything downstream that is impacted.
For example, we’re looking at tasks that are writing to a particular dataset and that same dataset being read by a subsequent task. All the red text in each pipeline represents anything that was impacted by the initial failed task.
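To make the idea of “a task writes a dataset, another task reads it” concrete, here is a minimal Python sketch of how such read and write relationships could be turned into lineage edges. It is an illustration only, not Databand’s implementation: the function build_lineage_edges, the dataset “SAP daily sales”, and the task “load_na_sales” are hypothetical names made up for this example.

```python
from collections import defaultdict


def build_lineage_edges(tasks):
    """Turn per-task read/write declarations into lineage edges.

    `tasks` maps a task name to {"reads": [...], "writes": [...]}.
    Returns a dict mapping each node to the set of nodes directly downstream.
    """
    edges = defaultdict(set)
    for task, io in tasks.items():
        for dataset in io.get("reads", []):
            edges[dataset].add(task)   # dataset -> task that reads it
        for dataset in io.get("writes", []):
            edges[task].add(dataset)   # task -> dataset it writes
    return dict(edges)


NA_EXTRACT = "S3 – North America Daily SAP Sales Extract"

# Hypothetical declarations; only the first task name and NA_EXTRACT
# come from the walkthrough above.
tasks = {
    "extract_regional_sales_to_S3": {
        "reads": ["SAP daily sales"],
        "writes": [NA_EXTRACT],
    },
    "load_na_sales": {"reads": [NA_EXTRACT], "writes": []},
}

edges = build_lineage_edges(tasks)
```

Because the second task reads the dataset that the first task writes, the two tasks end up connected through that dataset, which is the kind of relationship the lineage graph draws.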
Let’s zoom in on the specific pipeline that failed. Here you can see that the task named “extract_regional_sales_to_S3” caused the pipeline to fail.
By selecting the failed task, you can see which specific downstream datasets or tasks are impacted with a highlighted red box.
Each time you select a different task, the graph will change which boxes display.
For example, if you select the dataset named “S3 – North America Daily SAP Sales Extract”, a lot of red text remains, but the red boxes have changed.
This indicates that the “S3 – North America Daily SAP Sales Extract” dataset only impacts the highlighted red boxes downstream.
You’ll notice that this dataset had no dependencies on a downstream pipeline in the EU or Asia, but does have dependencies in the North America pipeline labeled “na_sentiment_impact_analysis” and the “serve_sales_results_to_bi” pipeline that serves our BI layer.
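The way the red boxes change with the selected node can be thought of as a downstream walk over the lineage graph. The sketch below is one simple way to compute that set with a breadth-first traversal. It is an assumption-laden illustration rather than how Databand does it: the function downstream_of, the EU and Asia dataset names, and the “eu_sales_pipeline” and “asia_sales_pipeline” nodes are hypothetical, and pipelines are treated as single nodes to keep it short.

```python
from collections import deque


def downstream_of(edges, start):
    """Return every node reachable from `start` by following lineage edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


FAILED_TASK = "extract_regional_sales_to_S3"
NA_EXTRACT = "S3 – North America Daily SAP Sales Extract"
EU_EXTRACT = "EU extract (hypothetical)"
ASIA_EXTRACT = "Asia extract (hypothetical)"

# node -> nodes directly downstream; the EU and Asia entries are made up.
edges = {
    FAILED_TASK: {NA_EXTRACT, EU_EXTRACT, ASIA_EXTRACT},
    NA_EXTRACT: {"na_sentiment_impact_analysis", "serve_sales_results_to_bi"},
    EU_EXTRACT: {"eu_sales_pipeline"},
    ASIA_EXTRACT: {"asia_sales_pipeline"},
}

impacted_by_failure = downstream_of(edges, FAILED_TASK)
impacted_by_na_extract = downstream_of(edges, NA_EXTRACT)
```

Selecting the failed task marks everything in the graph as impacted, while selecting the North America dataset narrows the set to the two pipelines that depend on it, mirroring the difference between the red text and the red boxes.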
Quickly debug data incidents
To make debugging easier, you can jump directly to a task from the data lineage graph and see the error that caused the pipeline to fail.
This allows you to quickly debug errors and resolve them before any downstream impacts occur.
Wrapping it up
For more information on how Databand can help you achieve automated data lineage, check out our demo center or book a demo.