Data Lineage — Simple explanation
If you are new to the data world and heard people talking about Data Lineage and wondering what it is then in this blog post, I am trying to explain what is Data Lineage and where it is used and how it helps.
Definition
Data Lineage is
- where data begins,
- what happens to it over time,
- where it moves over time,
- and where it ends up.
Let me explain it to you with an example —
Assume that there is a chain of 10 stores and there is a Business Intelligence report that shows the total number of sales across all the stores in a day.
Now we total sales in the sum of sales across all the stores.
total sales = Total Sales in store 1 + Total Sales in-store 2+ …. + Total sales in-store 10.
In the real world, there could be a data warehouse that captures all the sales information for each day in a Fact table called Sales_Fact.
So Sales_Fact will have data stored in the below format
The sum of all total_sale_price would be 1,840 which will be shown in the BI report.
So from where do these data land in Sales_fact? It could be from 10 different staging tables assuming one stage table per store.
And how these stage tables are loaded? It is from the OLTP Database that is used by each of these stores.
Well, we just found the lineage of total_sales data.
It would be depicted by the below picture. It can go much deeper and it would include all the transformation that has happened on the data etc. But for simplicity's sake, I am showing only the flow.
Why is it needed?
Commonly data lineage would be needed whenever there is a change in the source system or if there are any changes required at any stage of data we will be able to track where exactly it is going to affect.
Using data lineage, testers and QAs can actually find where exactly the data needs to be tested and where only regression testing is required.
It helps in auditing and compliance analysis of the system to ensure that the data has not been tampered with.
Conclusion
Some projects tend to ignore the tools that are necessary for capturing the data lineage while building a data warehouse and usually rely on the ETL developers heavily and usually, this is an overhead for the developers. Using a tool that could provide the data lineage for the business needs will save time and effort. It will also increase the speed of debugging process.
I hope in this short blog post, I have tried to explain in simple terms what data lineage is and how it helps.
If you are using any tool for the Data Lineage process in your project do let me know in the comment.
Take care, stay safe and meet you in the next one.