What is Data Profiling?

Abhilash Marichi
3 min readFeb 27, 2022

--

In this post, I am going to explain what is data profiling? with simple examples. I hope it would be of some help to you.

Photo by Vitaly Vlasov from Pexels

Definition

Data Profiling is the process of understanding the data, its characteristics and quality.

How to do it?

First and foremost before starting the data profiling process we need to finalize the sources that are going to be used in the project. Usually, this would be based on requirements at the target/data warehouse or in the Business Intelligence Report.

Now each of the tables in the source must be profiled. Prioritize your tables in each source based on the criticality or usage and start profiling each one of them.

Data Profiling Reports usually captures the following —

At Column Level,

Statistics

  • Minimum
  • Maximum
  • Mean
  • Mode
  • Percentile
  • Standard Deviation
  • Frequency — How many times a value repeats
  • Variation
  • Sum

Metadata

  • Data Type
  • Length
  • Uniqueness / Duplicates
  • Occurrence of Null values
  • Typical String Patterns

At Table Level,

  • Primary key Analysis
  • Foreign Key Analysis
  • Referential Integrity Analysis

Let’s look into a sample Data Profiling Report from IBM Infosphere Information Analyzer.

The below report shows columns (address, au_firstname) length analysis in the table (authors)

Sample Column Inferences Analysis Report

The below report shows the columns (CarrId, CarrName) Nullability Analysis in the table (Carrier).

Sample Nullability Difference Report

There are many tools that could be used to perform the Data Profiling or you can simply write SQLs and perform the analysis if the source is a database. If you have heterogeneous sources including Excel Files, Text Files, etc. then I recommend using a tool that is efficient and it could help you to speed up the process of Data Profiling.

When to do Data Profiling?

Always perform one round of data profiling before designing the data model for the Data Warehouse because it could change your understanding of the data.

Also, data profiling is not one time process but a continuous process, the developers must profile the data and understand the nature of it before starting to code the ETLs, by understanding data a developer can figure out the anomalies and resolve it early and also could find the inconsistencies in data and perform required transformations to standardize them.

Conclusion

Data Profiling is the first and critical piece of any Data Integration, Data Warehousing, or Data Migration project. Ensure to always understand your data with various Data Profiling methods before using it. It would save you a lot of time, effort, and money. It increases the trust in data among the business decision-makers and helps them to make the right and effective decisions.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

I hope in this short blog post you got a little bit of understanding and got to know the importance of Data Profiling.

Let me know if this helped you by clapping or by dropping a comment below.

Take care, stay safe and meet you in the next one. Bye!

--

--

Abhilash Marichi

Data Engineer at Amazon. I write about Data, Product & Life.