Basics of Data Cleansing

Abhilash Marichi
3 min read · Apr 24, 2021

Most data engineers spend about 50% or more of their time preparing data. It is a boring, time-consuming chore, but a data engineer has to do it, no matter what, to end up with a dependable result.

Why is it important?

In the real world, data is rarely handled well. When we pull data together from multiple systems, it arrives in different formats, riddled with junk values that make no sense, unusual special characters, and so on.

To make data usable, it must be cleaned first. Only once the data is cleaned can you start trusting it. If you skip the cleaning step, your project might just end up a disaster, so never skip or rush through this cleansing process.

What are the common data cleansing activities?

1. Cleaning data with invalid types

Let's say there is a column that stores Age in your system. Usually it is an integer column holding values like '3', but in some source systems Age might be stored as a string like "3 Years 10 Months". So, if your target system expects an integer, you will have to clean the age field to pull out only the integer value.
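Here is a minimal sketch of that kind of cleanup in Python; the parse_age_years helper and the sample values are my own illustration, not part of any standard library:

```python
import re

def parse_age_years(raw) -> int | None:
    """Pull the leading whole-year value out of an age field that may
    arrive as '3' from one system and '3 Years 10 Months' from another."""
    match = re.match(r"\s*(\d+)", str(raw))
    return int(match.group(1)) if match else None

for value in ["3", "3 Years 10 Months", "N/A"]:
    print(repr(value), "->", parse_age_years(value))
# '3' -> 3
# '3 Years 10 Months' -> 3
# 'N/A' -> None  (no digits to extract, so flag or default it)
```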

2. Ensuring the data is in range

Let's say there is a table that stores the details of your current customers who made an online purchase this month, including their date of birth. If you find a date of birth of 07/04/1918, the value matches your data type but could still be invalid, because hardly anyone lives for over a century! But if it's Captain America, then you may have to handle that special case separately!
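One way to flag such out-of-range values, sketched in Python (MAX_AGE_YEARS is an assumed cutoff you would agree on with the business):

```python
from datetime import date

MAX_AGE_YEARS = 100  # assumed cutoff; adjust to your business's tolerance

def is_plausible_dob(dob: date, today: date | None = None) -> bool:
    """Return True if the date of birth implies an age between 0 and the cutoff."""
    today = today or date.today()
    age_years = (today - dob).days / 365.25
    return 0 <= age_years <= MAX_AGE_YEARS

print(is_plausible_dob(date(1918, 7, 4)))   # False: over a century old
print(is_plausible_dob(date(1990, 6, 15)))  # True
```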


3. Unique and Not Null

Ensure that important columns that cannot be null have data for every record, and that columns like IDs that are supposed to be unique contain no duplicates. This issue usually occurs when the source you are reading from is a file or a table that does not enforce the required constraints.
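A quick way to surface both problems with pandas (the sample DataFrame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, None],   # should be unique and not null
    "email": ["a@x.com", None, "c@x.com", "d@x.com", "e@x.com"],
})

# Records missing a required key
null_ids = df[df["customer_id"].isna()]

# Records whose supposedly unique key appears more than once
dupe_ids = df[df["customer_id"].notna() & df["customer_id"].duplicated(keep=False)]

print(null_ids)   # the row with a missing customer_id
print(dupe_ids)   # both rows sharing customer_id 2
```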

4. Foreign Keys

Make sure to validate foreign key fields, or dependent member fields, as per the business: every value in a dependent table should point to an existing record in its parent table.
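A sketch of such a referential check in pandas, with made-up customers and orders tables:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 2, 99]})  # 99 has no parent record

# Orders whose foreign key points at a customer that does not exist
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print(orphans)  # order 12 references the unknown customer 99
```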

5. Data Patterns

A simple example of this would be a phone number: there might be a standard pattern set for phone numbers in your system, like +1 (555) 238-7985, and every value entered in this field must match the defined pattern.
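For instance, a regular expression for that exact format (the pattern below assumes every number is a US number written as +1 (NNN) NNN-NNNN):

```python
import re

PHONE_PATTERN = re.compile(r"^\+1 \(\d{3}\) \d{3}-\d{4}$")

for phone in ["+1 (555) 238-7985", "555-238-7985", "+1 (555) 2387985"]:
    print(phone, "->", bool(PHONE_PATTERN.match(phone)))
# +1 (555) 238-7985 -> True
# 555-238-7985 -> False
# +1 (555) 2387985 -> False
```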

6. Business Rules

Every business has rules, and some real-world scenarios that simply can never occur, and data must be validated against those business rules. For example, if you are capturing the weekly logged hours of each of your employees and you find that some employees have logged more than 50 hours, while your company policy does not allow employees to work more than 50 hours a week, then that is a data anomaly.
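Continuing that example, a simple rule check in pandas (the 50-hour limit and the timesheet rows are assumptions taken from the scenario above):

```python
import pandas as pd

MAX_WEEKLY_HOURS = 50  # company policy from the example

timesheets = pd.DataFrame({
    "employee": ["ann", "bob", "cara"],
    "week": ["2021-W16"] * 3,
    "hours": [38, 55, 50],
})

# Rows that break the business rule are anomalies to investigate upstream
violations = timesheets[timesheets["hours"] > MAX_WEEKLY_HOURS]
print(violations)  # bob logged 55 hours, above the allowed 50
```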

Clean data always creates a huge positive impact downstream. Accurate analysis and reports lead to better business decisions, which lead to the growth of the business, more profit for your company, and, in turn, maybe even a salary hike! :) So it is very important. Do it with care! :)

I hope this post has helped you in some way! Do let me know if you liked it with a clap! :) Bye!

Abhilash Marichi

Data Engineer at Amazon. I write about Data, Product & Life.