Importance of Data Cleaning

Mehrzad M.
2 min readApr 4, 2021

--

There is a chance that during working either on a Spreadsheet or a dataset in any format, your desired functions or formulas haven’t work. You go through StackOverflow, searching throughout Google and even try to re-learn the concepts you know to figure out why what is supposed to work, doesn’t.

A same predicament happened to me during last week as I was trying to work through R Bootcamp project, merging two distinct datasets CPI dataset from Transparency International and HDI dataset from United Nations Development Programme.

The project had nothing to do with the problem I face, since I could easily use the provided dataset and “get on with it”, but personal curiosity made me gather up-to-date data in order to answer my own questions risen from the project.

Preview of CPI 2020 Dataset
Preview of HDI 1990–2020 Dataset

A simple VLOOKUP should have done the trick to merge two separate worksheets on ‘Country’ column, right? Turned out it didn’t! I tried to refresh my knowledge to merging data, as I thought I was missing something or forgot how to make it work, but again, that wasn’t the issue.

At a first glance you might not notice it, but there’s an extra space in front of ‘country’ records. This caused VLOOKUP function to result NA over and over again. Simple TRIM function on the ‘Country’ column did the trick.

Most of the times we trust data from governmental websites or well-known sources, e.g. completeness, consistency, uniformity, etc., but it doesn’t mean you shouldn’t clean your data.

--

--

Mehrzad M.

I write about Data Analytics / Data Science, and my experience as a data enthusiast.