Blog > Data Journalism > What is data cleaning?

What is data cleaning?

We told you the essential elements to organize and clean your data, the Tidy structure, and why we recommend you use it.

Available in:

Español

Data cleaning cover

By Nicolás Barahona. Published: August 10, 2021

Data cleaning consists of preparing your data for proper use and analysis.

How to clean data?

The first step to cleaning your data is to have the correct format. Ideally, the data should be in Excel or CSV. In addition, it should not be an image or other that cannot be easily reused, such as PDF.

If you want to extract information from an image, like a photo of the vaccination table published by your government, you can use free tools like www.onlineocr.net. In the case of PDF, you can use our app and, in just four steps, obtain the text that you can’t manipulate before.

After you have the information in the correct format, you must organize it, especially when you face the two more frequent cases.

The first is when you have to work with a database that you did not build. Here are some examples:

Your country’s foreign trade authority gives you a data table on the most exported foods in the last five years.
A human rights organization consolidates information on child abuse during the Covid-19 pandemic.
Your boss gives you an Excel document with the organization’s acquisitions and sales.
You have downloaded the Billboard Hot 100 chart from Wikipedia.

The second one is you are building a database from zero.

For both cases, it is elemental to identify the variables that make up the information.

Let’s take the example of the foreign trade authority data table. Assume this is the information you have:

Product	Banana	Coffee	Rice	Apples	Pears
Tons	120	100	82	32	15
Main destination	United States	The United Kingdom	France	United States	Belgium

Here you have three variables or categories: Product, Tons, and Main Destination. We know they are because, beyond the name, they give meaning to the information in the table. They have only one problem: they are not organized properly.

Tidy Data structure

The best structure that you have to use is Tidy. It consists of each variable or category be a column, and each piece of information be located in a row.

Applying it, the previous example would remain like this:

Product	Tons	Main destination
Banana	120	United States
Coffee	100	The United Kingdom
Rice	82	France
Apples	32	United States
Pears	15	Belgium

Here we see that the categories are in the column’s beginning, are not modified, and serve as a guide to locating a determined type of information. That allows to filter the data, analyze it better, and visualized it correctly.

Let’s see how a table could be prepared if you were creating a database from zero. Think you are a historian, and you are researching the principal political groups from a region.

The information you collected has different attributes like the group’s name, leader, and ideology. You decided to put it in columns while located the groups in the rows. Front the name. You filled the cells like this:

Name	Communist	Socialist	Fascist	Leader
Blue		x		Juan
Green	x			Pedro
Orange		x		Alberto

When viewed, the information displayed in this way can be confusing, and the process of filtering and viewing becomes difficult.

It changes a lot with the Tidy structure.

Name	Ideology	Leader
Blue	Socialist	Juan
Green	Communist	Pedro
Orange	Socialist	Alberto

What we have just do applies to complex and straightforward cases, like organize the public contracting records that, as a minimum, have more than ten variables.

Once we use the Tidy structure, it is necessary to standardize the values. That is, the information in the rows must follow the same format. For example, if you have a variable with the gender of people, you must decide how the information will be presented. You cannot put in one row the letter F, in another the word “Male,” and in another “Female.” Another case, with weights. You cannot put Kg and “Kilograms.” You must choose only one option because although they mean the same thing, they mean different things for computers.

A few last tips

With this explanation, you can start cleaning your databases. Now we leave you some final recommendations for you to apply in the process.

The column headings should occupy a single row.
Each variable should be in only one column. For example, you should record in one the first name, the other the last name, and the other the age of a person.
Each column must have the same data type. If you talk about names, dates, ages, or money, they have different data types. The first is an alphabetic data type, the second is date formatted, and the last two have numeric data. You can set this up in Excel or Google Sheets.
Each data record is a row. For example, if you have the price of a product, it is ideal to have the number in one column and the type of currency in the other. Thus, each record should make the table grow with new rows and not with new columns.
There should be no hidden variables in the cells.

Remember that by having the data clean and tidy, you can interrogate and solve questions with it. That gives you a vast number of possibilities to explore and explain your world.

This post is based on a very detailed Datasketch guide that we will be publishing very soon, stay tuned!

Links of interest

In these four videos, you can learn more about Tidy Data, databases' principles, and how to clean them with Google Sheets.

BSG Institute. January 10, 2020. Video: What is Tidy Data?. (en)

https://www.youtube.com/watch?v=kt8qwxVOvTo
Denys07M. November 9, 2017. Video: Principles of Tidy Data. (en)https://www.youtube.com/watch?v=oQuupzfX9OQ
Hadley Wickham. Tidy Data. Journal of Statistical Software. (en) https://vita.had.co.nz/papers/tidy-data.pdf
Knight Center Courses. October 15, 2019. Video: Data Preparation - Cleaning Data with Google Sheets.(en) https://www.youtube.com/watch?v=YnN2kUWIr48