What is data cleaning? | Datasketch
Logo Datasketch
Blog > Data Journalism > What is data cleaning?

What is data cleaning?

We told you the essential elements to organize and clean your data, the Tidy structure, and why we recommend you use it.

Available in:

Data cleaning cover

By Nicolás Barahona. Published: August 10, 2021

Data cleaning consists of preparing your data for proper use and analysis.

How to clean data?

The first step to cleaning your data is to have the correct format. Ideally, the data should be in Excel or CSV. In addition, it should not be an image or other that cannot be easily reused, such as PDF.

If you want to extract information from an image, like a photo of the vaccination table published by your government, you can use free tools like www.onlineocr.net. In the case of PDF, you can use our app and, in just four steps, obtain the text that you can’t manipulate before.

After you have the information in the correct format, you must organize it, especially when you face the two more frequent cases.

The first is when you have to work with a database that you did not build. Here are some examples:

  • Your country’s foreign trade authority gives you a data table on the most exported foods in the last five years.
  • A human rights organization consolidates information on child abuse during the Covid-19 pandemic.
  • Your boss gives you an Excel document with the organization’s acquisitions and sales.
  • You have downloaded the Billboard Hot 100 chart from Wikipedia.

The second one is you are building a database from zero.

For both cases, it is elemental to identify the variables that make up the information.

Let’s take the example of the foreign trade authority data table. Assume this is the information you have:

Product Banana Coffee Rice Apples Pears
Tons 120 100 82 32 15
Main destination United States The United Kingdom France United States Belgium

Here you have three variables or categories: Product, Tons, and Main Destination. We know they are because, beyond the name, they give meaning to the information in the table. They have only one problem: they are not organized properly.

Tidy Data structure

The best structure that you have to use is Tidy. It consists of each variable or category be a column, and each piece of information be located in a row.

Applying it, the previous example would remain like this:

Product Tons Main destination
Banana 120 United States
Coffee 100 The United Kingdom
Rice 82 France
Apples 32 United States
Pears 15 Belgium

Here we see that the categories are in the column’s beginning, are not modified, and serve as a guide to locating a determined type of information. That allows to filter the data, analyze it better, and visualized it correctly.

Let’s see how a table could be prepared if you were creating a database from zero. Think you are a historian, and you are researching the principal political groups from a region.

The information you collected has different attributes like the group’s name, leader, and ideology. You decided to put it in columns while located the groups in the rows. Front the name. You filled the cells like this:

Name Communist Socialist Fascist Leader
Blue x Juan
Green x Pedro
Orange x Alberto

When viewed, the information displayed in this way can be confusing, and the process of filtering and viewing becomes difficult.

It changes a lot with the Tidy structure.

Name Ideology Leader
Blue Socialist Juan
Green Communist Pedro
Orange Socialist Alberto

What we have just do applies to complex and straightforward cases, like organize the public contracting records that, as a minimum, have more than ten variables.

Once we use the Tidy structure, it is necessary to standardize the values. That is, the information in the rows must follow the same format. For example, if you have a variable with the gender of people, you must decide how the information will be presented. You cannot put in one row the letter F, in another the word “Male,” and in another “Female.” Another case, with weights. You cannot put Kg and “Kilograms.” You must choose only one option because although they mean the same thing, they mean different things for computers.

A few last tips

With this explanation, you can start cleaning your databases. Now we leave you some final recommendations for you to apply in the process.

  • The column headings should occupy a single row.
  • Each variable should be in only one column. For example, you should record in one the first name, the other the last name, and the other the age of a person.
  • Each column must have the same data type. If you talk about names, dates, ages, or money, they have different data types. The first is an alphabetic data type, the second is date formatted, and the last two have numeric data. You can set this up in Excel or Google Sheets.
  • Each data record is a row. For example, if you have the price of a product, it is ideal to have the number in one column and the type of currency in the other. Thus, each record should make the table grow with new rows and not with new columns.
  • There should be no hidden variables in the cells.

Remember that by having the data clean and tidy, you can interrogate and solve questions with it. That gives you a vast number of possibilities to explore and explain your world.

This post is based on a very detailed Datasketch guide that we will be publishing very soon, stay tuned!

In these four videos, you can learn more about Tidy Data, databases' principles, and how to clean them with Google Sheets.