Data is one of today's most relevant resources, and it comes in many different formats. Data preprocessing is an essential step to turn that data into valuable information that machine learning processes can interpret. This article introduces preprocessing concepts and explores several techniques, such as data cleaning, reduction, wrangling, and editing, to enhance performance and get a better analysis.
What is data preprocessing?
By the book, it is the step where data gets transformed to be easily interpreted by the machine learning algorithms. But why transform data?
In Machine Learning, when we think about data, we usually consider a large dataset with many rows and columns.
However, it is never perfect when we talk about Real-World Data (RWD).
There has never been so much data available; however, quantity alone does not mean much. Quality depends on what you want the data for and the questions you need to answer.
What can be wrong with data?
Data is usually taken from multiple sources, in different forms such as tabular, images, and audio, and it is typically not very reliable. So, when we work with RWD, we may find issues like missing values, noise, outliers, duplicate values, inconsistent values, and more. It then becomes hard to get a good analysis without applying any preprocessing techniques.
We may start to wonder whether so much data is really necessary, since more data does not always improve the results and can also increase the number of samples affected by the issues discussed above. That is where preprocessing becomes essential, as we will see below.
Data preprocessing
Working with raw data can lead to many difficulties, especially when we have poor data, which can cause algorithms to fail. Quantity and quality are different things, and sometimes less means more; taking the wrong or bad quality data can be detrimental to the analysis.
It is essential to look for inconsistencies, noise, and missing values, and to check for different variable values that convey the same meaning. For example, in a gender column, the values M and Masculine both mean the masculine gender.
We also need to look for suspicious values like a negative price for a house or an impossible value for age, etc.
In addition, we may need to remove duplicate entries, drop columns where too many values are missing (say, more than 50% of the entries), and remove columns that have the same value for every object, since they are just noise for the analysis.
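As a quick illustration of these checks, here is a minimal sketch run on a small toy DataFrame invented for this example; the column names and values are made up, and the 50% threshold is only a common convention:
-- CODE language-python line-numbers --
import numpy as np
import pandas as pd
# toy data, invented only to illustrate the checks above
toy = pd.DataFrame({
    'gender': ['M', 'Masculine', 'F', 'F', 'F'],
    'price': [350000, -1000, np.nan, 420000, 420000],
    'constant': [1, 1, 1, 1, 1],
    'sparse': [np.nan, np.nan, 7, np.nan, np.nan],
})
# inconsistent labels: 'M' and 'Masculine' convey the same meaning
print(toy.gender.unique())
# suspicious values, e.g. a negative price
print(toy[toy.price < 0])
# duplicated entries
print('Duplicates:', toy.duplicated().sum())
# columns where more than 50% of the entries are missing
missing = toy.isna().mean() * 100
print(missing[missing > 50])
# columns with the same value for every row (pure noise)
print([c for c in toy.columns if toy[c].nunique(dropna=False) == 1])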
Are you ready to find out more? Let’s get started!
Raw data
For the examples below, we will use the tabular dataset Melbourne Housing Market from Kaggle and the JupyterLab interface to perform the analysis with Python3. The dataset includes address, suburb, rooms, bathroom, area, price, distance from CBD (Central Business District), and more.
In addition, it is an excellent dataset for practicing data preprocessing techniques, since it exhibits the most common issues related to RWD. We will load the Melbourne Housing Market data, a tabular dataset, into pandas to explore, clean, and process it.
In pandas, a data table is called a DataFrame and allows us to manage and take advantage of the data using the concept of rows and columns.
To understand the type of data available and identify potential issues that we will need to deal with, it is good practice to describe the data, as in the example below:
-- CODE language-python line-numbers --
import pandas as pd
df = pd.read_csv('Melbourne_housing.csv')
pd.set_option('display.float_format', lambda x: '%.2f' % x)
df.describe(include='all')
https://gist.github.com/leite-paulohf/2d6eef158e7e8097b2ceb29ffb2cea8c
Can you see what is wrong with the data?
Take a look at the count row: it tells us how many non-null entries each column has. As you can see, several columns, such as Price, Bathroom, and others, have counts well below the total number of rows, which means that we have a missing values issue.
In addition, comparing the unique and count values for some columns shows a huge gap between them. Could that be related to duplicated values or some other inconsistency? With so many numeric columns, could there be outliers? And will the different data scales cause any trouble? Also, understanding which questions are essential to answer makes it easier to decide which columns should be kept for the analysis.
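For example, the duplicated-values hypothesis can be checked directly; here is a quick sketch that continues from the DataFrame loaded above:
-- CODE language-python line-numbers --
# compare non-null counts and unique counts column by column
print(pd.DataFrame({'count': df.count(), 'unique': df.nunique()}))
# rows that are exact copies of another row
print('Duplicated rows:', df.duplicated().sum())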
This quick look at the data already shows how many hypotheses we can raise before doing any real work.
Data cleaning & reduction
Data cleaning is the process of removing or modifying data according to necessity. It's used to avoid issues related to outliers, missing values, irrelevant data, duplicated values, and more.
Data reduction can be applied to reduce the dataset dimensionality or numerosity.
Dimensionality reduction is used to get fewer variables, that is, fewer columns, while numerosity reduction is used to get fewer rows.
In this step, it is important to consider what you really need and what you can cope with.
Let's see if we can remove any data that is irrelevant for understanding which side of town is the expensive one.
-- CODE language-python line-numbers --
# drop columns considered irrelevant for this analysis
df = df.drop(columns=['Address', 'Method', 'Postcode', 'SellerG', 'Date', 'Lattitude', 'Longtitude'])
df.shape
# output:
# (34857, 14)
As you can see, a dimensionality reduction was applied to remove the data that was considered irrelevant for this analysis. However, it does not mean that it will be the final dataset or that we cannot come back here to revalidate some decisions to improve the results.
For example, to support a decision about removing possibly irrelevant data, we can verify the linearity between variables, i.e., strongly correlated variables may contribute equally to the analysis, and we can also apply Principal Component Analysis (PCA).
When we try to create a predictive model that can answer a question from the data, we need to understand that a target is needed.
For the exercise proposed here, the Regionname column can be used as the target, i.e., the column that contains the answer to our question.
-- CODE language-python line-numbers --
print(df.Regionname.unique())
# output:
# ['Northern Metropolitan', 'Western Metropolitan',
# 'Southern Metropolitan', 'Eastern Metropolitan',
# 'South-Eastern Metropolitan', 'Eastern Victoria',
# 'Northern Victoria', 'Western Victoria', nan]
Considering this, we cannot have NaN values in this column, so the following example drops these entries:
-- CODE language-python line-numbers --
import seaborn as sns
# drop nan values
df = df[df.Regionname.notna()]
# plot count chart
sns.set(rc={'figure.figsize':(20,5)})
ax = sns.countplot(x="Regionname", data=df, order=df.Regionname.value_counts().index)
for p, label in zip(ax.patches, df.Regionname.value_counts()):
    ax.annotate(label, (p.get_x(), p.get_height()))
Check the Regionname column again: the seaborn chart shows that we have an imbalance problem between the target classes. Class imbalance is a vast topic that I will skip in this article; however, feel free to reach out if you would like to know more about it.
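Before deciding what to do with the imbalance, a quick way to quantify it is to look at the class proportions (a minimal sketch):
-- CODE language-python line-numbers --
# share of each region in the target column
print(df.Regionname.value_counts(normalize=True).round(3))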
For this analysis, I will just drop the underrepresented regions and work with the more 'relevant' data.
-- CODE language-python line-numbers --
# drop element by query
df = df.query("Regionname!='Eastern Victoria' and Regionname!='Northern Victoria' and Regionname!='Western Victoria'")
df.shape
# output:
# (34308, 14)
Let's take a look at all the missing values:
-- CODE language-python line-numbers --
# percentage of missing values per column, keeping only columns above 50%
nan = df.isna().sum() * 100 / len(df)
nan.loc[nan > 50]
# output:
# BuildingArea 60.75
# YearBuilt 55.51
# dtype: float64
It was possible to find two columns where more than 50% of the entries are missing: BuildingArea and YearBuilt. For now, we will drop them, and if necessary, we can revisit this step.
-- CODE language-python line-numbers --
df = df.drop(columns=['BuildingArea', 'YearBuilt'])
df.shape
# output:
# (34308, 12)
In this article, we already talked about suspicious values like a negative price for a house, for example.
The describe output above also shows a land size equal to zero, which is an impossible value for this variable. Considering that we have already removed a lot of data, we can simply replace these values with NaN and deal with them in the data imputation step.
-- CODE language-python line-numbers --
import numpy as np
print("Impossible Landsize:", len(df.Landsize[df.Landsize <= 0]))
df.Landsize = df.Landsize.replace(0, np.nan)
# output:
# Impossible Landsize: 2436
Summarizing what we've seen so far, we applied:
- Dimensionality reduction to remove irrelevant columns and columns with too many missing values.
- Numerosity reduction to remove entries with a missing target or from underrepresented regions, while suspicious values were flagged as NaN for later imputation.
To finalize the step, we can separate the data between features and targets.
The features (X) will be used to create the predictive model, and the target (y) will be the answer for each entry.
-- CODE language-python line-numbers --
X = df.drop(columns=['Regionname'])
y = df.Regionname
Data transformation
The data transformation technique is helpful to deal with issues related to data with different scales, types, etc.
Some methods used in this step are normalization, standardization, discretization, and binarization. They all change the data without losing its original characteristics (a short sketch of each follows the list):
- Normalization: to rescale the values into a range of [0,1].
- Standardization: to rescale data to have a mean of 0 and a standard deviation of 1.
- Discretization: to convert a continuous value into a discrete or nominal value.
- Binarization: to convert a value or range of values into true and false values.
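The sketch below illustrates each method on a toy numeric column using scikit-learn's preprocessing module; the values and parameters are arbitrary and only meant to show the effect of each transformation:
-- CODE language-python line-numbers --
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, KBinsDiscretizer, Binarizer
# toy column, invented for illustration only
values = np.array([[1.0], [2.0], [5.0], [10.0]])
# normalization: rescale the values into [0, 1]
print(MinMaxScaler().fit_transform(values).ravel())
# standardization: rescale to mean 0 and standard deviation 1
print(StandardScaler().fit_transform(values).ravel())
# discretization: map the continuous values into 3 ordinal bins
print(KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform').fit_transform(values).ravel())
# binarization: values above the threshold become 1, the rest 0
print(Binarizer(threshold=4.0).fit_transform(values).ravel())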
For this dataset, we have values with different magnitudes and types (text, number, and decimal number).
The idea is to apply a method to rescale the data and to factorize all categorical or non-numeric features into codes from 0 to N-1, where N is the number of unique values.
Let's see how it works in practice:
-- CODE language-python line-numbers --
# factorize categorical (object) columns into numeric codes
for i in X:
    if X.dtypes[i] == 'object':
        X[i] = pd.factorize(X[i])[0].astype(float)
        # factorize encodes missing values as -1, so restore them as NaN
        X[i] = X[i].replace({-1: np.nan})
We applied the process to factorize the categorical and non-numeric features into codes from 0 to N-1.
To rescale the data, it is possible to use either the standardization or the normalization method.
However, standardization takes the standard deviation into account, which handles datasets containing outliers better than normalization does.
To verify whether the dataset contains outliers, we can use the Interquartile Range (IQR): any value that falls more than 1.5 × IQR below the first quartile or above the third quartile can be considered an outlier.
-- CODE language-python line-numbers --
# flag columns where more than 1% of the values fall outside the IQR fences
for i in X:
    Q1 = X[i].quantile(0.25)
    Q3 = X[i].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - (1.5 * IQR)
    upper = Q3 + (1.5 * IQR)
    out = X[(X[i] > upper) | (X[i] < lower)]
    p = 100 * (len(out) / len(X[i]))
    if p > 1:
        print('Outlier in {}: {:.2f}%'.format(i, p))
# output:
# Outlier in Price: 3.70%
# Outlier in Distance: 2.80%
# Outlier in Bathroom: 1.05%
# Outlier in Car: 4.25%
# Outlier in Landsize: 1.78%
# Outlier in Propertycount: 2.46%
There are some outliers in this dataset that can be handled by the standardization method.
To perform this, we will use the preprocessing module from scikit-learn, an open-source tool for predictive data analysis.
-- CODE language-python line-numbers --
from sklearn import preprocessing
# rescale every feature to mean 0 and standard deviation 1
scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)
X = pd.DataFrame(X_scaled, columns=X.columns)
Data Imputation
The data imputation process consists of replacing missing data with substituted values.
There are many ways to perform this; the most common techniques are listed below, followed by a short sketch of the simpler options:
- Replace by a global constant: easy, but it introduces a high bias and may affect inference.
- Replace by the mean, median, or mode: an easy way to get a likely value; however, it distorts the distribution.
- Replace by neighbors or linear regression: a good way to impute more varied values, but it needs more processing and depends on the distance measure and its parameters.
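For the first two options, here is a minimal sketch using scikit-learn's SimpleImputer on a toy column (the values are invented for illustration):
-- CODE language-python line-numbers --
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# toy column with one missing value, invented for illustration only
toy = pd.DataFrame({'Landsize': [120.0, 250.0, np.nan, 400.0]})
# replace by a global constant
print(SimpleImputer(strategy='constant', fill_value=0).fit_transform(toy).ravel())
# replace by the mean (median and most_frequent work the same way)
print(SimpleImputer(strategy='mean').fit_transform(toy).ravel())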
Let's try to use the nearest neighbors:
-- CODE language-python line-numbers --
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
data = imputer.fit_transform(X)
X = pd.DataFrame(data=data, columns=X.columns)
The KNNImputer method replaces each missing value with the mean value of its n nearest neighbors, in this case 5.
Because KNN uses the distances between pairs of samples, and the measurement units also influence these distances, it is important to use this method only after data transformation to avoid bad predictions.
Results so far
We now have a dataset without missing values, with all features on a numeric scale, and with outliers handled by standardization.
Compared with the initial raw data, we have a much better dataset to create a predictive model. However, the dataset may still have data that can be removed, e.g., if two or more variables are strongly correlated, it may imply that they contribute equally to the analysis, meaning these features are redundant.
We can use the Pearson or Spearman correlation to calculate the correlations between data.
- Pearson method: to evaluate the linear relationship between two continuous variables.
- Spearman method: to evaluate a monotonic (possibly non-linear) relationship; the coefficient is based on the ranked values of each variable.
We used the Spearman correlation in the following exercise:
-- CODE language-python line-numbers --
import matplotlib.pyplot as plt
plt.figure(figsize=(15,10))
corr = X.corr(method ='spearman')
sns.heatmap(corr, annot=True, cmap=plt.cm.Reds)
plt.show()
The correlogram shows a high correlation between some variables, which are:
- Distance with CouncilArea and Suburb: this is interesting and helps us understand that the farther away from the Central Business District, the bigger the council area.
- Bedroom2 with Rooms, Bathroom, Price, and Car: a similar analysis can be made for this correlation, where the price is directly connected to the number of rooms. In addition, the correlogram suggests that we may remove some of these features from the analysis.
Looking at the correlation between Bedroom2, Rooms, Bathroom, Price, and Car, how can we know the best feature to remove?
The method of Principal Component Analysis (PCA) can be used to support the data reduction suggested by the analysis of the correlogram.
The PCA method will show the contribution of each feature in the analysis. It is a technique to reduce the data, increase interpretability and minimize information loss.
-- CODE language-python line-numbers --
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(X)
# print the explained variance ratio of each principal component
for idx, i in enumerate(X):
    p = pca.explained_variance_ratio_[idx]
    print('PCA for {}: {:.2f}%'.format(i, 100*p))
# output:
# PCA for Suburb: 32.01%
# PCA for Rooms: 22.30%
# PCA for Type: 9.57%
# PCA for Price: 8.83%
# PCA for Distance: 6.35%
# PCA for Bedroom2: 5.92%
# PCA for Bathroom: 4.45%
# PCA for Car: 3.93%
# PCA for Landsize: 3.24%
# PCA for CouncilArea: 2.97%
# PCA for Propertycount: 0.43%
CouncilArea can be removed; it is redundant and does not contribute much to the analysis.
The Propertycount feature is not redundant, but it is irrelevant. Even though Bedroom2 has the highest correlation, it still contributes well to the analysis, so we will keep it and remove the Car and Bathroom columns instead.
-- CODE language-python line-numbers --
X = X.drop(columns=['Car', 'Bathroom', 'CouncilArea', 'Propertycount'])
X.head()
https://gist.github.com/leite-paulohf/eaf64436b05c82d170d316396affd4b3
Let's try to use the current dataset with a logistic regression model. We will use a technique to split the dataset into train and test sets for the next exercise.
It is crucial to evaluate the model and see how close we are to predicting well. We will use the F1 score as the evaluation metric.
-- CODE language-python line-numbers --
from sklearn.model_selection import train_test_split as split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score as f1
X_train, X_test, y_train, y_test = split(X, y, test_size=0.2)
LogR = LogisticRegression()
LogR.fit(X_train, y_train)
y_pred = LogR.predict(X_test)
print('Score:', f1(y_test, y_pred, average='micro')*100)
# output:
# 52.14223258525211
Let's try to skip the preprocessing and see what happens:
-- CODE language-python line-numbers --
df = pd.read_csv('Melbourne_housing.csv')
X = df.drop(columns=['Regionname'])
y = df.Regionname
X_train, X_test, y_train, y_test = split(X, y, test_size=0.2)
try:
    LogR = LogisticRegression()
    LogR.fit(X_train, y_train)
    y_pred = LogR.predict(X_test)
    print('Score:', f1(y_test, y_pred, average='micro')*100)
except ValueError:
    print('could not convert string to float: Preston')
# output:
# could not convert string to float: Preston
Data preprocessing is crucial for any machine learning process. After all steps and techniques, the data can finally be easily interpreted by the algorithms.
When we tried to use the raw data directly, the algorithm failed. Even if we try to fix the error, another one will appear, and we will lose the data interpretability.
Looking at all the methods and techniques introduced here, is it possible to improve this score?
I have no doubt that the answer is yes, but perhaps that is a topic for part 2, when we can talk more about predictive models, metrics, tuning, supervised vs. unsupervised learning, and more.
. . .
In this article, I wanted to introduce the main concepts around data preprocessing in practice, which is a must-do step in any machine learning process. If you have any feedback or want to see other articles like this, feel free to reach out. That's all, folks!
References
- Han, J., Kamber, M. & Pei, J., Data Mining: Concepts and Techniques, Morgan Kaufmann.
- García-Laencina, P.J., Sancho-Gómez, J. & Figueiras-Vidal, A.R. Pattern classification with missing data: a review. Neural Comput & Applic 19, 263–282 (2010).
- Pavel Horbonos, A brief guide to data imputation with Python and R: Make the data clean, Towards Data Science (2020)
- Melbourne Housing Market by Kaggle
- Dataset transformations by Scikit Learn