
Handling Categorical Data in Python

Posted in Machine Learning   LAST UPDATED: OCTOBER 28, 2021

    In our previous article, we covered how to handle missing values in a given dataset in Python to make it suitable for machine learning algorithms. However, handling missing values alone is not enough to prepare a dataset for machine learning.

    So far, we have only been working with numerical values. However, it is not uncommon that real-world datasets contain one or more categorical feature columns. When we are talking about categorical data, we have to further distinguish between nominal and ordinal features.

    Ordinal features can be understood as categorical values that can be sorted or ordered. For example, T-shirt size would be an ordinal feature, because we can define an order XL > L > M.

    In contrast, nominal features don't imply any order and, to continue with the previous example, we could think of T-shirt color as a nominal feature since it typically doesn't make sense to say that, for example, red is larger than blue.

    Before we explore different techniques to handle such categorical data, let's create a new data frame to illustrate the problem:

    import pandas as pd

    df = pd.DataFrame([
        ['green', 'M', 10.1, 'class1'],
        ['red', 'L', 13.5, 'class2'],
        ['blue', 'XL', 15.3, 'class1']])
    df.columns = ['color', 'size', 'price', 'classlabel']
    print(df)

    Output:

    color size price classlabel
    0 green M 10.1 class1
    1 red L 13.5 class2
    2 blue XL 15.3 class1

    As we can see in the preceding output, the newly created DataFrame contains a nominal feature (color), an ordinal feature (size), and a numerical feature (price) column. The class labels (assuming that we created a dataset for a supervised learning task) are stored in the last column.


    Mapping ordinal features

    To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels of our size feature.

    Thus, we have to define the mapping manually. In the following simple example, let's assume that we know the numerical difference between the sizes, for example, XL = L + 1 = M + 2.

    size_mapping = {
        'XL': 3,
        'L': 2,
        'M': 1}
    df['size'] = df['size'].map(size_mapping)
    print(df)

    Output:

    color size price classlabel
    0 green 1 10.1 class1
    1 red 2 13.5 class2
    2 blue 3 15.3 class1

    If we want to transform the integer values back to the original string representation at a later stage, we can simply define a reverse-mapping dictionary, inv_size_mapping = {v: k for k, v in size_mapping.items()}, and apply it to the transformed feature column via pandas' map method, just as we used the size_mapping dictionary previously.
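
    For example:

    inv_size_mapping = {v: k for k, v in size_mapping.items()}
    # map the integer codes back to the original size strings
    print(df['size'].map(inv_size_mapping))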


    Encoding class labels

    Many machine learning libraries require class labels to be encoded as integer values. Although most estimators for classification in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical glitches.

    To encode the class labels, we can use an approach similar to the mapping of ordinal features discussed previously. We need to remember that class labels are not ordinal, and it doesn't matter which integer number we assign to a particular string-label. Thus, we can simply enumerate the class labels starting from 0:
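
    import numpy as np

    # one possible mapping: enumerate the unique class labels, starting from 0
    class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
    print(class_mapping)                        # {'class1': 0, 'class2': 1}
    print(df['classlabel'].map(class_mapping))  # the encoded labels as a pandas Series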

    Alternatively, there is a convenient LabelEncoder class implemented directly in scikit-learn that achieves the same:

    from sklearn.preprocessing import LabelEncoder
    
    class_le = LabelEncoder()
    y = class_le.fit_transform(df['classlabel'].values)
    print(y)

    Output:

    array([0, 1, 0])
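
    Analogous to the inv_size_mapping dictionary for the sizes, the fitted encoder can translate the integer labels back to their original string representation via its inverse_transform method:

    # map the integer class labels back to the original strings
    class_le.inverse_transform(y)   # gives back 'class1', 'class2', 'class1'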


    Performing one-hot encoding on nominal features

    We used a simple dictionary-mapping approach to convert the ordinal size feature into integers. Since scikit-learn's estimators treat class labels without any order, we used the convenient LabelEncoder class to encode the string labels into integers. It may appear that we could use a similar approach to transform the nominal color column of our dataset, as follows:

    X = df[['color', 'size', 'price']].values
    color_le = LabelEncoder()
    X[:, 0] = color_le.fit_transform(X[:, 0])
    print(X)

    Output:

    array([[1, 1, 10.1], [2, 2, 13.5], [0, 3, 15.3]], dtype=object)

    After executing the preceding code, the first column of the NumPy array X now holds the new color values, which are encoded as follows:

    • blue -> 0

    • green -> 1

    • red -> 2

    If we stop at this point and feed the array to our classifier, we will make one of the most common mistakes in dealing with categorical data. Can you spot the problem? Although the color values don't come in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green. Although this assumption is incorrect, the algorithm could still produce useful results. However, those results would not be optimal.

    A common workaround for this problem is to use a technique called one-hot encoding. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here, we would convert the color feature into three new features: blue, green, and red. Binary values can then be used to indicate the particular color of a sample; for example, a blue sample can be encoded as blue=1, green=0, red=0. To perform this transformation, we can use the OneHotEncoder that is implemented in the sklearn.preprocessing module, applied to the color column via a ColumnTransformer:

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # one-hot encode the color column (index 0) and append the remaining columns unchanged
    ohe = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough')
    ohe.fit_transform(X)

    Output:

    array([[ 0. , 1. , 0. , 1. , 10.1], [ 0. , 0. , 1. , 2. , 13.5], [ 1. , 0. , 0. , 3. , 15.3]])

    When we set up the ColumnTransformer, we specified the column position of the variable that we want to transform (note that color is the first column in the feature matrix X) and used remainder='passthrough' so that the remaining columns are appended unchanged. Note that, used on its own, the OneHotEncoder returns a sparse matrix from its transform method; here, the ColumnTransformer stacks the encoded column together with the passthrough columns into a regular (dense) NumPy array, but a standalone sparse result can always be converted via the toarray method for inspection.

    Sparse matrices are simply a more memory-efficient way of storing large datasets that contain many zeros, and they are supported by many scikit-learn functions. To omit the toarray step, we could initialize the encoder with sparse_output=False (named sparse in scikit-learn versions before 1.2) so that it returns a regular NumPy array directly.
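
    For example, applying the encoder to the color column on its own (assuming scikit-learn 1.2 or newer for the sparse_output parameter name):

    # dense one-hot encoding of the color column alone
    color_ohe = OneHotEncoder(sparse_output=False)
    print(color_ohe.fit_transform(X[:, 0].reshape(-1, 1)))
    # rows: green, red, blue encoded as [0, 1, 0], [0, 0, 1], [1, 0, 0]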

    An even more convenient way to create those dummy features via one-hot encoding is to use the get_dummies function implemented in pandas. Applied to a DataFrame, get_dummies will only convert string columns and leave all other columns unchanged:

    pd.get_dummies(df[['price', 'color', 'size']])

    Output:

    price size color_blue color_green color_red
    0 10.1 1 0 1 0
    1 13.5 2 0 0 1
    2 15.3 3 1 0 0

    So in this article, building on the missing-value handling covered previously, we learned how to convert categorical data into a meaningful numerical representation that is easier for machine learning algorithms to process.
