### Important caveats to be kept in mind when encoding data with `pandas.get_dummies()`

Handling categorical variables is an essential part of a machine learning pipeline. Most machine learning algorithms work directly with numerical variables, but not with categorical ones. A few algorithms, such as LightGBM and CatBoost, can handle categorical variables natively, but most cannot, so categorical variables must first be converted into numerical quantities. There are many ways to do this, including one-hot encoding, ordinal encoding, and label encoding. This article looks at pandas’ dummy variable encoding and exposes its potential limitations.

## Categorical variables — a quick intro

A **categorical variable** is a variable whose values range over categories, such as gender, hair color, ethnicity, zip code, or social security number. The sum of two zip codes or social security numbers is not meaningful. Similarly, the average of a list of zip codes doesn’t make sense. Categorical variables can be divided into two subcategories based on the kind of elements they group:

**Nominal variables** are those whose categories have no natural order or ranking. For instance, we could use `1` for the red color and `2` for blue. But these numbers don’t have a mathematical meaning. That is, we can’t add them together or take the average. Examples that fit in this category are gender, postal codes, hair color, etc.

**Ordinal variables** have an inherent order which is somehow significant. An example would be tracking student grades where `Grade 1 > Grade 2 > Grade 3`. Another example would be the socio-economic status of people, where `"high income" > "low income"`.
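The distinction can be made explicit in pandas itself. The following is a minimal sketch, assuming made-up example values, that uses `pd.Categorical` to declare one unordered (nominal) and one ordered (ordinal) variable:

```python
import pandas as pd

# Nominal: the categories carry no order (illustrative values)
hair = pd.Categorical(["black", "brown", "blonde"], ordered=False)

# Ordinal: the order is declared explicitly, from lowest to highest
income = pd.Categorical(
    ["low income", "high income", "low income"],
    categories=["low income", "high income"],
    ordered=True,
)

print(hair.ordered)    # False
print(income.ordered)  # True

# min/max are only meaningful because the categorical is ordered
print(income.min(), "<", income.max())

# Integer codes follow the declared order: low income -> 0, high income -> 1
print(income.codes.tolist())  # [0, 1, 0]
```

The integer codes of the nominal variable are just labels, whereas the codes of the ordinal variable reflect the declared ranking.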

## Encoding categorical variables with ``pandas.get_dummies()``

Now that we know what categorical variables are, it becomes clear that most machine learning models cannot use them directly. They have to be converted into meaningful numerical representations. This process is called encoding. There are many techniques for encoding categorical variables, but we will look specifically at the one provided by the pandas library: `get_dummies()`.

As the name suggests, the `pandas.get_dummies()` function converts categorical variables into dummy or indicator variables. Let’s see it working through an elementary example. We first define a hypothetical dataset consisting of attributes of a company’s employees, which we will later use to predict the employees’ salaries.
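A minimal stand-in for such a dataset might look like the following. Only the column names `Gender`, `EducationField`, and `MonthlyIncome` come from this article; the rows, the category labels, and the income values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical employee data: the values below are made up for illustration
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "EducationField": [
        "Medical", "Marketing", "Life Sciences",
        "Medical", "Life Sciences", "Marketing",
    ],
    "MonthlyIncome": [5993, 5130, 2090, 2909, 3468, 3068],
})

# Two text (categorical) columns and one numeric target column
print(df.dtypes)
```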

Our dataset looks like this:

```python
df
```

We can see that there are two categorical columns in the above dataset, i.e., `Gender` and `EducationField`. Let’s encode them into numerical quantities using `pandas.get_dummies()`, which returns a dummy-encoded dataframe.

```python
pd.get_dummies(df)
```

The column `Gender` gets converted into two columns, `Gender_Female` and `Gender_Male`, each holding either zero or one (recent pandas versions display these as `False`/`True` unless `dtype=int` is passed). For instance, `Gender_Female` has a value of `1` where the employee is female and `0` otherwise. The same is true for the column `Gender_Male`.

Similarly, the column `EducationField` gets separated into three different columns based on the field of education. Things are fairly straightforward so far. However, the issues begin when we use this encoded dataset to train a model.
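To make the resulting layout concrete, here is a small self-contained sketch. The data is a hypothetical stand-in (only the column names are taken from this article), and `dtype=int` is passed so the indicators are 0/1 integers regardless of pandas version.

```python
import pandas as pd

# Hypothetical stand-in for the employee data (values are made up)
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "EducationField": ["Medical", "Marketing", "Life Sciences", "Medical"],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
})

# Numeric columns pass through; each categorical column becomes
# one indicator column per category, named <column>_<category>
encoded = pd.get_dummies(df, dtype=int)
print(encoded.columns.tolist())
# ['MonthlyIncome', 'Gender_Female', 'Gender_Male',
#  'EducationField_Life Sciences', 'EducationField_Marketing',
#  'EducationField_Medical']

# Exactly one indicator per original column is set in every row
print(encoded[["Gender_Female", "Gender_Male"]].sum(axis=1).tolist())  # [1, 1, 1, 1]
```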

## The Dummy Variable Trap

Let’s say we want to use the given data to build a machine learning model that can predict employees’ monthly salaries. This is a classic example of a regression problem where the target variable is `MonthlyIncome`. If we were to use `pandas.get_dummies()` to encode the categorical variables, the following issues could arise:

### 1️⃣. The issue of Multicollinearity

Note: The above diagram explains multicollinearity very intuitively. Thanks to Karen Grace-Martin for explaining the concept in such a lucid manner. Refer to the link below to go to the article.

One of the assumptions of a linear regression model is that the independent variables are not strongly correlated with one another. **Multicollinearity** occurs when independent variables in a **regression model** are correlated. So why is correlation a problem? To help you understand the concept in detail and avoid reinventing the wheel, I’ll point you to a great piece by **Jim Frost**, where he explains it very succinctly. The following paragraph is from the same article.

A key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each 1 unit change in an independent variable when you hold all of the other independent variables constant.

If the independent variables are correlated, it becomes difficult for the model to tell how strongly a particular variable affects the target, because changing one variable tends to change the others as well. In such a case, the coefficients of a regression model will not convey the correct information.

### Multicollinearity issue with pandas.get_dummies

Consider the employee example above. Let’s isolate the `Gender` column from the dataset and encode it.

If we look closely, the `Gender_Female` and `Gender_Male` columns are perfectly collinear. This is because a value of `1` in one column automatically implies `0` in the other. This issue is termed the dummy variable trap and can be represented as:

Gender_Female = 1 - Gender_Male
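One way to see the consequence of this identity is to check the rank of the design matrix a linear model with an intercept would use. The sketch below assumes a few made-up `Gender` values; the NumPy rank check is an illustration added here, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Made-up Gender values for illustration
gender = pd.DataFrame({"Gender": ["Female", "Male", "Male", "Female", "Male"]})
dummies = pd.get_dummies(gender, dtype=int)

# The two indicator columns always add up to 1 ...
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())  # [1, 1, 1, 1, 1]

# ... which is exactly the intercept column, so stacking
# [intercept, Gender_Female, Gender_Male] yields 3 columns but only rank 2
X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)  # 3 2
```

Because the matrix is rank-deficient, the individual coefficients are not uniquely determined, which is why one of the columns has to go.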

### Solution: Drop the first column

Multicollinearity is undesirable, and we’ll encounter this issue whenever we keep every column that `pandas.get_dummies()` generates. One way to overcome it is to drop one of the generated columns. So, we can drop either `Gender_Female` or `Gender_Male` without losing any information. Fortunately, `pandas.get_dummies()` has a parameter called `drop_first` which, when set to `True`, does precisely that.

```python
pd.get_dummies(df, drop_first=True)
```
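As a quick check of what `drop_first=True` removes, here is a self-contained sketch on hypothetical data (only the column names come from this article). The first category of each column, in sorted order, becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical stand-in for the employee data (values are made up)
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "EducationField": ["Medical", "Marketing", "Life Sciences", "Medical"],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
})

encoded = pd.get_dummies(df, drop_first=True, dtype=int)
print(encoded.columns.tolist())
# ['MonthlyIncome', 'Gender_Male',
#  'EducationField_Marketing', 'EducationField_Medical']

# The dropped categories are still recoverable: a row with zeros in all
# remaining EducationField_* columns belongs to the baseline "Life Sciences"
edu_cols = ["EducationField_Marketing", "EducationField_Medical"]
baseline_rows = encoded.index[encoded[edu_cols].sum(axis=1) == 0].tolist()
print(baseline_rows)  # [2]
```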

We’ve resolved multicollinearity, but another issue lurks when we use dummy encoding, which we will look at in the next section.

### 2️⃣. Column Mismatch in train and test sets

To train a model with the given employee data, we’ll first split the dataset into train and test sets, keeping the test set aside so that our model never sees it.

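```python
from sklearn.model_selection import train_test_split

# Separate the features from the target column
X = df.drop('MonthlyIncome', axis=1)
y = df['MonthlyIncome']

# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
```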

The next step would be to encode the categorical variables in the training set and the test set.

**Encoding Training set**

```python
pd.get_dummies(X_train)
```

As expected, both the `Gender` and the `EducationField` attributes have been encoded into numerical quantities. Now we’ll apply the same process to the test dataset.

**Encoding Test set**

```python
pd.get_dummies(X_test)
```

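Wait! There is a column mismatch between the training and test sets. `pandas.get_dummies()` only creates columns for the categories it actually sees, so a category that appears in one set but not the other produces a dummy column in one encoded dataframe and not the other. The two dataframes no longer have the same columns, and this will throw an error in the modeling process.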
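The mismatch is easy to reproduce. The sketch below builds a tiny train and test set by hand (hypothetical values, chosen so that the test set happens to contain only one education field) and compares the encoded columns.

```python
import pandas as pd

# Hand-built splits with made-up values; the test set lacks two education fields
X_train = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "EducationField": ["Medical", "Marketing", "Life Sciences", "Medical"],
})
X_test = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "EducationField": ["Medical", "Medical"],
})

train_cols = pd.get_dummies(X_train).columns
test_cols = pd.get_dummies(X_test).columns
print(len(train_cols), len(test_cols))  # 5 3

# Columns the model was trained on but the encoded test set does not have
missing = sorted(set(train_cols) - set(test_cols))
print(missing)
# ['EducationField_Life Sciences', 'EducationField_Marketing']
```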

### Solution 1: Handle unknown categories by using `.reindex()` and `.fillna()`

One way of addressing this mismatch in categories would be to save the columns obtained after dummy encoding the training set in a list. Then, encode the test set as usual and use the columns of the encoded training set to align both the datasets. Let’s understand it through code:

```python
# Dummy encoding the training set
X_train_encoded = pd.get_dummies(X_train)

# Saving the columns in a list
cols = X_train_encoded.columns.tolist()

# Viewing the first three rows of the encoded dataframe
X_train_encoded[:3]
```

Now, we’ll encode the test set followed by realigning the training and test columns and filling in all missing values with zero.

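```python
# Dummy encoding the test set
X_test_encoded = pd.get_dummies(X_test)

# Align the test columns with the training columns;
# columns absent from the test set come back as NaN and are filled with 0
X_test_encoded = X_test_encoded.reindex(columns=cols).fillna(0)
X_test_encoded
```

As you can see, both datasets now have the same columns in the same order.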
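The same alignment can be wrapped in a small helper. This is only a sketch: `encode_like_train` is a hypothetical name, the data is made up, and it uses `reindex(..., fill_value=0)` as a variation on the `.fillna(0)` approach so that only the newly created columns are filled.

```python
import pandas as pd

def encode_like_train(frame, train_columns):
    """Dummy-encode `frame` and force it onto the training column layout."""
    encoded = pd.get_dummies(frame, dtype=int)
    # Missing training columns are added as 0; columns for categories
    # never seen in training are silently dropped
    return encoded.reindex(columns=train_columns, fill_value=0)

# Made-up splits: "Technical Degree" never appears in the training set
X_train = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "EducationField": ["Medical", "Marketing", "Life Sciences"],
})
X_test = pd.DataFrame({
    "Gender": ["Male", "Male"],
    "EducationField": ["Medical", "Technical Degree"],
})

train_columns = pd.get_dummies(X_train, dtype=int).columns.tolist()
X_test_encoded = encode_like_train(X_test, train_columns)

print(X_test_encoded.columns.tolist() == train_columns)  # True
print(X_test_encoded.to_numpy().tolist())
# [[0, 1, 0, 0, 1], [0, 1, 0, 0, 0]]
```

Note the trade-off: the second test row ends up with zeros in every `EducationField_*` column, because its category was unknown at training time and its dummy column was discarded.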

### Solution 2: Use **One Hot Encoding**


Another solution, and a preferable one, would be to use `sklearn.preprocessing.OneHotEncoder()`. Additionally, one can use `handle_unknown="ignore"` to avoid errors caused by categories that appear in the test set but were not seen during training.

```python
# One hot encoding the categorical columns in the training set
from sklearn.preprocessing import OneHotEncoder

# sparse_output replaced the older sparse argument in scikit-learn 1.2
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
train_enc = ohe.fit_transform(X_train[['Gender', 'EducationField']])

# Converting back to a dataframe
pd.DataFrame(train_enc, columns=ohe.get_feature_names_out())[:3]

# Transforming the test set with the encoder fitted on the training set
# (transform, not fit_transform, so the training categories are reused)
test_enc = ohe.transform(X_test[['Gender', 'EducationField']])

# Converting back to a dataframe
pd.DataFrame(test_enc, columns=ohe.get_feature_names_out())
```

Note that `OneHotEncoder` can also drop one category per feature through its `drop` parameter: `drop='first'` does so for every feature, while `drop='if_binary'` does so only for features with exactly two categories. Refer to the documentation for more detail.

## Conclusion and Takeaways

This article looked at how pandas can be used to encode categorical variables and the common caveats associated with it. We also looked in detail at possible solutions to avoid those pitfalls. I hope this article has given you an intuition for what the dummy variable trap is and how it can be avoided. Also, the two articles referenced in this post are great references, especially if you want to go deeper into issues related to multicollinearity. I highly recommend them.

**Originally published here**