How to Get Dummies in Pandas

Suraj Joshi Feb 02, 2024
  1. pandas.get_dummies() Method
  2. Create DataFrame With Dummy Variable Columns Using pandas.get_dummies() Method
  3. Set columns to Create Dummy Variables for Specified Columns Only
  4. Set prefix to Change the Default Name of Dummy Columns
This tutorial explains how to generate a DataFrame with dummy (indicator) variables from a DataFrame with categorical columns.

pandas.get_dummies() Method

pandas.get_dummies(
    data,
    prefix=None,
    prefix_sep="_",
    dummy_na=False,
    columns=None,
    sparse=False,
    drop_first=False,
    dtype=None,
)
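As a rough illustration of what a few of these parameters do, the sketch below calls get_dummies() on a small Series. The `colors` Series and its values are hypothetical and not part of this tutorial's data, and `dtype=int` is passed only so that the result holds 1/0 values regardless of the installed pandas version.

```python
import pandas as pd

# Hypothetical data for illustration only; None marks a missing value
colors = pd.Series(["red", "green", None, "red"], name="color")

# One indicator column per distinct non-missing value
basic = pd.get_dummies(colors, dtype=int)

# dummy_na=True adds one more indicator column for missing values
with_na = pd.get_dummies(colors, dummy_na=True, dtype=int)

# drop_first=True drops the first level, leaving k-1 columns for k levels
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)

# prefix and prefix_sep control how the new column names are built
named = pd.get_dummies(colors, prefix="color", prefix_sep="-", dtype=int)

print(basic.columns.tolist())
print(len(with_na.columns))
print(reduced.columns.tolist())
print(named.columns.tolist())
```

The sparse parameter is not shown here; it only changes how the dummy columns are stored, not which columns are created.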

Create DataFrame With Dummy Variable Columns Using pandas.get_dummies() Method

import pandas as pd

students_df = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303],
        "Name": ["Mike", "Christine", "Rob", "Daniel", "Jennifer"],
        "Sex": ["Male", "Female", "Male", "Male", "Female"],
    }
)

students_df_dummies = pd.get_dummies(students_df)

print("The original DataFrame is:")
print(students_df, "\n")

print("DataFrame with Dummies:")
print(students_df_dummies)

Output:

The original DataFrame is:
    Id       Name     Sex
0  302       Mike    Male
1  504  Christine  Female
2  708        Rob    Male
3  103     Daniel    Male
4  303   Jennifer  Female 

DataFrame with Dummies:
    Id  Name_Christine  Name_Daniel  Name_Jennifer  Name_Mike  Name_Rob  Sex_Female  Sex_Male
0  302               0            0              0          1         0           0         1
1  504               1            0              0          0         0           1         0
2  708               0            0              0          0         1           0         1
3  103               0            1              0          0         0           0         1
4  303               0            0              1          0         0           1         0

It generates a DataFrame whose dummy column names are formed by joining the original column name and each unique value in that column with an underscore (the default prefix_sep). The numeric Id column is left unchanged.

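The Name column has five unique values, so it is split into five columns named Name_ plus each unique name in the DataFrame. Each dummy column holds 1 where the original row had that value and 0 otherwise. In pandas 2.0 and later the dummy columns default to a boolean dtype, so the same code prints True and False instead; passing dtype=int to get_dummies() restores the 1/0 output shown above.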

For example, the row whose Name is Daniel in students_df has the value 1 in the Name_Daniel column of students_df_dummies, while every other row has the value 0 in that column.
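The naming rule and the one-value-per-row behaviour described above can be checked with a short sketch. It reuses the tutorial's students_df; the variable names `dummies`, `expected`, and `name_cols` are introduced here for illustration, and `dtype=int` is an added assumption so that the values are 1/0 on any pandas version.

```python
import pandas as pd

students_df = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303],
        "Name": ["Mike", "Christine", "Rob", "Daniel", "Jennifer"],
        "Sex": ["Male", "Female", "Male", "Male", "Female"],
    }
)

# dtype=int keeps the indicators as 1/0 instead of True/False
dummies = pd.get_dummies(students_df, dtype=int)

# Rebuild the expected names: original column name + "_" + unique value
expected = [f"Name_{value}" for value in sorted(students_df["Name"].unique())]
name_cols = [col for col in dummies.columns if col.startswith("Name_")]
print(name_cols == expected)

# Every row is flagged in exactly one Name_ column
print(dummies[name_cols].sum(axis=1).tolist())

# Only the row whose Name is Daniel (index 3) has a 1 in Name_Daniel
print(dummies["Name_Daniel"].tolist())
```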

Set columns to Create Dummy Variables for Specified Columns Only

By default, the get_dummies() method creates dummy columns for every column with dtype object, string, or category. To create dummy variables for particular columns only, pass a list of column names as the columns argument.

import pandas as pd

students_df = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303],
        "Name": ["Mike", "Christine", "Rob", "Daniel", "Jennifer"],
        "Sex": ["Male", "Female", "Male", "Male", "Female"],
    }
)

students_df_dummies = pd.get_dummies(students_df, columns=["Sex"])

print("The original DataFrame is:")
print(students_df, "\n")

print("DataFrame with Dummies:")
print(students_df_dummies)

Output:

The original DataFrame is:
    Id       Name     Sex
0  302       Mike    Male
1  504  Christine  Female
2  708        Rob    Male
3  103     Daniel    Male
4  303   Jennifer  Female 

DataFrame with Dummies:
    Id       Name  Sex_Female  Sex_Male
0  302       Mike           0         1
1  504  Christine           1         0
2  708        Rob           0         1
3  103     Daniel           0         1
4  303   Jennifer           1         0

It creates dummy variables for the Sex column only.
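The columns argument can also name a column that would not be encoded by default. As a small sketch using the same students_df, listing the numeric Id column encodes it while Name and Sex are left as they are. The variable name `id_dummies` is introduced here for illustration, and `dtype=int` is an added assumption to keep the values as 1/0.

```python
import pandas as pd

students_df = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303],
        "Name": ["Mike", "Christine", "Rob", "Daniel", "Jennifer"],
        "Sex": ["Male", "Female", "Male", "Male", "Female"],
    }
)

# Only the columns listed in `columns` are encoded, whatever their dtype
id_dummies = pd.get_dummies(students_df, columns=["Id"], dtype=int)

# Id is replaced by Id_<value> columns; Name and Sex are untouched
print(id_dummies.columns.tolist())
print(id_dummies["Id_302"].tolist())
```

Encoding an identifier column like this is rarely useful in practice; it is shown only to make the effect of columns visible.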

Set prefix to Change the Default Name of Dummy Columns

import pandas as pd

students_df = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303],
        "Name": ["Mike", "Christine", "Rob", "Daniel", "Jennifer"],
        "Sex": ["Male", "Female", "Male", "Male", "Female"],
    }
)

students_df_dummies = pd.get_dummies(students_df, columns=["Sex"], prefix="Column")

print("The original DataFrame is:")
print(students_df, "\n")

print("DataFrame with Dummies:")
print(students_df_dummies)

Output:

The original DataFrame is:
    Id       Name     Sex
0  302       Mike    Male
1  504  Christine  Female
2  708        Rob    Male
3  103     Daniel    Male
4  303   Jennifer  Female 

DataFrame with Dummies:
    Id       Name  Column_Female  Column_Male
0  302       Mike              0            1
1  504  Christine              1            0
2  708        Rob              0            1
3  103     Daniel              0            1
4  303   Jennifer              1            0

Setting prefix="Column" replaces the default Sex prefix for the dummy columns generated from the Sex column, so they are named Column_Female and Column_Male.
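When more than one column is encoded, prefix can also be given as a dictionary that maps each column name to its own prefix. The sketch below assumes the same students_df; the prefixes "N" and "Gender" and the variable name `renamed` are arbitrary choices made for illustration, and `dtype=int` is added to keep the values as 1/0.

```python
import pandas as pd

students_df = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303],
        "Name": ["Mike", "Christine", "Rob", "Daniel", "Jennifer"],
        "Sex": ["Male", "Female", "Male", "Male", "Female"],
    }
)

# A dict maps each encoded column to the prefix used for its dummy columns
renamed = pd.get_dummies(
    students_df,
    columns=["Name", "Sex"],
    prefix={"Name": "N", "Sex": "Gender"},
    dtype=int,
)

print(renamed.columns.tolist())
```

A list such as prefix=["N", "Gender"] should work as well, provided its length matches the number of columns being encoded.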

Author: Suraj Joshi

Suraj Joshi is a backend software engineer at Matrice.ai.
