How to Drop Duplicate Pandas Rows

Suraj Joshi Feb 02, 2024

This tutorial explains how we can remove all the duplicate rows from a Pandas DataFrame using the DataFrame.drop_duplicates() method.

`DataFrame.drop_duplicates()` Syntax

DataFrame.drop_duplicates(subset=None, keep="first", inplace=False, ignore_index=False)

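It returns a new DataFrame with the duplicate rows removed. The subset parameter limits the comparison to the given column labels (all columns by default). The keep parameter selects which occurrence to retain: "first", "last", or False to drop every occurrence. Setting inplace=True modifies the DataFrame itself and returns None, and ignore_index=True relabels the index of the result as 0, 1, ..., n-1.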
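To make the effect of the inplace and ignore_index parameters concrete, here is a minimal sketch. The small DataFrame and the variable names are hypothetical and chosen only for illustration; they are not part of the examples that follow.

```python
import pandas as pd

# Hypothetical sample data: the third row repeats the first one.
df = pd.DataFrame({"Id": [1, 2, 1, 3], "Name": ["A", "B", "A", "C"]})

# Default call: a new DataFrame is returned and the original labels are kept.
kept_labels = df.drop_duplicates()
print(kept_labels.index.tolist())  # [0, 1, 3]

# ignore_index=True relabels the surviving rows 0..n-1.
relabeled = df.drop_duplicates(ignore_index=True)
print(relabeled.index.tolist())  # [0, 1, 2]

# inplace=True modifies the object itself and returns None.
in_place = df.copy()
result = in_place.drop_duplicates(inplace=True)
print(result)         # None
print(len(in_place))  # 3
```

Because the default call leaves the original df untouched, the result has to be assigned to a variable, as the examples in this tutorial do.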

Remove Duplicate Rows Using the `DataFrame.drop_duplicates()` Method

import pandas as pd

df_with_duplicates = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303, 302],
        "Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
        "Cost": ["300", "400", "350", "100", "300", "300"],
    }
)

df_without_duplicates = df_with_duplicates.drop_duplicates()

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

Output:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300

It removes the rows that have the same values in all the columns. By default, two rows are considered duplicates only when they have the same value in every column of the DataFrame. In the df_with_duplicates DataFrame, the first and sixth rows (index 0 and 5) have the same values in all the columns, so the sixth row is removed and the first occurrence is kept.
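If it is unclear which rows will be treated as duplicates, the related DataFrame.duplicated() method returns a boolean mask that can be inspected first. The sketch below reuses the tutorial's data; the names is_duplicate and manual are introduced here only for illustration.

```python
import pandas as pd

df_with_duplicates = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303, 302],
        "Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
        "Cost": ["300", "400", "350", "100", "300", "300"],
    }
)

# True marks a row that repeats an earlier row in every column.
is_duplicate = df_with_duplicates.duplicated()
print(is_duplicate.tolist())  # [False, False, False, False, False, True]

# Filtering with the inverted mask selects the same rows as drop_duplicates().
manual = df_with_duplicates[~is_duplicate]
print(manual.equals(df_with_duplicates.drop_duplicates()))  # True
```

Only the row at index 5 is flagged, which matches the output above.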

Set `subset` Parameter to Remove Duplicates Based on Specific Columns Only

import pandas as pd

df_with_duplicates = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303, 302],
        "Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
        "Cost": ["300", "400", "350", "100", "300", "300"],
    }
)

df_without_duplicates = df_with_duplicates.drop_duplicates(subset=["Name"])

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

Output:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100

Here, we pass ["Name"] as the subset argument to the drop_duplicates() method, so only the Name column is compared. The fifth and sixth rows (index 4 and 5) are removed because they have the same value in the Name column, Watch, as the first row.
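The subset argument also accepts more than one column, in which case rows are duplicates only when they match in all of the listed columns. As a brief sketch on the same data, comparing Id and Name together removes fewer rows than comparing Name alone; the variable names by_name and by_id_and_name are illustrative.

```python
import pandas as pd

df_with_duplicates = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303, 302],
        "Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
        "Cost": ["300", "400", "350", "100", "300", "300"],
    }
)

# Only Name is compared: every later "Watch" row is dropped.
by_name = df_with_duplicates.drop_duplicates(subset=["Name"])
print(by_name.index.tolist())  # [0, 1, 2, 3]

# Id and Name are compared together: the row with Id 303 is no longer
# a duplicate, so only the repeated (302, "Watch") row is dropped.
by_id_and_name = df_with_duplicates.drop_duplicates(subset=["Id", "Name"])
print(by_id_and_name.index.tolist())  # [0, 1, 2, 3, 4]
```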

Set `keep='last'` in the `drop_duplicates()` Method

import pandas as pd

df_with_duplicates = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303, 302],
        "Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
        "Cost": ["300", "400", "350", "100", "300", "300"],
    }
)

df_without_duplicates = df_with_duplicates.drop_duplicates(subset=["Name"], keep="last")

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

Output:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
5  302   Watch  300

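For each repeated value in the Name column, it keeps only the last occurrence. The first and fifth rows (index 0 and 4) are removed, and the sixth row (index 5) is the only Watch row that remains.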
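One common use of keep="last" is to sort the rows first so that the last occurrence is the one we want to retain. The following is a sketch of that pattern, not something the method requires: it assumes we want the row with the largest Id for each Name, and the variable name largest_id_per_name is illustrative.

```python
import pandas as pd

df_with_duplicates = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303, 302],
        "Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
        "Cost": ["300", "400", "350", "100", "300", "300"],
    }
)

# Sort by Id so that, within each Name, the largest Id comes last,
# then keep only that last occurrence.
largest_id_per_name = df_with_duplicates.sort_values("Id").drop_duplicates(
    subset=["Name"], keep="last"
)

print(largest_id_per_name["Name"].tolist())  # ['Shoes', 'Watch', 'Camera', 'Phone']
print(largest_id_per_name["Id"].tolist())    # [103, 303, 504, 708]
```

Without the sort, keep="last" retains the Watch row with Id 302, because that row appears last in the original order.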

We set keep=False to remove every row whose value in the subset columns appears more than once, so that no occurrence of a duplicated value is kept.

import pandas as pd

df_with_duplicates = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303, 302],
        "Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
        "Cost": ["300", "400", "350", "100", "300", "300"],
    }
)

df_without_duplicates = df_with_duplicates.drop_duplicates(subset=["Name"], keep=False)

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

Output:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100

It removes the first, fifth, and sixth rows, as they all have the same value in the Name column.
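Since keep=False discards every occurrence, it can be useful to look at the affected rows before dropping them. A minimal sketch, using DataFrame.duplicated() with the same subset and keep arguments; the names repeated_mask and repeated_rows are illustrative.

```python
import pandas as pd

df_with_duplicates = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303, 302],
        "Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
        "Cost": ["300", "400", "350", "100", "300", "300"],
    }
)

# With keep=False, every row whose Name occurs more than once is flagged.
repeated_mask = df_with_duplicates.duplicated(subset=["Name"], keep=False)
repeated_rows = df_with_duplicates[repeated_mask]
print(repeated_rows.index.tolist())  # [0, 4, 5]

# The unflagged rows are exactly what drop_duplicates(keep=False) returns.
remaining = df_with_duplicates.drop_duplicates(subset=["Name"], keep=False)
print(remaining.index.tolist())  # [1, 2, 3]
```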

Author: Suraj Joshi

Suraj Joshi is a backend software engineer at Matrice.ai.
