Pandas Drop Duplicate Rows

  1. DataFrame.drop_duplicates() Syntax
  2. Remove Duplicate Rows Using the DataFrame.drop_duplicates() Method
  3. Set subset Parameter to Remove Duplicates Based on Specific Columns Only
  4. Set keep='last' in the drop_duplicates() Method

This tutorial explains how we can remove all the duplicate rows from a Pandas DataFrame using the DataFrame.drop_duplicates() method.

DataFrame.drop_duplicates() Syntax

DataFrame.drop_duplicates(subset=None, 
                          keep='first', 
                          inplace=False, 
                          ignore_index=False)

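By default, it returns a new DataFrame with the duplicate rows removed and leaves the original DataFrame unchanged. If inplace=True is set, it modifies the original DataFrame instead and returns None.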
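The subset and keep parameters are covered in the sections below. As a minimal sketch of the other two parameters, the following uses a small hypothetical DataFrame (the name df and its values are assumptions made for illustration) to show how ignore_index and inplace affect the result.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    'Name': ['Watch', 'Watch', 'Camera', 'Watch'],
    'Cost': [300, 300, 400, 300]
})

# Default: the surviving rows keep their original index labels
default_result = df.drop_duplicates()
print(list(default_result.index))

# ignore_index=True relabels the surviving rows 0, 1, ..., n-1
reindexed = df.drop_duplicates(ignore_index=True)
print(list(reindexed.index))

# inplace=True modifies df itself and returns None
returned = df.drop_duplicates(inplace=True)
print(returned, len(df))
```

Assigning the returned DataFrame to a new name, as the rest of this tutorial does, keeps the original data available for comparison.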

Remove Duplicate Rows Using the DataFrame.drop_duplicates() Method

import pandas as pd

df_with_duplicates = pd.DataFrame({
    'Id': [302, 504, 708, 103, 303, 302],
    'Name': ['Watch', 'Camera', 'Phone', 'Shoes', 'Watch', 'Watch'],
    'Cost': ["300", "400", "350", "100", "300", "300"]
})

df_without_duplicates = df_with_duplicates.drop_duplicates()

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

Output:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300 

By default, drop_duplicates() treats two rows as duplicates only when they have the same values in every column. In the df_with_duplicates DataFrame, the first and sixth rows have the same values for all the columns, so the sixth row is removed. The fifth row is kept because its Id (303) differs from the others.
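To see which rows count as duplicates before dropping them, one option is the DataFrame.duplicated() method, which returns a boolean Series. The sketch below reuses the same data; the names mask and manual are chosen here for illustration.

```python
import pandas as pd

df_with_duplicates = pd.DataFrame({
    'Id': [302, 504, 708, 103, 303, 302],
    'Name': ['Watch', 'Camera', 'Phone', 'Shoes', 'Watch', 'Watch'],
    'Cost': ["300", "400", "350", "100", "300", "300"]
})

# True marks a row that repeats an earlier row in every column
mask = df_with_duplicates.duplicated()
print(mask.tolist())

# Selecting the unmarked rows gives the same result as drop_duplicates()
manual = df_with_duplicates[~mask]
print(manual.equals(df_with_duplicates.drop_duplicates()))
```

Only the sixth row is marked True, which matches the row missing from the output above.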

Set subset Parameter to Remove Duplicates Based on Specific Columns Only

import pandas as pd

df_with_duplicates = pd.DataFrame({
    'Id': [302, 504, 708, 103, 303, 302],
    'Name': ['Watch', 'Camera', 'Phone', 'Shoes', 'Watch', 'Watch'],
    'Cost': ["300", "400", "350", "100", "300", "300"]
})

df_without_duplicates = df_with_duplicates.drop_duplicates(subset=['Name'])

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

Output:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100 

Here, we pass Name as the subset argument to the drop_duplicates() method, so only the Name column is compared. The fifth and sixth rows are removed because they have the same value in the Name column as the first row.
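The subset parameter accepts either a single column label or a list of labels. As a short sketch with the same data (the variable names by_name and by_id_and_name are chosen here for illustration), comparing on both Id and Name removes fewer rows than comparing on Name alone.

```python
import pandas as pd

df_with_duplicates = pd.DataFrame({
    'Id': [302, 504, 708, 103, 303, 302],
    'Name': ['Watch', 'Camera', 'Phone', 'Shoes', 'Watch', 'Watch'],
    'Cost': ["300", "400", "350", "100", "300", "300"]
})

# A single label works the same way as a one-element list
by_name = df_with_duplicates.drop_duplicates(subset='Name')
print(list(by_name.index))

# With two columns, rows are duplicates only if both values match
by_id_and_name = df_with_duplicates.drop_duplicates(subset=['Id', 'Name'])
print(list(by_id_and_name.index))
```

With both columns in the subset, the fifth row survives because its Id of 303 does not match any other Watch row.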

Set keep='last' in the drop_duplicates() Method

import pandas as pd

df_with_duplicates = pd.DataFrame({
    'Id': [302, 504, 708, 103, 303, 302],
    'Name': ['Watch', 'Camera', 'Phone', 'Shoes', 'Watch', 'Watch'],
    'Cost': ["300", "400", "350", "100", "300", "300"]
})

df_without_duplicates = df_with_duplicates.drop_duplicates(
    subset=['Name'], keep="last")

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

Output:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
5  302   Watch  300 

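For each value in the Name column, it keeps only the last row with that value and removes the earlier ones. Here, the first and fifth rows are removed, and the sixth row, which is the last Watch row, is kept.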
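One possible use of keep='last', sketched here as an assumption about a typical workflow rather than a step from this tutorial, is to sort the rows first so that the last row in each group is the one we want. The example below keeps the row with the highest Id for each Name.

```python
import pandas as pd

df_with_duplicates = pd.DataFrame({
    'Id': [302, 504, 708, 103, 303, 302],
    'Name': ['Watch', 'Camera', 'Phone', 'Shoes', 'Watch', 'Watch'],
    'Cost': ["300", "400", "350", "100", "300", "300"]
})

# Sort by Id so that, within each Name, the highest Id comes last
sorted_df = df_with_duplicates.sort_values('Id')

# keep='last' then retains the highest-Id row for each Name
highest_id_per_name = sorted_df.drop_duplicates(subset=['Name'], keep='last')
print(highest_id_per_name)
```

The resulting rows follow the sorted order, so the index labels are no longer ascending.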

We set keep=False to remove every row whose value in the subset columns appears more than once, so that no copy of a duplicated row is kept.

import pandas as pd

df_with_duplicates = pd.DataFrame({
    'Id': [302, 504, 708, 103, 303, 302],
    'Name': ['Watch', 'Camera', 'Phone', 'Shoes', 'Watch', 'Watch'],
    'Cost': ["300", "400", "350", "100", "300", "300"]
})

df_without_duplicates = df_with_duplicates.drop_duplicates(
    subset=['Name'], keep=False)

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

Output:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100 

It removes the first, fifth, and sixth rows as they all have the same value for the Name column.
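Since keep=False discards every copy, it can help to look at the affected rows first. A minimal sketch, assuming we want to inspect them with DataFrame.duplicated() and the same subset and keep arguments (the names dup_mask, dropped_rows, and unique_only are chosen here for illustration):

```python
import pandas as pd

df_with_duplicates = pd.DataFrame({
    'Id': [302, 504, 708, 103, 303, 302],
    'Name': ['Watch', 'Camera', 'Phone', 'Shoes', 'Watch', 'Watch'],
    'Cost': ["300", "400", "350", "100", "300", "300"]
})

# keep=False marks every row whose Name occurs more than once
dup_mask = df_with_duplicates.duplicated(subset=['Name'], keep=False)

# The rows that drop_duplicates(subset=['Name'], keep=False) would remove
dropped_rows = df_with_duplicates[dup_mask]
print(dropped_rows)

# The rows that remain
unique_only = df_with_duplicates[~dup_mask]
print(unique_only)
```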
