Pandas DataFrame.drop_duplicates() Function

  1. Syntax of pandas.DataFrame.drop_duplicates():
  2. Example Codes: Remove Duplicate Rows Using Pandas DataFrame.drop_duplicates() Method
  3. Example Codes: Set subset Parameter in Pandas DataFrame.drop_duplicates() Method
  4. Example Codes: Set keep Parameter in Pandas DataFrame.drop_duplicates() Method
  5. Example Codes: Set ignore_index Parameter in Pandas DataFrame.drop_duplicates() Method

The Python Pandas DataFrame.drop_duplicates() function removes duplicate rows from a DataFrame. By default, it compares all columns and keeps the first occurrence of each duplicated row.

Syntax of pandas.DataFrame.drop_duplicates():

DataFrame.drop_duplicates(subset: Union[Hashable, Sequence[Hashable], NoneType] = None, 
                          keep: Union[str, bool] = 'first', 
                          inplace: bool = False, 
                          ignore_index: bool = False)

Parameters

subset Column label or sequence of labels. Columns to consider when identifying duplicates. By default, all columns are used.
keep 'first', 'last', or False. Drop duplicates except for the first occurrence (keep='first'), drop duplicates except for the last occurrence (keep='last'), or drop all duplicates (keep=False).
inplace Boolean. If True, modify the caller DataFrame in place instead of returning a new one.
ignore_index Boolean. If True, the index labels of the original DataFrame are discarded and the result is relabeled from 0 to n - 1. The default value is False, which keeps the original index labels.

Return

If inplace is False (the default), a DataFrame with the duplicate rows removed; if inplace is True, None.
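The return behavior can be checked with a minimal sketch. The small two-column DataFrame and the variable names below are illustrative assumptions, not part of the examples that follow.

```python
import pandas as pd

# Illustrative data: the 1st and 3rd rows are identical.
df = pd.DataFrame({"Name": ["Orange", "Mango", "Orange"],
                   "Price": [34, 24, 34]})

# Default (inplace=False): a new DataFrame is returned and df is unchanged.
result = df.drop_duplicates()
print(len(df), len(result))  # 3 2

# inplace=True: the caller is modified and the call returns None.
df_copy = df.copy()
returned = df_copy.drop_duplicates(inplace=True)
print(returned, len(df_copy))  # None 2
```

Because the in-place call returns None, assigning its result back to the same variable (for example, df = df.drop_duplicates(inplace=True)) would replace the DataFrame with None.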

Example Codes: Remove Duplicate Rows Using Pandas DataFrame.drop_duplicates() Method

import pandas as pd
fruit_list = [ ('Orange', 34, 'Yes' ,'ABC') ,
             ('Mango', 24, 'No','XYZ' ) ,
             ('banana', 14, 'No','BCD' ) ,
            ('Orange', 34, 'Yes' ,'ABC') ]

df = pd.DataFrame(fruit_list, 
                  columns = ['Name',
                             'Price',
                             'In_Stock',
                             'Supplier'])

print("DataFrame:")
print(df)

df_unique=df.drop_duplicates() 

print("DataFrame with Unique Rows:")
print(df_unique)

Output:

DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      BCD
3  Orange     34      Yes      ABC
DataFrame with Unique Rows:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      BCD

In the original DataFrame, the 1st and 4th rows are identical.

Calling drop_duplicates() with no arguments compares all columns, keeps the first occurrence, and removes the later duplicate, so the 4th row is dropped.
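To see which rows would be dropped before removing them, the related DataFrame.duplicated() method can be used. The following is a minimal sketch that rebuilds the same fruit data; the variable name mask is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame(
    [("Orange", 34, "Yes", "ABC"),
     ("Mango", 24, "No", "XYZ"),
     ("banana", 14, "No", "BCD"),
     ("Orange", 34, "Yes", "ABC")],
    columns=["Name", "Price", "In_Stock", "Supplier"],
)

# True marks a row that repeats an earlier row (default keep='first').
mask = df.duplicated()
print(mask.tolist())  # [False, False, False, True]

# Filtering with the inverted mask selects the same rows as drop_duplicates().
same = df[~mask].equals(df.drop_duplicates())
print(same)  # True
```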

Example Codes: Set subset Parameter in Pandas DataFrame.drop_duplicates() Method

import pandas as pd
fruit_list = [ ('Orange', 34, 'Yes' ,'ABC') ,
             ('Mango', 24, 'No','XYZ' ) ,
             ('banana', 14, 'No','ABC' ) ,
            ('Orange', 34, 'Yes' ,'ABC') ]

df = pd.DataFrame(fruit_list, 
                  columns = ['Name',
                             'Price',
                             'In_Stock',
                             'Supplier'])

print("DataFrame:")
print(df)

df_unique = df.drop_duplicates(subset="Supplier")

print("DataFrame with Unique values of Supplier Column:")
print(df_unique)

Output:

DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      ABC
3  Orange     34      Yes      ABC
DataFrame with Unique values of Supplier Column:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ

With subset="Supplier", only the Supplier column is compared, so every row that repeats an earlier Supplier value is removed.

Here, the 1st, 3rd, and 4th rows share the same Supplier value. The 3rd and 4th rows are removed because, by default (keep='first'), the first occurrence is kept.
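Since subset also accepts a sequence of labels, several columns can be compared together. The sketch below assumes the same fruit data and uses the Name and Supplier columns as an illustrative pair.

```python
import pandas as pd

df = pd.DataFrame(
    [("Orange", 34, "Yes", "ABC"),
     ("Mango", 24, "No", "XYZ"),
     ("banana", 14, "No", "ABC"),
     ("Orange", 34, "Yes", "ABC")],
    columns=["Name", "Price", "In_Stock", "Supplier"],
)

# A row is a duplicate only if BOTH Name and Supplier match an earlier row.
df_unique = df.drop_duplicates(subset=["Name", "Supplier"])
print(df_unique.index.tolist())  # [0, 1, 2]
```

The banana row survives here because its Name differs from the Orange rows, even though all three share the supplier ABC.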

Example Codes: Set keep Parameter in Pandas DataFrame.drop_duplicates() Method

import pandas as pd
fruit_list = [ ('Orange', 34, 'Yes' ,'ABC') ,
             ('Mango', 24, 'No','XYZ' ) ,
             ('banana', 14, 'No','ABC' ) ,
            ('Orange', 34, 'Yes' ,'ABC') ]

df = pd.DataFrame(fruit_list, 
                  columns = ['Name',
                             'Price',
                             'In_Stock',
                             'Supplier'])

print("DataFrame:")
print(df)

df_unique = df.drop_duplicates(subset="Supplier", keep="last")

print("DataFrame with Unique values of Supplier Column:")
print(df_unique)

Output:

DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      ABC
3  Orange     34      Yes      ABC
DataFrame with Unique values of Supplier Column:
     Name  Price In_Stock Supplier
1   Mango     24       No      XYZ
3  Orange     34      Yes      ABC

With keep="last", the last row for each Supplier value is kept and every earlier row with the same value is removed.

Here, the 1st, 3rd, and 4th rows share the same Supplier value, so the 1st and 3rd rows are removed and the 4th row is kept.
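The third option listed under Parameters, keep=False, keeps no occurrence at all of a duplicated value. A minimal sketch with the same fruit data:

```python
import pandas as pd

df = pd.DataFrame(
    [("Orange", 34, "Yes", "ABC"),
     ("Mango", 24, "No", "XYZ"),
     ("banana", 14, "No", "ABC"),
     ("Orange", 34, "Yes", "ABC")],
    columns=["Name", "Price", "In_Stock", "Supplier"],
)

# keep=False drops every row whose Supplier value appears more than once.
df_unique = df.drop_duplicates(subset="Supplier", keep=False)
print(df_unique.index.tolist())        # [1]
print(df_unique["Supplier"].tolist())  # ['XYZ']
```

Only the Mango row remains, because XYZ is the only Supplier value that occurs exactly once.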

Example Codes: Set ignore_index Parameter in Pandas DataFrame.drop_duplicates() Method

import pandas as pd
fruit_list = [ ('Orange', 34, 'Yes' ,'ABC') ,
             ('Mango', 24, 'No','XYZ' ) ,
             ('banana', 14, 'No','ABC' ) ,
            ('Orange', 34, 'Yes' ,'ABC') ]

df = pd.DataFrame(fruit_list, 
                  columns = ['Name',
                             'Price',
                             'In_Stock',
                             'Supplier'])

print("DataFrame:")
print(df)

df.drop_duplicates(subset="Supplier", keep="last", inplace=True, ignore_index=True)

print("DataFrame with Unique values of Supplier Column:")
print(df)

Output:

DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      ABC
3  Orange     34      Yes      ABC
DataFrame with Unique values of Supplier Column:
     Name  Price In_Stock Supplier
0   Mango     24       No      XYZ
1  Orange     34      Yes      ABC

Here, because ignore_index is set to True, the index labels from the original DataFrame are discarded, and the remaining rows are relabeled starting from 0.

Because inplace=True is set, the original DataFrame is modified by the call to drop_duplicates(), which returns None.
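For comparison, the same relabeling can be obtained by chaining reset_index(drop=True) after drop_duplicates(). This is a sketch of an alternative, not something the example above relies on; the names with_flag and with_reset are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [("Orange", 34, "Yes", "ABC"),
     ("Mango", 24, "No", "XYZ"),
     ("banana", 14, "No", "ABC"),
     ("Orange", 34, "Yes", "ABC")],
    columns=["Name", "Price", "In_Stock", "Supplier"],
)

# Option 1: let drop_duplicates() relabel the rows itself.
with_flag = df.drop_duplicates(subset="Supplier", keep="last", ignore_index=True)

# Option 2: drop duplicates first, then discard the old index labels.
with_reset = (df.drop_duplicates(subset="Supplier", keep="last")
                .reset_index(drop=True))

print(with_flag.index.tolist())      # [0, 1]
print(with_flag.equals(with_reset))  # True
```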