Compare Pandas DataFrame Object

Suraj Joshi Apr-12, 2022 Jan-16, 2021 Pandas Pandas DataFrame

This tutorial explains how to compare Pandas DataFrame objects in Python. We can compare DataFrames element by element using the == operator.

import pandas as pd

data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [10, 8, 6, 5, 4]}

data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [7, 8, 6, 7, 4]}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)

print("df_1:")
print(df_1)

print("")

print("df_2:")
print(df_2)

Output:

df_1:
        Player  Goals
0  Lewandowski     10
1       Haland      8
2      Ronaldo      6
3        Messi      5
4       Mbappe      4

df_2:
        Player  Goals
0  Lewandowski      7
1       Haland      8
2      Ronaldo      6
3        Messi      7
4       Mbappe      4

We will use the DataFrames df_1 and df_2 to demonstrate the comparison of DataFrames in this article.

Compare Pandas DataFrame Object Using the == Operator

import pandas as pd

data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [10, 8, 6, 5, 4]}

data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [7, 8, 6, 7, 4]}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)

print(df_1 == df_2)

Output:

   Player  Goals
0    True  False
1    True   True
2    True   True
3    True  False
4    True   True

The == operator compares the corresponding elements of df_1 and df_2. It returns True at each position where the two elements are the same and False otherwise.
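When only a single yes-or-no answer is needed, the element-wise result can be contrasted with pandas.DataFrame.equals(). The sketch below is a minimal illustration, not part of the original example: the frames left and right are hypothetical, and they contain a missing value to show that == treats NaN as unequal to NaN while equals() treats missing values in the same position as equal.

```python
import numpy as np
import pandas as pd

# Hypothetical frames, identical including a missing value
left = pd.DataFrame({"Goals": [10.0, np.nan]})
right = pd.DataFrame({"Goals": [10.0, np.nan]})

# Element-wise comparison: the NaN position compares as False
elementwise = left == right
print(elementwise)

# Whole-frame comparison: NaN values in the same location count as equal
same_frame = left.equals(right)
print(same_frame)
```

If the data may contain missing values, equals() is usually the safer way to ask whether two DataFrames are identical as a whole.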

We can use the pandas.DataFrame.all() method with axis=1 to find out which rows are the same in both df_1 and df_2.

import pandas as pd

data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [10, 8, 6, 5, 4]}

data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [7, 8, 6, 7, 4]}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)

print((df_1 == df_2).all(axis=1))

Output:

0    False
1     True
2     True
3    False
4     True
dtype: bool

Rows marked True in the output have identical values in every column of df_1 and df_2. Rows marked False differ in at least one column.
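The axis argument decides whether all() checks each row or each column. The following sketch reuses the same data to show both directions and to count the matching rows; the variable names matches, rows_match, and cols_match are only illustrative.

```python
import pandas as pd

players = ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"]
df_1 = pd.DataFrame({"Player": players, "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame({"Player": players, "Goals": [7, 8, 6, 7, 4]})

matches = df_1 == df_2

# axis=1: one value per row, True when every column matches
rows_match = matches.all(axis=1)

# axis=0: one value per column, True when every row matches
cols_match = matches.all(axis=0)

print(cols_match)

# True counts as 1, so the sum is the number of identical rows
identical_rows = int(rows_match.sum())
print(identical_rows)
```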

We can use boolean indexing to list all the rows whose values differ between df_1 and df_2.

import pandas as pd

data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [10, 8, 6, 5, 4]}

data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [7, 8, 6, 7, 4]}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)

# Keep only the rows where at least one column differs
print(df_1[~(df_1 == df_2).all(axis=1)])

Output:

        Player  Goals
0  Lewandowski     10
3        Messi      5

It lists all the rows of df_1 whose values differ from the corresponding rows in df_2.
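The filtered output shows only the values from df_1. To see the differing values from both DataFrames side by side, one option is pandas.DataFrame.compare(). This is a sketch that assumes pandas 1.1 or newer, where the method was introduced; it is an alternative to the == approach rather than part of it.

```python
import pandas as pd

players = ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"]
df_1 = pd.DataFrame({"Player": players, "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame({"Player": players, "Goals": [7, 8, 6, 7, 4]})

# compare() keeps only the rows and columns that differ.
# "self" holds the values from df_1 and "other" the values from df_2.
diff = df_1.compare(df_2)
print(diff)
```

Like the == operator, compare() expects both DataFrames to carry identical row and column labels.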

If df_1 and df_2 have different indexes, the comparison raises ValueError: Can only compare identically-labeled DataFrame objects.

import pandas as pd

data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [10, 8, 6, 5, 4]}

data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [7, 8, 6, 7, 4]}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2, index=['a', 'b', 'c', 'd', 'e'])

print(df_1 == df_2)

Output:

Traceback (most recent call last):
...
ValueError: Can only compare identically-labeled DataFrame objects

We can use the [pandas.DataFrame.reset_index() method](/api/python-pandas/pandas-dataframe-dataframe.reset_index-function/) to reset the index and avoid this error.

import pandas as pd

data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [10, 8, 6, 5, 4]}

data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [7, 8, 6, 7, 4]}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2, index=['a', 'b', 'c', 'd', 'e'])
df_2.reset_index(drop=True, inplace=True)

print(df_1 == df_2)

Output:

   Player  Goals
0    True  False
1    True   True
2    True   True
3    True  False
4    True   True

It resets the index of df_2 before the comparison so that the two DataFrames have the same index, which makes the comparison possible.
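Resetting the index with inplace=True changes df_2 permanently. If the original labels should be preserved, a possible variation is to compare reset copies instead. The helper name compare_by_position below is hypothetical, and the two small frames are shortened versions of the data used above.

```python
import pandas as pd

df_1 = pd.DataFrame({"Player": ["Lewandowski", "Messi"], "Goals": [10, 5]})
df_2 = pd.DataFrame(
    {"Player": ["Lewandowski", "Messi"], "Goals": [7, 5]},
    index=["a", "b"],
)


def compare_by_position(left, right):
    """Compare two DataFrames row by row, ignoring their index labels.

    reset_index(drop=True) returns a new DataFrame here, so neither
    input is modified.
    """
    return left.reset_index(drop=True) == right.reset_index(drop=True)


result = compare_by_position(df_1, df_2)
print(result)

# df_2 still carries its original labels
print(df_2.index.tolist())
```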

We must also make sure that both DataFrames have the same number of rows before comparing them.
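One way to guard against mismatched labels or row counts is to check both DataFrames before applying ==. The following is a minimal sketch; the helper name can_compare and the frame df_short are assumptions made for illustration.

```python
import pandas as pd

players = ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"]
df_1 = pd.DataFrame({"Player": players, "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame({"Player": players, "Goals": [7, 8, 6, 7, 4]})

# Hypothetical frame with fewer rows than df_1
df_short = df_2.head(3)


def can_compare(left, right):
    """Return True when both frames have identical row and column labels."""
    return left.index.equals(right.index) and left.columns.equals(right.columns)


print(can_compare(df_1, df_2))
print(can_compare(df_1, df_short))

# Without the check, a row-count mismatch surfaces as a ValueError
try:
    df_1 == df_short
    raised = False
except ValueError:
    raised = True
print(raised)
```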
