How to Compare Pandas DataFrame Objects

Suraj Joshi Feb 02, 2024
This tutorial explains how to compare Pandas DataFrame objects in Python using the == operator, which compares two DataFrames element by element.

import pandas as pd

data_season1 = {
    "Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
    "Goals": [10, 8, 6, 5, 4],
}

data_season2 = {
    "Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
    "Goals": [7, 8, 6, 7, 4],
}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)

print("df_1:")
print(df_1)

print("")

print("df_2:")
print(df_2)

Output:

df_1:
        Player  Goals
0  Lewandowski     10
1       Haland      8
2      Ronaldo      6
3        Messi      5
4       Mbappe      4

df_2:
        Player  Goals
0  Lewandowski      7
1       Haland      8
2      Ronaldo      6
3        Messi      7
4       Mbappe      4

We will use the DataFrames df_1 and df_2 to demonstrate the comparison of DataFrames in this article.

Compare Pandas DataFrame Objects Using the == Operator

import pandas as pd

data_season1 = {
    "Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
    "Goals": [10, 8, 6, 5, 4],
}

data_season2 = {
    "Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
    "Goals": [7, 8, 6, 7, 4],
}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)

print(df_1 == df_2)

Output:

   Player  Goals
0    True  False
1    True   True
2    True   True
3    True  False
4    True   True

The == operator compares the corresponding elements of df_1 and df_2. It returns True at each position where the two elements are the same and False where they differ.
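To summarize the elementwise result instead of reading it cell by cell, the Boolean DataFrame can be aggregated. The following is a minimal sketch that rebuilds the same df_1 and df_2; the variable names mismatches_per_column and frames_identical are illustrative only and are not part of pandas.

```python
import pandas as pd

players = ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"]
df_1 = pd.DataFrame({"Player": players, "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame({"Player": players, "Goals": [7, 8, 6, 7, 4]})

# != is the elementwise opposite of ==, so True marks a differing cell.
# Summing a Boolean column counts its True values.
mismatches_per_column = (df_1 != df_2).sum()
print(mismatches_per_column)

# The first all() reduces each column to one Boolean,
# the second reduces those to a single Boolean for the whole DataFrame.
frames_identical = bool((df_1 == df_2).all().all())
print(frames_identical)
```

Here the Goals column should report two mismatches (rows 0 and 3), so the two DataFrames are not identical as a whole.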

We can use the pandas.DataFrame.all() method with axis=1 to find which rows are the same in both df_1 and df_2.

import pandas as pd

data_season1 = {
    "Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
    "Goals": [10, 8, 6, 5, 4],
}

data_season2 = {
    "Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
    "Goals": [7, 8, 6, 7, 4],
}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)

print((df_1 == df_2).all(axis=1))

Output:

0    False
1     True
2     True
3    False
4     True
dtype: bool

A True value in the output means that every element in that row is the same in both DataFrames. A False value means that at least one element in the row differs.
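The Boolean Series returned by all(axis=1) can also be used directly as a mask. As a small sketch under the same data (same_rows is an illustrative name), this keeps only the rows that are the same in both DataFrames and counts them:

```python
import pandas as pd

players = ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"]
df_1 = pd.DataFrame({"Player": players, "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame({"Player": players, "Goals": [7, 8, 6, 7, 4]})

# One Boolean per row: True only if every column matches in that row.
same_rows = (df_1 == df_2).all(axis=1)

# Boolean indexing keeps the rows where the mask is True.
print(df_1[same_rows])

# Summing the mask counts the matching rows.
print(int(same_rows.sum()))
```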

We can use indexing to list all the rows whose values differ in df_1 and df_2.

import pandas as pd

data_season1 = {
    "Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
    "Goals": [10, 8, 6, 5, 4],
}

data_season2 = {
    "Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
    "Goals": [7, 8, 6, 7, 4],
}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)

# ~ inverts the Boolean mask, selecting rows where at least one value differs
print(df_1[~(df_1 == df_2).all(axis=1)])

Output:

        Player  Goals
0  Lewandowski     10
3        Messi      5

It lists all the rows of df_1 whose values differ from the corresponding rows in df_2.
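The filtered output shows only the df_1 side of each differing row. If both sides are needed, one option is the DataFrame.compare() method, which is a different technique from the == operator used in this article. The sketch below assumes pandas 1.1 or later, where compare() is available, and uses the illustrative name diff.

```python
import pandas as pd

players = ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"]
df_1 = pd.DataFrame({"Player": players, "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame({"Player": players, "Goals": [7, 8, 6, 7, 4]})

# compare() keeps only the rows and columns that contain a difference.
# For each such column, "self" holds the df_1 value and "other" the df_2 value.
diff = df_1.compare(df_2)
print(diff)
```

Like the == operator, compare() expects both DataFrames to have identical row and column labels.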

If df_1 and df_2 have different indexes, the == operator raises an error: ValueError: Can only compare identically-labeled DataFrame objects.

import pandas as pd

data_season1 = {
    "Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
    "Goals": [10, 8, 6, 5, 4],
}

data_season2 = {
    "Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
    "Goals": [7, 8, 6, 7, 4],
}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2, index=["a", "b", "c", "d", "e"])

print(df_1 == df_2)

Output:

Traceback (most recent call last):
...
ValueError: Can only compare identically-labeled DataFrame objects
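In a script, this error can be anticipated instead of letting it stop the program. The following sketch shows two possible guards: checking the labels with Index.equals() before comparing, and catching the ValueError. The exact wording of the error message may vary between pandas versions, so the sketch does not rely on it; the name result is illustrative.

```python
import pandas as pd

players = ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"]
df_1 = pd.DataFrame({"Player": players, "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame(
    {"Player": players, "Goals": [7, 8, 6, 7, 4]},
    index=["a", "b", "c", "d", "e"],
)

# Guard 1: check whether the row labels match before comparing.
print(df_1.index.equals(df_2.index))

# Guard 2: attempt the comparison and handle the failure.
try:
    result = df_1 == df_2
except ValueError as err:
    print(f"Cannot compare: {err}")
    result = None
```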

We can use the [pandas.DataFrame.reset_index() method](/api/python-pandas/pandas-dataframe-dataframe.reset_index-function/) to reset the indices to overcome the above issue.

import pandas as pd

data_season1 = {
    "Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
    "Goals": [10, 8, 6, 5, 4],
}

data_season2 = {
    "Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
    "Goals": [7, 8, 6, 7, 4],
}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2, index=["a", "b", "c", "d", "e"])
df_2.reset_index(drop=True, inplace=True)

print(df_1 == df_2)

Output:

   Player  Goals
0    True  False
1    True   True
2    True   True
3    True  False
4    True   True

It resets the index of df_2 before the comparison so that both DataFrames have the same indices, which makes the comparison possible.
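Using inplace=True modifies df_2 itself. When the original labels are still needed afterwards, a possible variant is to keep the result of reset_index() in a new variable, as sketched below (df_2_aligned is an illustrative name). Note that after resetting, rows are matched purely by position, so this assumes both DataFrames list their rows in the same order.

```python
import pandas as pd

players = ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"]
df_1 = pd.DataFrame({"Player": players, "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame(
    {"Player": players, "Goals": [7, 8, 6, 7, 4]},
    index=["a", "b", "c", "d", "e"],
)

# Without inplace=True, reset_index() returns a new DataFrame
# and leaves df_2 and its labels unchanged.
df_2_aligned = df_2.reset_index(drop=True)

print(df_2.index.tolist())   # original labels are still there
print(df_1 == df_2_aligned)  # comparison now works by position
```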

We must also make sure both DataFrames have the same number of rows before comparing them. Otherwise, their row labels cannot match, and the == operator raises the same ValueError.
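One simple way to check this is to compare the shape attribute of both DataFrames first. The sketch below drops the last row of df_2 to create a deliberately shorter DataFrame; df_short is a hypothetical name used only for this illustration.

```python
import pandas as pd

players = ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"]
df_1 = pd.DataFrame({"Player": players, "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame({"Player": players, "Goals": [7, 8, 6, 7, 4]})

# Keep only the first four rows so the row counts differ.
df_short = df_2.iloc[:4]

# shape is a (rows, columns) tuple, so equal shapes mean equal row counts.
if df_1.shape == df_short.shape:
    print(df_1 == df_short)
else:
    print(f"Shapes differ: {df_1.shape} vs {df_short.shape}")
```

Equal shapes are necessary but not sufficient: the row and column labels must also match for the == operator to work.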

Author: Suraj Joshi

Suraj Joshi is a backend software engineer at Matrice.ai.