How to Compare Two CSV Files and Print Differences Using Python

Faaiq Bilal Feb 02, 2024
  1. Method 1: Compare Two CSV Files Using the Most Pythonic Solution
  2. Method 2: Compare Two CSV Files Using csv-diff - An External Module
  3. Method 3: Compare Two CSV Files Using Pandas DataFrames
How to Compare Two CSV Files and Print Differences Using Python

This article will discuss various methods of comparing two CSV files. We will include the most “Pythonic” way of performing this operation and an external Python module that can help simplify this task.

Lastly, we will include a method using Pandas DataFrames to identify differences in the CSV files.

We will assume that the two CSV files we need to compare are titled file1.csv and file2.csv. You can rename the files as you see fit.

Please also replace the file names appropriately in the code snippets given below.

For example purposes, we have our files setup as follows:

file1.csv:

1,2,3,4,5,6
4,5,6,7,8,9
1,3,4,5,6,1

file2.csv:

1,2,3,4,5,6
4,5,6,7,8,9
2,3,1,4,1,5

Method 1: Compare Two CSV Files Using the Most Pythonic Solution

In this method, we read the file’s contents into two lists, iterate over one of the lists and check whether or not each of the lines exists in the second list. Logically, this is a very simple solution.

Python’s underlying efficiencies make this comparison fairly efficient, despite what it looks like.

with open("file1.csv", "r") as file1, open("file2.csv", "r") as file2:
    f1_contents = file1.readlines()
    f2_contents = file2.readlines()

for line in f1_contents:
    if line not in f2_contents:
        print(line)

for line in f2_contents:
    if line not in f1_contents:
        print(line)

The above code snippet will print the differing lines to your terminal.

In our test case, we get the following as output.

1,3,4,5,6,1

2,3,1,4,1,5

Method 2: Compare Two CSV Files Using csv-diff - An External Module

Firstly, install the module using the following command in your terminal.

python3 -m pip install csv-diff

Once installed, you do not need to write a Python script. You can run this directly in the terminal with the following command.

csv-diff file1.csv file2.csv --key=id

Running this command will display the differences on your terminal.

In our test case, we get the following as output.

1 row added, 1 row removed

1 row added

  1: 2
  2: 3
  3: 1
  4: 4
  5: 1
  6: 5

1 row removed

  1: 1
  2: 3
  3: 4
  4: 5
  5: 6
  6: 1

To use this module as part of a Python script, you can write a script similar to the following.

from csv_diff import load_csv, compare

difference = compare(load_csv(open("file1.csv")), load_csv(open("file2.csv")))
print(difference)

The output for this will be the following.

{'added': [{'1': '2', '2': '3', '3': '1', '4': '4', '5': '1', '6': '5'}], 'removed': [{'1': '1', '2': '3', '3': '4', '4': '5', '5': '6', '6': '1'}], 'changed': [], 'columns_added': [], 'columns_removed': []}

Method 3: Compare Two CSV Files Using Pandas DataFrames

The following script can perform this task for you.

import pandas as pd
import sys
import csv


def dataframe_difference(df1: pd.DataFrame, df2: pd.DataFrame, which=None):

    comparison_df = df1.merge(df2, indicator=True, how="outer")

    if which is None:
        diff_df = comparison_df[comparison_df["_merge"] != "both"]
    else:
        diff_df = comparison_df[comparison_df["_merge"] == which]

    return diff_df


if __name__ == "__main__":
    df1 = pd.read_csv("file1.csv", header=None)
    df2 = pd.read_csv("file2.csv", header=None)

    print(dataframe_difference(df1, df2))

Please note that in the read_csv method, the argument header=None is entered because our test file does not have any header. If your file has a header, you can read it using: pd.read_csv("file1.csv"), where file1.csv will be replaced by your file instead.

In case your file is not present in the same directory as your script, please provide the full path to your CSV files.

The above Python scripts should generate an output like:

   0  1  2  3  4  5      _merge
2  1  3  4  5  6  1   left_only
3  2  3  1  4  1  5  right_only

The lines next to left_only and right_only contain all the differences. The line next to _merge only represents indices.

Related Article - Python CSV