Pandas Fuzzy Match

Salman Mehmood Feb 15, 2024
Pandas Fuzzy Match

This article educates how to merge data frames and see how to apply the fuzzy match to compare two pandas’ data frames in python.

Apply Fuzzy Match on Pandas Data Frame in Python

Suppose we have the following use case with two different tables, and we want to merge them into a common column; look at an example.

We have df1, the first data frame, and df2, the second data frame, and both contain the column Company_Name. For example, see the following code block.

import pandas as pd

df1 = [
    "Amazing Art Inc.",
    "Amazing Ants Ltd.",
    "Beautiful Buildings Inc.",
    "Clean Communication Inc.",
    "Dramatic Dramas Ltd.",
]
df2 = [
    "Amazing art gnc.",
    "Amazing Aant Ltd.",
    "Beautiful Buildings Inc",
    "Clear Communication Inc.",
    "Pramatic Dramas Inc.",
]

df1 = pd.DataFrame(df1, columns=["Company_Name"])
df2 = pd.DataFrame(reversed(df2), columns=["Company_Name"])

When we compare them, we will see that they are pretty similar names, but they are not precisely identical such as we have Amazing Art Inc., and in another data frame, we have Amazing Art gnc.. Few changes are made to other items.

When working with real-life data, we cannot match these items; fortunately, there is a solution using the fuzzywuzzy library. Now we will see how to install this library in Jupyter because we are using the Jupyter notebook.

We will import the sys library and use the executable inside the curly brackets, and in front of this, we will write a command.

import sys
!{sys.executable} -m pip install fuzzywuzzy

Output:

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0

After installing successfully, we load the required function and module from the fuzzywuzzy library.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Now we will use the extractOne() function from the process to show how it works, and it will only return the most likely match that it found. So, for example, we want to match Beautiful Bldgfs and then compare it to df1 using the column name.

process.extractOne("Beautiful Bldgfs", df1["Company_Name"])

As we can see, it returns Beautiful Buildings Inc., and 77 is an estimation or confidence that this is the correct match. And in the last numerical value, 2 shows which position it has in the data frame.

('Beautiful Buildings Inc.', 77, 2)

To compare the entire data frame for all the columns, we would create a new column, load the Company_Name column from df1, and map it using the map() function. Using the lambda function we call extractOne().

Inside this function, we put df2 and check the column to find the closest match, and afterwards, we could use indexes because it returns a tuple, and we only want to return the first entry of the tuple.

After calling df1, we see we got the closest matches for all entries.

df1["Best_Match"] = df1["Company_Name"].map(
    lambda x: process.extractOne(x, df2["Company_Name"])[0]
)
df1

Output:

pandas fuzzy match - output

Here is the complete Python code for the above example.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd


df1 = [
    "Amazing Art Inc.",
    "Amazing Ants Ltd.",
    "Beautiful Buildings Inc.",
    "Clean Communication Inc.",
    "Dramatic Dramas Ltd.",
]
df2 = [
    "Amazing art gnc.",
    "Amazing Aant Ltd.",
    "Beautiful Buildings Inc",
    "Clear Communication Inc.",
    "Pramatic Dramas Inc.",
]


df1 = pd.DataFrame(df1, columns=["Company_Name"])
df2 = pd.DataFrame(reversed(df2), columns=["Company_Name"])

print(process.extractOne("Beautiful Bldgfs", df1["Company_Name"]))

df1["Best_Match"] = df1["Company_Name"].map(
    lambda x: process.extractOne(x, df2["Company_Name"])[0]
)
print(df1)
Salman Mehmood avatar Salman Mehmood avatar

Hello! I am Salman Bin Mehmood(Baum), a software developer and I help organizations, address complex problems. My expertise lies within back-end, data science and machine learning. I am a lifelong learner, currently working on metaverse, and enrolled in a course building an AI application with python. I love solving problems and developing bug-free software for people. I write content related to python and hot Technologies.

LinkedIn