Pandas Anti-Join

Pandas Anti-Join

  1. Pandas Anti-Join and Its Types
  2. Perform the Left Anti-Join in Pandas
  3. Perform the Right Anti-Join in Pandas

This tutorial describes Pandas’ anti-joins, briefly describes their types and demonstrates each using example codes.

Pandas Anti-Join and Its Types

By using anti-join, we can return all the rows (also known as records and documents) in one dataset that do not find the matching value in the other dataset. These are used to manipulate datasets as per the project requirements.

pandas anti join - visualize anti join

There are two types of anti-joins in Pandas that are listed below:

  • Left Anti-Join - It returns the records in the left dataset that have no matching records in the right dataset.
  • Right Anti-Join - It returns the records in the right dataset that have not matched those in the left dataset.

See the above tabular representation. We are using the left anti-join, which returns the left dataset, excluding the intersection.

Note that it only returns columns from the left dataset and not from the right.

Similarly, using the right anti-join will return the right dataset, excluding the intersection. Like left anti-join, it will also return columns from the right dataset only and not from the left.

Let’s learn how we can use both of these anti-joins in Pandas. Note that you must have enough knowledge of SQL joins to grasp the anti-joins firmly.

Perform the Left Anti-Join in Pandas

  • Import the library.
    import pandas as pd
    

    First, we import the pandas library to play with data frames.

  • Create two data frames.
    #first DataFrame
    df1 = pd.DataFrame({'sections': ['A', 'B', 'C', 'D', 'E'],
    				  'points': [19, 23, 20, 15, 31]})
    
    print("First DataFrame:")
    print(df1)
    
    #second DataFrame
    df2 = pd.DataFrame({'sections': ['A', 'B', 'C', 'F', 'G'],
    				  'points': [19, 23, 20, 24, 30]})
    
    print("\n\nSecond DataFrame:")
    print(df2)
    

    We create two data frames containing sample points for different sections that you can see below.

    OUTPUT:

    First DataFrame:
    sections points
    0        A      19
    1        B      23
    2        C      20
    3        D      15
    4        E      31
    
    
    Second DataFrame:
    sections points
    0        A      19
    1        B      23
    2        C      20
    3        F      24
    4        G      30
    
  • Perform the outer join.
    outer_join = df1.merge(df2, how='outer', indicator=True)
    print(outer_join)
    

    To perform anti-join, we need to go through the outer join, which returns the matched and unmatched values from either or both datasets. We are getting matched and unmatched values from both datasets (you can see this in the following output).

    Here, the merge() method is used to update the data of two data frames by using a particular method(s) to merge them. We are using a few parameters to control what values to replace and what to keep.

    • df2 - It is another data frame to merge with.
    • how - It specifies how to merge. It is an optional parameter whose values can be left, right, outer, inner, or cross; by default, it is inner.
    • indicator - It can be set to True, False, or a string type value. If we set it to True, it will add the _merge column to the output data frame having the information on a source of every row.

    We can also give a different name to the _merge column by specifying a string argument. This column will contain the categorical type values, for instance, left_only, right_only, and both, as we have in the following output.

    Here, both means if the merge key of the observation (row) is found in both datasets, left_only shows that the merge key of the observation is found in the left dataset only, while right_only indicates that the observation’s merge key is found in right dataset only.

    OUTPUT:

    sections points      _merge
    0        A      19        both
    1        B      23        both
    2        C      20        both
    3        D      15   left_only
    4        E      31   left_only
    5        F      24 right_only
    6        G      30 right_only
    
  • Perform the left anti-join.
    lef_anti_join = outer_join[(outer_join._merge=='left_only')].drop('_merge', axis=1)
    print(lef_anti_join)
    

    First, we use outer_join[(outer_join._merge=='left_only')] to retrieve all rows having a left_only value in the _merge column, then chain it with the .drop() method, which drops the _merge column from the output data frame.

    OUTPUT:

    sections points
    3        D      15
    4        E      31
    

    Now see, we get columns from the left data frame (df1), excluding the intersection.

  • See the complete source code to perform the left anti-join in Pandas.
    import pandas as pd
    
    #first DataFrame
    df1 = pd.DataFrame({'sections': ['A', 'B', 'C', 'D', 'E'],
    				  'points': [19, 23, 20, 15, 31]})
    print("First DataFrame:")
    print(df1)
    
    #second DataFrame
    df2 = pd.DataFrame({'sections': ['A', 'B', 'C', 'F', 'G'],
    				  'points': [19, 23, 20, 24, 30]})
    print("\n\nSecond DataFrame:")
    print(df2)
    
    #outer join
    outer_join = df1.merge(df2, how='outer', indicator=True)
    
    #left anti join
    lef_anti_join = outer_join[(outer_join._merge=='left_only')].drop('_merge', axis=1)
    print("\n\nLeft Anti-join:")
    print(lef_anti_join)
    

    OUTPUT:

    First DataFrame:
    sections points
    0        A      19
    1        B      23
    2        C      20
    3        D      15
    4        E      31
    
    
    Second DataFrame:
    sections points
    0        A      19
    1        B      23
    2        C      20
    3        F      24
    4        G      30
    
    
    Left Anti-join:
    sections points
    3        D      15
    4        E      31
    

{{ % /step %}}

Perform the Right Anti-Join in Pandas

We have learned step-by-step how to perform left anti-join in Pandas. The right anti-join can also be done similarly, but here, we will select those rows with a right_only value in the _merge column.

Example Code:

import pandas as pd

#first DataFrame
df1 = pd.DataFrame({'sections': ['A', 'B', 'C', 'D', 'E'],
                    'points': [19, 23, 20, 15, 31]})
print("First DataFrame:")
print(df1)

#second DataFrame
df2 = pd.DataFrame({'sections': ['A', 'B', 'C', 'F', 'G'],
                    'points': [19, 23, 20, 24, 30]})
print("\n\nSecond DataFrame:")
print(df2)

#outer join
outer_join = df1.merge(df2, how='outer', indicator=True)

#right anti join
right_anti_join = outer_join[(outer_join._merge=='right_only')].drop('_merge', axis=1)
print("\n\nRight Anti-join:")
print(right_anti_join)

OUTPUT:

First DataFrame:
  sections  points
0        A      19
1        B      23
2        C      20
3        D      15
4        E      31


Second DataFrame:
  sections  points
0        A      19
1        B      23
2        C      20
3        F      24
4        G      30


Right Anti-join:
  sections  points
5        F      24
6        G      30

This time, we get columns from the right data frame (df2), excluding the intersection.

Mehvish Ashiq avatar Mehvish Ashiq avatar

Mehvish Ashiq is a former Java Programmer and a Data Science enthusiast who leverages her expertise to help others to learn and grow by creating interesting, useful, and reader-friendly content in Computer Programming, Data Science, and Technology.

LinkedIn GitHub Facebook