What Is the Difference Between Join and Merge in Pandas

What Is the Difference Between Join and Merge in Pandas

Samreena Aslam Mar-29, 2022 Jan-25, 2022 Pandas Pandas DataFrame
  1. Pandas DataFrame .join Method
  2. Pandas DataFrame .merge Method
  3. Conclusion

In Pandas, Series or DataFrame can easily join or combine using various operations such as join and merge. These operations combine two DataFrames based on the indexes and column name. Both join and merge methods can combine two DataFrames. The main difference between the join and merge operation is that the join method combines two DataFrames based on their indexes, whereas in the merge method, we need to specify columns to combine both DataFrames.

This article will discuss the difference between the join and merge methods in pandas python.

Pandas DataFrame .join Method

The join method joins the two dataframs on their indexes. Let’s take an example to show the working of the join method. We have taken two DataFrames: left_df and right_df. Using the left_df.join(right_df) code, we have combined two DataFrames.

Example Code:

import pandas as pd

# create two dataframe
df_left = pd.DataFrame({'Name': ['X', 'Y', 'Z'], 'Score': [10, 8, 9]}).set_index('Name')
df_right = pd.DataFrame({'Name': ['X', 'Y', 'Z'], 'Steals': [4, 5, 2]}).set_index('Name')
print(df_left)
print(df_right)

# join two dataframes
df_left.join(df_right)

Output:

      Score
Name       
X        10
Y         8
Z         9
      Steals
Name        
X          4
Y          5
Z          2
Score	Steals
Name		
X	10	4
Y	8	5
Z	9	2

If we have overlapping columns in both DataFrames, in this case, the join will want you to add a suffix to the overlapping or common column name from the left dataframe. In the following DataFrames, the overlapping column name is C.

Example Code:

import pandas as pd
  
# Creating the two dataframes
df_left = pd.DataFrame([['x', 1], ['y', 2]], list('AB'), list('CD'))
df_right = pd.DataFrame([['u', 3], ['v', 4]], list('AB'), list('CF'))
print(df_left)
print(df_right)
# join two dataframes
joined_df = df_left.join(df_right, lsuffix='_')
print(joined_df)

Output:

   C  D
A  x  1
B  y  2
   C  F
A  u  3
B  v  4
  C_  D  C  F
A  x  1  u  3
B  y  2  v  4

As you can see in the above output, the index is preserved with four columns. We can also specify a particular column separately on the left dataframe by using the on parameter as the join key.

Pandas DataFrame .merge Method

The merge method is also used to combine two DataFrames. But, the merge method requires the column name as the merge key for combining the two DataFrames. In the following example, we implement the simple merge function to combine two dataframes without using any parameter.

Example Code:

import pandas as pd

# create two dataframe
df_left = pd.DataFrame({'Name': ['X', 'Y', 'Z'], 'Score': [10, 8, 9]}).set_index('Name')
df_right = pd.DataFrame({'Name': ['X', 'Y', 'Z'], 'Steals': [4, 5, 2]}).set_index('Name')
print(df_left)
print(df_right)

# merge two dataframes
df_left.merge(df_right, on='Name')

Output:

      Score
Name       
X        10
Y         8
Z         9
      Steals
Name        
X          4
Y          5
Z          2
Score	Steals
Name		
X	10	4
Y	8	5
Z	9	2

We can specify the overlapping column name with the on parameter in the merge method. In the following example, we specify the overlapping column name C to perform the merge operation on two DataFrames.

Example Code:

import pandas as pd
  
# Creating the two dataframes
df_left = pd.DataFrame([['x', 1], ['y', 2]], list('AB'), list('CD'))
df_right = pd.DataFrame([['u', 3], ['v', 4]], list('AB'), list('CF'))
print(df_left)
print(df_right)

# merge dataframes
merged_df = df_left.merge(df_right, on='C', how='outer')
print(merged_df)

Output:

   C  D
A  x  1
B  y  2
   C  F
A  u  3
B  v  4
   C    D    F
0  x  1.0  NaN
1  y  2.0  NaN
2  u  NaN  3.0
3  v  NaN  4.0

To separately specify it using right_on and left_on parameters. See the following example, in which we have used different parameters such as on, left_on, right_on for better understanding.

Example Code:

import pandas as pd
  
# Creating the two dataframes
df_left = pd.DataFrame([['x', 1], ['y', 2]], list('AB'), list('CD'))
df_right = pd.DataFrame([['u', 3], ['v', 4]], list('AB'), list('CF'))
print(df_left)
print(df_right)
merged_df = df_left.merge(df_right, left_index=True,
                       right_index=True, suffixes=['_', ''])
print(merged_df)

Output:

   C  D
A  x  1
B  y  2
   C  F
A  u  3
B  v  4
  C_  D  C  F
A  x  1  u  3
B  y  2  v  4

Conclusion

We demonstrated the difference between the join and merge in pandas with the help of some examples. We have seen both methods, join and merge are used for a similar purpose, combining the DataFrames in pandas. But, the difference is that the join method combines two DataFrames on their indexed, whereas in the merge method, we specify the column name to combine two DataFrames.

Related Article - Pandas DataFrame

  • Get Pandas DataFrame Column Headers as a List
  • Delete Pandas DataFrame Column
  • Convert Pandas Column to Datetime
  • Convert a Float to an Integer in Pandas DataFrame
  • Sort Pandas DataFrame by One Column's Values
  • Get the Aggregate of Pandas Group-By and Sum