Merge Pandas DataFrames on Index

  1. Use merge() to Combine Two Pandas DataFrames on Index
  2. Use join() to Combine Two Pandas DataFrames on Index

In the world of Data Science and Machine Learning, it is essential to be fluent in operations for organizing, maintaining, and cleaning data for further analysis. Merging two DataFrames is an example of one such operation. It turns out it is easy to combine two DataFrames using the Pandas library in Python.

Pandas provides us with two useful functions, merge() and join() to combine two DataFrames. Both of these methods are very similar but merge() is considered more versatile and flexible. It also provides many parameters to alter the behavior of the final DataFrame. join() combines the two DataFrames on their indexes, whereas merge() allows us to specify the columns which can act as key to merge two DataFrames.

One common parameter of both these functions about which one should be familiar with is how, which defines the type of join. By default how parameter is inner for merge() and left for join(), but for both it can be changed to left, right, inner, and outer. It is essential to know the difference between all of them.

While combining two Pandas DataFrames, we assume one to be the Left DataFrame and the other to be the Right DataFrame. Both merge() and join() match the records of key columns. The inner join returns a DataFrame of records that match in both DataFrames. The outer join produces a merged DataFrame with all the elements in both DataFrames, filling NaN for missing values on both sides. The left join contains all elements of the left DataFrame but only the matched records of the Right DataFrame. The opposite of left is right, which has all elements of the right DataFrame and only matching records of the Left DataFrame. All this will be more clear in the example code in the coming sections where we will combine the DataFrames in the code below:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(['a','b','d','e','h'],index = [1,2,4,5,7], columns = ['C1'])
df2 = pd.DataFrame(['AA','BB','CC','EE','FF'],index = [1,2,3,5,6], columns = ['C2'])

print(df1)
print(df2)

Output:

  C1
1  a
2  b
4  d
5  e
7  h
   C2
1  AA
2  BB
3  CC
5  EE
6  FF

Use merge() to Combine Two Pandas DataFrames on Index

When merging two DataFrames on the index, the value of left_index and right_index parameters of merge() function should be True. The following code example will combine two DataFrames with inner as the join type:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(['a','b','d','e','h'],index = [1,2,4,5,7], columns = ['C1'])
df2 = pd.DataFrame(['AA','BB','CC','EE','FF'],index = [1,2,3,5,6], columns = ['C2'])

df_inner = df1.merge(df2, how='inner', left_index=True, right_index=True)

print(df_inner)

Output:

  C1  C2
1  a  AA
2  b  BB
5  e  EE

The following code will merge the DataFrames with join type as outer:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(['a','b','d','e','h'],index = [1,2,4,5,7], columns = ['C1'])
df2 = pd.DataFrame(['AA','BB','CC','EE','FF'],index = [1,2,3,5,6], columns = ['C2'])

df_outer = df1.merge(df2, how='outer', left_index=True, right_index=True)

print(df_outer)

Output:

    C1   C2
1    a   AA
2    b   BB
3  NaN   CC
4    d  NaN
5    e   EE
6  NaN   FF
7    h  NaN

As you can see, the merged DataFrame with join type as inner only has matching records from both DataFrames, whereas the one with outer join has all the elements present in them, filling missing records with NaN. Now by using left join:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(['a','b','d','e','h'],index = [1,2,4,5,7], columns = ['C1'])
df2 = pd.DataFrame(['AA','BB','CC','EE','FF'],index = [1,2,3,5,6], columns = ['C2'])

df_left = df1.merge(df2, how='left', left_index=True, right_index=True)

print(df_left)

Output:

  C1   C2
1  a   AA
2  b   BB
4  d  NaN
5  e   EE
7  h  NaN

The above merged DataFrame has all elements from the left DataFrame, and only the matched records from the right DataFrame. Its exact opposite is the right join, as shown below:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(['a','b','d','e','h'],index = [1,2,4,5,7], columns = ['C1'])
df2 = pd.DataFrame(['AA','BB','CC','EE','FF'],index = [1,2,3,5,6], columns = ['C2'])

df_right = df1.merge(df2, how='right', left_index=True, right_index=True)

print(df_right)

Output:

    C1  C2
1    a  AA
2    b  BB
3  NaN  CC
5    e  EE
6  NaN  FF

Use join() to Combine Two Pandas DataFrames on Index

join() method combines the two DataFrames based on their indexes, and by default, the join type is left. It always uses the right DataFrame’s index, but we can mention the key for Left DataFrame. We can specify the join types for join() function same as we mention for merge().

The following example shows merged DataFrame with outer join type:

import pandas as pd
import numpy as np
df1 = pd.DataFrame(['a','b','d','e','h'],index = [1,2,4,5,7], columns = ['C1'])
df2 = pd.DataFrame(['AA','BB','CC','EE','FF'],index = [1,2,3,5,6], columns = ['C2'])
df_outer = df1.join(df2, how='outer')
print(df_outer)

Output:

    C1   C2
1    a   AA
2    b   BB
3  NaN   CC
4    d  NaN
5    e   EE
6  NaN   FF
7    h  NaN
Contribute
DelftStack is a collective effort contributed by software geeks like you. If you like the article and would like to contribute to DelftStack by writing paid articles, you can check the write for us page.

Related Article - Pandas DataFrame

  • Difference Between Pandas apply, map and applymap
  • Count the Frequency a Value Occurs in Pandas Dataframe