Pandas 多列合并

Suraj Joshi 2023年1月30日
  1. Pandas DataFrame 不含任何键列的默认合并
  2. Pandas 设置 on 参数的值来指定合并的键值
  3. 使用 left_onright_on 合并 DataFrame
Pandas 多列合并

本教程介绍了如何在 Pandas 中使用 DataFrame.merge() 方法合并两个 DataFrame。

import pandas as pd

roll_no = [501, 502, 503, 504, 505]

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

print("1st DataFrame:")
print(student_df, "\n")

print("2nd DataFrame:")
print(grades_df, "\n")

print("Merged df:")
print(merged_df)

输出:

1st DataFrame:
   Roll No      Name  Gender  Age
0      500  Jennifer  Female   17
1      501    Travis    Male   18
2      503       Bob    Male   17
3      504      Emma  Female   16
4      505      Luna  Female   18
5      506     Anish    Male   16 

2nd DataFrame:
   Roll No      Name Grades
0      501  Jennifer      A
1      502    Travis     B+
2      503       Bob     A-
3      504      Emma      A
4      505      Luna      B
5      506     Anish     A+ 

我们将使用 DataFrame student_dfgrades_df 来演示 DataFrame.merge() 的工作。

Pandas DataFrame 不含任何键列的默认合并

如果我们只使用传递两个 DataFrames 来合并到 merge() 方法,该方法将收集两个 DataFrame 中的所有公共列,并将两个 DataFrame 中的每个公共列替换为一个。

import pandas as pd

roll_no = [501, 502, 503, 504, 505]

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

merged_df = pd.merge(student_df, grades_df)

print("1st DataFrame:")
print(student_df, "\n")

print("2nd DataFrame:")
print(grades_df, "\n")

print("Merged df:")
print(merged_df)

输出:

1st DataFrame:
   Roll No      Name  Gender  Age
0      500  Jennifer  Female   17
1      501    Travis    Male   18
2      503       Bob    Male   17
3      504      Emma  Female   16
4      505      Luna  Female   18
5      506     Anish    Male   16 

2nd DataFrame:
   Roll No      Name Grades
0      501  Jennifer      A
1      502    Travis     B+
2      503       Bob     A-
3      504      Emma      A
4      505      Luna      B
5      506     Anish     A+ 

Merged df:
   Roll No   Name  Gender  Age Grades
0      503    Bob    Male   17     A-
1      504   Emma  Female   16      A
2      505   Luna  Female   18      B
3      506  Anish    Male   16     A+

它将合并 DataFrame student_dfgrades_df,并分配给 merged_df。我们有两列 Roll NoName 是两个 DataFrame 共有的,但 merge() 函数会将每个通用列合并为一列。

Pandas 设置 on 参数的值来指定合并的键值

import pandas as pd

roll_no = [501, 502, 503, 504, 505]

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

merged_df = pd.merge(student_df, grades_df, on="Roll No")

print("1st DataFrame:")
print(student_df, "\n")

print("2nd DataFrame:")
print(grades_df, "\n")

print("Merged df:")
print(merged_df)

输出:

1st DataFrame:
   Roll No      Name  Gender  Age
0      500  Jennifer  Female   17
1      501    Travis    Male   18
2      503       Bob    Male   17
3      504      Emma  Female   16
4      505      Luna  Female   18
5      506     Anish    Male   16 

2nd DataFrame:
   Roll No      Name Grades
0      501  Jennifer      A
1      502    Travis     B+
2      503       Bob     A-
3      504      Emma      A
4      505      Luna      B
5      506     Anish     A+ 

Merged df:
   Roll No  Name_x  Gender  Age    Name_y Grades
0      501  Travis    Male   18  Jennifer      A
1      503     Bob    Male   17       Bob     A-
2      504    Emma  Female   16      Emma      A
3      505    Luna  Female   18      Luna      B
4      506   Anish    Male   16     Anish     A+

这里,我们设置 on="Roll No"merge() 函数将在两个 DataFrame 中找到 Roll No 命名的列,我们在 merged_df 将会只有一个 Roll No 列。虽然 Name 列在两个 DataFrames 中也是通用的,但由于 Name 不作为 on 参数传递,所以我们为左右 DataFrame 的 Name 列单独设置了一列,分别由 Name_xName_y 表示。

使用 left_onright_on 合并 DataFrame

import pandas as pd

roll_no = [501, 502, 503, 504, 505]

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

grades_df = pd.DataFrame(
    {"Id": [501, 502, 503, 504, 505, 506], "Grades": ["A", "B+", "A-", "A", "B", "A+"]}
)

merged_df = pd.merge(student_df, grades_df, left_on="Roll No", right_on="Id")

print("1st DataFrame:")
print(student_df, "\n")

print("2nd DataFrame:")
print(grades_df, "\n")

print("Merged df:")
print(merged_df)

输出:

1st DataFrame:
   Roll No      Name  Gender  Age
0      500  Jennifer  Female   17
1      501    Travis    Male   18
2      503       Bob    Male   17
3      504      Emma  Female   16
4      505      Luna  Female   18
5      506     Anish    Male   16 

2nd DataFrame:
    Id Grades
0  501      A
1  502     B+
2  503     A-
3  504      A
4  505      B
5  506     A+ 

Merged df:
   Roll No    Name  Gender  Age   Id Grades
0      501  Travis    Male   18  501      A
1      503     Bob    Male   17  503     A-
2      504    Emma  Female   16  504      A
3      505    Luna  Female   18  505      B
4      506   Anish    Male   16  506     A+

如果我们要合并的一列在 DataFrames 中有不同的列名,我们可以使用 left_onright_on 参数。left_on 将被设置为左边 DataFrame 中的列名,right_on 将被设置为右边 DataFrame 中的列名。

作者: Suraj Joshi
Suraj Joshi avatar Suraj Joshi avatar

Suraj Joshi is a backend software engineer at Matrice.ai.

LinkedIn