在 Pandas join 方法中指定后缀

Suraj Joshi 2023年1月30日
  1. 使用 DataFrame.join() 方法连接两个 DataFrame
  2. 使用 DataFrame.join() 方法连接具有共同列名的 DataFrame
在 Pandas join 方法中指定后缀

本教程解释了如何在 Pandas 中使用 DataFrame.join() 方法加入两个 DataFrame,并在加入时指定后缀。

import pandas as pd

roll_no = [501, 502, 503, 504, 505]

student_df = pd.DataFrame(
    {
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

print("Student DataFrame:")
print(student_df, "\n")

print("Grades DataFrame:")
print(grades_df)

输出:

Student DataFrame:
       Name  Gender  Age
0  Jennifer  Female   17
1    Travis    Male   18
2       Bob    Male   17
3      Emma  Female   16
4      Luna  Female   18
5     Anish    Male   16 

Grades DataFrame:
   Roll No Grades
0      501      A
1      502     B+
2      503     A-
3      504      A
4      505      B
5      506     A+

我们将通过演示连接 students_dfgrades_df 两个 DataFrame 来解释 DataFrame.join() 方法。

使用 DataFrame.join() 方法连接两个 DataFrame

import pandas as pd

roll_no = [501, 502, 503, 504, 505]

student_df = pd.DataFrame(
    {
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

joined_df = student_df.join(grades_df)

print("Student DataFrame:")
print(student_df, "\n")

print("Grades DataFrame:")
print(grades_df, "\n")

print("Joined DataFrame:")
print(joined_df, "\n")

输出:

Student DataFrame:
       Name  Gender  Age
0  Jennifer  Female   17
1    Travis    Male   18
2       Bob    Male   17
3      Emma  Female   16
4      Luna  Female   18
5     Anish    Male   16 

Grades DataFrame:
   Roll No Grades
0      501      A
1      502     B+
2      503     A-
3      504      A
4      505      B
5      506     A+ 

Joined DataFrame:
       Name  Gender  Age  Roll No Grades
0  Jennifer  Female   17      501      A
1    Travis    Male   18      502     B+
2       Bob    Male   17      503     A-
3      Emma  Female   16      504      A
4      Luna  Female   18      505      B
5     Anish    Male   16      506     A+ 

它将 student_dfgrades_df 连接起来,并创建 joined_df。默认情况下,join() 方法使用两个 DataFrame 的索引来连接它们。加入方法默认为 Left Join。在这里,左边 DataFrame 中的所有行,即 student_df 被保存在 joined_df 中,而右边 DataFrame 中的一行与左边 DataFrame 中的一行具有相同的索引值,被加入并放置在同一行中。

使用 DataFrame.join() 方法连接具有共同列名的 DataFrame

如果我们使用 DataFrame.join() 方法试图连接的两个 DataFrames 中都有一个名称相同的列,我们会得到一个错误信息 ValueError: columns overlap but no suffix specified。我们可以在 DataFrame.join() 方法中设置 lsuffixrsuffix 参数的值来解决这个错误。

import pandas as pd

roll_no = [501, 502, 503, 504, 505]

student_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

joined_df = student_df.join(grades_df, lsuffix="_left", rsuffix="_right")

print("Student DataFrame:")
print(student_df, "\n")

print("Grades DataFrame:")
print(grades_df, "\n")

print("Joined DataFrame:")
print(joined_df, "\n")

输出:

Student DataFrame:
   Roll No      Name  Gender  Age
0      501  Jennifer  Female   17
1      502    Travis    Male   18
2      503       Bob    Male   17
3      504      Emma  Female   16
4      505      Luna  Female   18
5      506     Anish    Male   16 

Grades DataFrame:
   Roll No Grades
0      501      A
1      502     B+
2      503     A-
3      504      A
4      505      B
5      506     A+ 

Joined DataFrame:
   Roll No_left      Name  Gender  Age  Roll No_right Grades
0           501  Jennifer  Female   17            501      A
1           502    Travis    Male   18            502     B+
2           503       Bob    Male   17            503     A-
3           504      Emma  Female   16            504      A
4           505      Luna  Female   18            505      B
5           506     Anish    Male   16            506     A+ 

它将 grades_df 加入到 student_df 的右边。DataFrame.join() 并不能合并各个 DataFrame,也就是说,即使 Roll No 列是两个 DataFrame 共同的,它们也会在 join() 之后被作为单独的字段放置。为了区分具有共同名称的列名,我们使用 lsuffixrsuffix 参数为左右两个 DataFrame 中的列提供后缀。

另外,我们也可以通过 DataFrame.merge() 方法,将常用的列名作为 on 参数传入该方法中来解决这个问题。

import pandas as pd

roll_no = [501, 502, 503, 504, 505]

student_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

merged_df = student_df.merge(grades_df, on="Roll No")

print("Student DataFrame:")
print(student_df, "\n")

print("Grades DataFrame:")
print(grades_df, "\n")

print("Merged DataFrame:")
print(merged_df, "\n")

输出:

Student DataFrame:
   Roll No      Name  Gender  Age
0      501  Jennifer  Female   17
1      502    Travis    Male   18
2      503       Bob    Male   17
3      504      Emma  Female   16
4      505      Luna  Female   18
5      506     Anish    Male   16 

Grades DataFrame:
   Roll No Grades
0      501      A
1      502     B+
2      503     A-
3      504      A
4      505      B
5      506     A+ 

Merged DataFrame:
   Roll No      Name  Gender  Age Grades
0      501  Jennifer  Female   17      A
1      502    Travis    Male   18     B+
2      503       Bob    Male   17     A-
3      504      Emma  Female   16      A
4      505      Luna  Female   18      B
5      506     Anish    Male   16     A+ 

它将 DataFrames student_dfgrades_df 合并成一个 DataFrame。在这种情况下,Roll No 列将被合并为两个 DataFrame 的单一列。

作者: Suraj Joshi
Suraj Joshi avatar Suraj Joshi avatar

Suraj Joshi is a backend software engineer at Matrice.ai.

LinkedIn