Pandas groupby 計數

Suraj Joshi 2023年1月30日
  1. 使用 Series.value_counts() 方法計算唯一行值的數值
  2. 使用 DataFrame.groupby() 函式對 DataFrame 組的值進行計數
  3. 使用 pandas.DataFrame.agg() 方法獲取每組的多個統計值
Pandas groupby 計數

本教程解釋瞭如何從 DataFrame.groupby() 方法中獲取像 countsummax 等派生組的統計資料。

我們將用上面例子中所示的 automobile_data_df 來解釋這些概念。DataFrame 由員工和他們使用的汽車和自行車品牌組成。

import pandas as pd

automobile_data = [
    ["Chevrolet", "Yamaha"],
    ["Toyota", "Ducati"],
    ["Volkswagen", "BMW"],
    ["Cherlovet", "Kawasaki"],
    ["Toyota", "Ducati"],
    ["Toyota", "BMW"],
    ["Volkswagen", "BMW"],
]

column_names = ["Car Brand", "Motorbike Brand"]
index_names = ["Steven", "Robert", "Jimmy", "Mike", "Phoenix", "Scott", "John"]
automobile_data_df = pd.DataFrame(
    automobile_data, index=index_names, columns=column_names
)
print(automobile_data_df)

輸出:

          Car Brand Motorbike Brand
Steven    Chevrolet          Yamaha
Robert       Toyota          Ducati
Jimmy    Volkswagen             BMW
Mike      Cherlovet        Kawasaki
Phoenix      Toyota          Ducati
Scott        Toyota             BMW
John     Volkswagen             BMW

使用 Series.value_counts() 方法計算唯一行值的數值

如果我們想要一個 DataFrame 中單列的每個唯一值的計數,我們可以使用 Series.value_counts() 方法。

import pandas as pd

automobile_data = [
    ["Chevrolet", "Yamaha"],
    ["Toyota", "Ducati"],
    ["Volkswagen", "BMW"],
    ["Cherlovet", "Kawasaki"],
    ["Toyota", "Ducati"],
    ["Toyota", "BMW"],
    ["Volkswagen", "BMW"],
]

column_names = ["Car Brand", "Motorbike Brand"]
index_names = ["Steven", "Robert", "Jimmy", "Mike", "Phoenix", "Scott", "John"]
automobile_data_df = pd.DataFrame(
    automobile_data, index=index_names, columns=column_names
)

print(automobile_data_df["Car Brand"].value_counts())

輸出:

Toyota        3
Volkswagen    2
Cherlovet     1
Chevrolet     1
Name: Car Brand, dtype: int64

它將列印 automobile_data_dfCar Brand 列中所有的唯一值及其計數。該方法只適用於我們希望一次只看到一列的數值計數。

使用 DataFrame.groupby() 函式對 DataFrame 組的值進行計數

我們使用 DataFrame.groupby() 函式將一個 DataFrame 分割成多個組,然後使用 size() 方法計算每個建立組的值。

import pandas as pd

automobile_data = [
    ["Chevrolet", "Yamaha"],
    ["Toyota", "Ducati"],
    ["Volkswagen", "BMW"],
    ["Cherlovet", "Kawasaki"],
    ["Toyota", "Ducati"],
    ["Toyota", "BMW"],
    ["Volkswagen", "BMW"],
]

column_names = ["Car Brand", "Motorbike Brand"]
index_names = ["Steven", "Robert", "Jimmy", "Mike", "Phoenix", "Scott", "John"]
automobile_data_df = pd.DataFrame(
    automobile_data, index=index_names, columns=column_names
)

group = automobile_data_df.groupby(["Car Brand", "Motorbike Brand"])
print(group.size().reset_index(name="counts"))

輸出:

    Car Brand Motorbike Brand  counts
0   Cherlovet        Kawasaki       1
1   Chevrolet          Yamaha       1
2      Toyota             BMW       1
3      Toyota          Ducati       2
4  Volkswagen             BMW       2

它列印一個 DatFrame,其中包含了所有來自 automobile_data_dfCar BrandMotorbike Brand 列的值的唯一組合,以及這些組合在 automobile_data_df DataFrame 中出現的次數。

groupby() 方法將 automobile_data_df 分割成組。Car BrandMotorbike Brand 兩列數值相同的行將被放置在同一組中。size() 方法會給出每組的值的數量,最後我們根據每組的值的數量生成 DataFrame。

使用 pandas.DataFrame.agg() 方法獲取每組的多個統計值

import pandas as pd

automobile_data = [
    ["Chevrolet", "Yamaha", 4000],
    ["Toyota", "Ducati", 5000],
    ["Volkswagen", "BMW", 4500],
    ["Cherlovet", "Kawasaki", 3800],
    ["Toyota", "Ducati", 4500],
    ["Toyota", "BMW", 5000],
    ["Volkswagen", "BMW", 6000],
]

column_names = ["Car Brand", "Motorbike Brand", "Monthly Salary"]
index_names = ["Steven", "Robert", "Jimmy", "Mike", "Phoenix", "Scott", "John"]
automobile_data_df = pd.DataFrame(
    automobile_data, index=index_names, columns=column_names
)

group = automobile_data_df.groupby(["Car Brand", "Motorbike Brand"])

print(group.agg({"Monthly Salary": [min, max, sum]}))

輸出:

                           Monthly Salary             
                                      min   max    sum
Car Brand  Motorbike Brand                            
Cherlovet  Kawasaki                  3800  3800   3800
Chevrolet  Yamaha                    4000  4000   4000
Toyota     BMW                       5000  5000   5000
           Ducati                    4500  5000   9500
Volkswagen BMW                       4500  6000  10500

它顯示每個生成組的 Monthly Salary 值的最小值、最大值和總和。

作者: Suraj Joshi
Suraj Joshi avatar Suraj Joshi avatar

Suraj Joshi is a backend software engineer at Matrice.ai.

LinkedIn

相關文章 - Pandas Groupby