How to Groupby Index Columns in Pandas

Luqman Khan Feb 02, 2024
  1. Use the groupby() Function to Group by Index Columns in Python Pandas
  2. Use the groupby() Function on a CSV File Data in Python Pandas
How to Groupby Index Columns in Pandas

This tutorial introduces how groupby in Python Pandas categorizes data and applies a function to the categories. Use the groupby() function to group multiple index columns in Pandas with examples.

Use the groupby() Function to Group by Index Columns in Python Pandas

In this post, Pandas DataFrame data.groupby() function divides data into groups based on specific criteria. Pandas objects can be divided into any number of groups along any axis.

A label-to-group-name mapping is the abstract definition of grouping. A groupby operation splits an object, applies a function, and combines the results.

This is useful for grouping large amounts of data and performing operations. Pandas groupby default behavior converts the groupby columns into indexes and removes them from the DataFrame’s list of columns.

Syntax:

DataFrame.groupby(
    by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True
)

Parameters:

by a mapping, a function, a string, or an iterable
axis int, default value 0
level For the Multi-Index axis, group by a specific level or levels (hierarchical).
as_index As the index for aggregated output, return an object with group labels. This is only applicable to DataFrame input. When index=False, the output is grouped in “SQL-style”.
sort The keys in a group should be sorted. Turn this off to improve performance. It should be noted that the order of observations within each group is unaffected by this. Groupby maintains the order of rows within each group.
group_keys Add group keys to index when calling apply to identify pieces squeeze: If possible, reduce the dimensionality of the return type; otherwise, return a consistent type.

Take a DataFrame with two columns: date and item sell. Groupby both date and item sell and get the user’s item-by count.

First, we need to import necessary libraries, pandas and numpy, create three columns, ct, date, and item_sell and pass a set of values to the columns.

import pandas as pd
import numpy as np

data = pd.DataFrame()
data["date"] = ["a", "a", "a", "b"]
data["item_sell"] = ["z", "z", "a", "a"]
data["ct"] = 1
print(data)

Output:

  date item_sell  ct
0    a         z   1
1    a         z   1
2    a         a   1
3    b         a   1

Group by using date and item_sell columns.

import pandas as pd
import numpy as np

data = pd.DataFrame()
data["date"] = ["a", "a", "a", "b"]
data["item_sell"] = ["z", "z", "a", "a"]
data["ct"] = 1
output = pd.pivot_table(data, values="ct", index=["date", "item_sell"], aggfunc=np.sum)
print(output)

Output:

 				ct
date item_sell
a    a           1
     z           2
b    a           1

The groupby() by parameter can now refer to either column names or index level names.

import pandas as pd
import numpy as np

arrays = [
    ["rar", "raz", "bal", "bac", "foa", "foa", "qus", "qus"],
    ["six", "seven", "six", "seven", "six", "seven", "six", "seven"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
data = pd.DataFrame({"C": [1, 1, 1, 1, 2, 2, 3, 3], "D": np.arange(8)}, index=index)
print(data)

Output:

              C  D
first second
rar   six     1  0
raz   seven   1  1
bal   six     1  2
bac   seven   1  3
foa   six     2  4
      seven   2  5
qus   six     3  6
      seven   3  7

Group by second and C and then calculate the sum using the sum function.

import pandas as pd
import numpy as np

arrays = [
    ["rar", "raz", "bal", "bac", "foa", "foa", "qus", "qus"],
    ["six", "seven", "six", "seven", "six", "seven", "six", "seven"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
data = pd.DataFrame({"C": [1, 1, 1, 1, 2, 2, 3, 3], "D": np.arange(8)}, index=index)
output = data.groupby(["second", "C"]).sum()
print(output)

Output:

Groupby by second

Use the groupby() Function on a CSV File Data in Python Pandas

Now use a groupby() function on a CSV file. To download the CSV file used in the code, click [here](Students Performance in Exams | Kaggle).

The CSV file used is about student performance. To group the data based on gender, use the groupby() function.

The read_csv function of the Python Pandas library is used to read the file from your drive. Store the file in a data variable.

import pandas as pd

data = pd.read_csv("/content/drive/MyDrive/CSV/StudentsPerformance.csv")
print(data)

Output:

Dataset of the CSV file

Apply groupby() function.

import pandas as pd

data = pd.read_csv("StudentsPerformance.csv")
std = data.groupby("gender")
print(std.first())

Group by gender

Let us print the value in any of the groups. For this, use the team’s name.

The function get_group is used to find the entries in any group. Find the value contained in the female group.

import pandas as pd

data = pd.read_csv("StudentsPerformance.csv")
std = data.groupby("gender")
print(std.get_group("female"))

Output:

Set group by female value

Use the groupby() function to create more than one category group. To split, use more than one column.

import pandas as pd

data = pd.read_csv("StudentsPerformance.csv")
std_per = data.groupby(["gender", "lunch"])
print(std_per.first())

Output:

Group by Gender Lunch

Groupby() is a versatile function with numerous variants. It makes splitting the DataFrame based on certain criteria very simple and efficient.

Related Article - Pandas Groupby