How to Perform Stratified Sampling in Pandas

Preet Sanghavi Feb 02, 2024
  1. Stratified Sampling in Statistics
  2. Perform Stratified Sampling in Pandas

This tutorial shows how to perform stratified sampling on a pandas data frame.

Stratified Sampling in Statistics

Stratified sampling is a strategy for obtaining samples that are representative of the population. The population is separated into homogeneous groups called strata, and data is then sampled at random from each stratum, which reduces bias in sample selection.

In statistics, stratified sampling is typically used when the mean values of the strata differ. In machine learning, it is frequently used to construct test datasets for evaluating models, especially when a dataset is large and imbalanced.

Perform Stratified Sampling in Pandas

The first step is to import the pandas library.

import pandas as pd

Let us now learn the steps involved in stratified sampling.

  1. Separate the population into strata. The population is divided into strata based on shared characteristics, and each individual must belong to exactly one stratum.
  2. Determine the sample size. Decide how many individuals the overall sample should contain.
  3. Randomly sample each stratum. Random samples are drawn from each stratum in one of two ways. In disproportionate sampling, every stratum contributes the same number of individuals regardless of its population size. In proportionate sampling, the number drawn from each stratum is proportional to its population size.
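
The two allocation rules in step 3 can be sketched with a little arithmetic. The stratum sizes (5, 3, 2) and the target sample of 6 are illustrative assumptions, and the variable names are hypothetical rather than part of any pandas API.

```python
import pandas as pd

# Assumed stratum sizes and overall sample size, for illustration only
stratum_sizes = pd.Series({"A": 5, "B": 3, "C": 2})
total_sample = 6

# Disproportionate allocation: the same count from every stratum
equal_alloc = pd.Series(total_sample // len(stratum_sizes), index=stratum_sizes.index)

# Proportionate allocation: every stratum contributes the same fraction,
# rounded to a whole number of individuals
fraction = total_sample / stratum_sizes.sum()
prop_alloc = (stratum_sizes * fraction).round().astype(int)

print(equal_alloc.to_dict())
print(prop_alloc.to_dict())
```

Rounding means the proportionate counts will not always add up exactly to the target, so it is worth checking the total when the strata are small.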

We will now work through an example of both disproportionate and proportionate stratified sampling. From a population of 10 students, we will sample 6, using their grades as the strata.

Let us first create a sample data frame to work on. It has four columns: Name, ID, Grade, and Category.

We create this data frame using the code below.

students = {
    "Name": [
        "sanay",
        "shivesh",
        "rutwik",
        "preet",
        "yash",
        "mann",
        "pritesh",
        "hritesh",
        "raj",
        "tarun",
    ],
    "ID": ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"],
    "Grade": ["A", "A", "C", "B", "B", "B", "C", "A", "A", "A"],
    "Category": [2, 3, 1, 3, 2, 3, 3, 1, 2, 1],
}
df = pd.DataFrame(students)
print(df)

Output:

      Name   ID Grade  Category
0    sanay  001     A         2
1  shivesh  002     A         3
2   rutwik  003     C         1
3    preet  004     B         3
4     yash  005     B         2
5     mann  006     B         3
6  pritesh  007     C         3
7  hritesh  008     A         1
8      raj  009     A         2
9    tarun  010     A         1

Note that 50 percent of the students are in grade A, 30 percent are in grade B, and 20 percent are in grade C. We will now perform disproportionate sampling, creating a sample of 6 students.

For disproportionate sampling, group the students by grade (A, B, C), then use the sample function to draw 2 students at random from each group. Because the rows are chosen at random, the output will differ from run to run. We do this using the code below.
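
As a quick check of these percentages, the share of each grade can be computed with `value_counts(normalize=True)`. This is a minimal sketch that rebuilds only the Grade column from the data frame above; the variable names are illustrative.

```python
import pandas as pd

# Grade column copied from the sample data frame above
grades = pd.Series(["A", "A", "C", "B", "B", "B", "C", "A", "A", "A"], name="Grade")

# Fraction of students in each grade, ordered by grade label
shares = grades.value_counts(normalize=True).sort_index()
print(shares)
```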

# Group the rows by grade, then draw 2 random rows from each group
df.groupby("Grade", group_keys=False).apply(lambda x: x.sample(2))

Output:

      Name   ID Grade  Category
0    sanay  001     A         2
7  hritesh  008     A         1
5     mann  006     B         3
4     yash  005     B         2
2   rutwik  003     C         1
6  pritesh  007     C         3
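
An alternative worth knowing is the `sample` method available directly on grouped data frames (pandas 1.1 and later), which avoids `apply` and accepts a `random_state` for reproducible draws. The sketch below rebuilds a two-column version of the data frame so it runs on its own; the seed 42 is an arbitrary choice. Recent pandas releases may also warn about grouping columns when the `apply` form is used, which this form sidesteps.

```python
import pandas as pd

# Two columns of the tutorial's data frame, rebuilt so the sketch is self-contained
df = pd.DataFrame({
    "Name": ["sanay", "shivesh", "rutwik", "preet", "yash",
             "mann", "pritesh", "hritesh", "raj", "tarun"],
    "Grade": ["A", "A", "C", "B", "B", "B", "C", "A", "A", "A"],
})

# Draw exactly 2 rows from every grade; the seed is arbitrary and only
# makes the draw repeatable
equal_sample = df.groupby("Grade").sample(n=2, random_state=42)

# Each grade should appear exactly twice
print(equal_sample["Grade"].value_counts().sort_index())
```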

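For proportionate sampling, group the students by grade (A, B, C) with groupby(), then draw the same fraction of rows from each group. Here the fraction is 0.6, and pandas rounds the resulting count within each group: grade A contributes 3 of its 5 students, grade B contributes 2 of its 3 (1.8 rounded up), and grade C contributes 1 of its 2 (1.2 rounded down). The overall sample is therefore 6 students, or 60% of the population.

We perform this using the code below.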

# Group the rows by grade, then draw 60% of the rows in each group
df.groupby("Grade", group_keys=False).apply(lambda x: x.sample(frac=0.6))

Output:

      Name   ID Grade  Category
7  hritesh  008     A         1
9    tarun  010     A         1
0    sanay  001     A         2
3    preet  004     B         3
5     mann  006     B         3
6  pritesh  007     C         3
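
To confirm that the sample keeps roughly the same grade mix as the population, the per-grade counts can be compared. This is a minimal sketch using the grouped `sample` method (pandas 1.1 and later) with an arbitrary seed of 0; the data frame is rebuilt with two columns so the block runs on its own.

```python
import pandas as pd

# Two columns of the tutorial's data frame, rebuilt so the sketch is self-contained
df = pd.DataFrame({
    "Name": ["sanay", "shivesh", "rutwik", "preet", "yash",
             "mann", "pritesh", "hritesh", "raj", "tarun"],
    "Grade": ["A", "A", "C", "B", "B", "B", "C", "A", "A", "A"],
})

# Draw 60% of each grade group; the seed is arbitrary
prop_sample = df.groupby("Grade").sample(frac=0.6, random_state=0)

# Compare counts per grade in the population and in the sample
comparison = pd.DataFrame({
    "population": df["Grade"].value_counts(),
    "sample": prop_sample["Grade"].value_counts(),
}).sort_index()
print(comparison)
```

Which students are selected depends on the seed, but the counts per grade are fixed by the group sizes and the fraction.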

With these two approaches, we can perform both disproportionate and proportionate stratified sampling on a pandas data frame.
