Pandas DataFrame.sample() Function

Minahil Noor Jan 30, 2023
  1. Syntax of pandas.DataFrame.sample()
  2. Example Codes: DataFrame.sample()
  3. Example Codes: DataFrame.sample() to Extract the Columns
  4. Example Codes: DataFrame.sample() to Generate a Fraction of Data
  5. Example Codes: DataFrame.sample() to Oversample the DataFrame
  6. Example Codes: DataFrame.sample() With weights

The Python Pandas DataFrame.sample() function returns a random sample of rows or columns from a DataFrame. The sample can contain one or more rows or columns.

Syntax of pandas.DataFrame.sample()

DataFrame.sample(
    n=None, frac=None, replace=False, weights=None, random_state=None, axis=None
)

Parameters

n It is an integer. It represents the number of rows or columns to be randomly selected from the DataFrame. It cannot be used together with frac. If both n and frac are omitted, a single item is returned.
frac It is a float value. It specifies the fraction of rows or columns to be randomly extracted from the DataFrame. For example, frac=0.45 means that the selected rows or columns will make up about 45% of the original data.
replace It is a boolean value. If it is set to True, sampling is done with replacement, so the same row or column can be selected more than once.
weights It is a string or an array-like structure. When the function is called on a DataFrame with axis=0, it also accepts the name of a column. Rows with greater values in the weights column are more likely to be included in the sample. By default, every row has the same probability of being selected.
random_state It is an integer or a numpy.random.RandomState object. It seeds the random number generator, so passing the same integer returns the same sample on every run.
axis It is an integer or a string. It specifies the axis to sample from: 0 or 'index' for rows, and 1 or 'columns' for columns. For a DataFrame, rows are sampled by default.
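As a minimal sketch of how random_state affects the result, the snippet below draws the same sample twice with the same seed. The small DataFrame and the seed value 42 are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

# Illustrative data (assumed for this sketch).
df = pd.DataFrame(
    {
        "Name": ["Olivia", "John", "Laura", "Ben", "Kevin"],
        "Attendance": [60, 100, 80, 75, 95],
    }
)

# The same integer seed yields the same sample on every call.
first = df.sample(n=2, random_state=42)
second = df.sample(n=2, random_state=42)

print(first.equals(second))  # True
```

Omitting random_state gives a different sample on most runs, which is usually what we want outside of tests and reproducible reports.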

Return

It returns a Series or a DataFrame. The returned object has the same type as the caller and contains n items selected randomly from it.
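The following sketch illustrates the return type, assuming a small illustrative DataFrame: sampling a DataFrame gives a DataFrame, while sampling one of its columns (a Series) gives a Series.

```python
import pandas as pd

# Illustrative data (assumed for this sketch).
df = pd.DataFrame(
    {
        "Name": ["Olivia", "John", "Laura", "Ben", "Kevin"],
        "Attendance": [60, 100, 80, 75, 95],
    }
)

row_sample = df.sample(n=2)            # called on a DataFrame
value_sample = df["Name"].sample(n=2)  # called on a Series

print(type(row_sample).__name__)    # DataFrame
print(type(value_sample).__name__)  # Series
```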

Example Codes: DataFrame.sample()

By default, the function returns a sample containing rows, i.e. axis=0.

import pandas as pd

dataframe = pd.DataFrame({'Attendance': {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
                          'Name': {0: 'Olivia', 1: 'John', 2: 'Laura', 3: 'Ben', 4: 'Kevin'},
                          'Obtained Marks': {0: 56, 1: 75, 2: 82, 3: 64, 4: 67}})
print(dataframe)

Our DataFrame is as below.

   Attendance    Name  Obtained Marks
0          60  Olivia              56
1         100    John              75
2          80   Laura              82
3          75     Ben              64
4          95   Kevin              67

All the parameters of this function are optional. If we execute this function without passing any parameter, it returns a single random row as an output.

import pandas as pd

dataframe = pd.DataFrame({'Attendance': {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
                          'Name': {0: 'Olivia', 1: 'John', 2: 'Laura', 3: 'Ben', 4: 'Kevin'},
                          'Obtained Marks': {0: 56, 1: 75, 2: 82, 3: 64, 4: 67}})
dataframe1 = dataframe.sample()
print(dataframe1)

Output1:

   Attendance Name  Obtained Marks
3          75  Ben              64

Output2:

   Attendance   Name  Obtained Marks
4          95  Kevin              67

Output1 and Output2 show the results of running the same program twice. Each run draws a new random row from the given DataFrame, so the output usually differs between runs.

Example Codes: DataFrame.sample() to Extract the Columns

To sample columns instead of rows, we simply set axis to 1.

import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
        "Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
    }
)
dataframe1 = dataframe.sample(n=1, axis=1)
print(dataframe1)

Output:

     Name
0  Olivia
1    John
2   Laura
3     Ben
4   Kevin

The function has returned a sample containing a single, randomly chosen column. The number of columns is set by the parameter n=1.
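As a small variation, the sketch below samples two columns at once and uses the string form of the axis argument. The variable name two_columns is chosen here for illustration only.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
        "Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
    }
)

# axis="columns" is equivalent to axis=1; n=2 selects two distinct columns.
two_columns = dataframe.sample(n=2, axis="columns")

print(two_columns.shape)  # (5, 2): all five rows, two of the three columns
print(list(two_columns.columns))
```

Which two columns are returned varies from run to run, but every row is kept because only the column axis is sampled.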

Example Codes: DataFrame.sample() to Generate a Fraction of Data

import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
        "Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
    }
)
dataframe1 = dataframe.sample(frac=0.5)
print(dataframe1)

Output:

   Attendance   Name  Obtained Marks
3          75    Ben              64
4          95  Kevin              67
1         100   John              75

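The returned sample is approximately 50% of the original data. The number of rows is frac multiplied by the total number of rows, rounded to a whole number.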
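To make the relationship between frac and the sample size concrete, here is a minimal sketch using a hypothetical ten-row DataFrame, chosen so that the fraction yields a whole number of rows.

```python
import pandas as pd

# Hypothetical ten-row DataFrame (not part of the original examples).
scores = pd.DataFrame({"Student": range(1, 11), "Marks": range(50, 100, 5)})

half = scores.sample(frac=0.5)   # 0.5 * 10 rows
fifth = scores.sample(frac=0.2)  # 0.2 * 10 rows

print(len(half))   # 5
print(len(fifth))  # 2
```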

Example Codes: DataFrame.sample() to Oversample the DataFrame

If frac>1, the parameter replace must be set to True so that the same row can be sampled more than once; otherwise, the function raises a ValueError.

import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
        "Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
    }
)
dataframe1 = dataframe.sample(frac=1.5, replace=True)
print(dataframe1)

Output:

   Attendance   Name  Obtained Marks
3          75     Ben              64
0          60  Olivia              56
1         100    John              75
2          80   Laura              82
1         100    John              75
2          80   Laura              82
0          60  Olivia              56
4          95   Kevin              67

If replace is set to False while frac is larger than 1, then it raises a ValueError.

import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
        "Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
    }
)
dataframe1 = dataframe.sample(frac=1.5, replace=False)
print(dataframe1)

Output:

Traceback (most recent call last):
  File "..\test.py", line 6, in <module>
    dataframe1 = dataframe.sample(frac=1.5, replace=False)
  File "..\lib\site-packages\pandas\core\generic.py", line 5044, in sample
    raise ValueError(
ValueError: Replace has to be set to `True` when upsampling the population `frac` > 1.
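If the value of frac is not known in advance, one possible way to handle this error is to catch it and retry with replacement. The following is a sketch of that idea under the same example data, not a recommended pattern for every case.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
        "Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
    }
)

try:
    # Upsampling without replacement is not possible, so this raises.
    oversampled = dataframe.sample(frac=1.5, replace=False)
except ValueError as error:
    print(f"Sampling failed: {error}")
    # Fall back to sampling with replacement.
    oversampled = dataframe.sample(frac=1.5, replace=True)

print(len(oversampled) > len(dataframe))  # True
```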

Example Codes: DataFrame.sample() With weights

import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
        "Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
    }
)
dataframe1 = dataframe.sample(n=2, weights="Attendance")
print(dataframe1)

Output:

   Attendance   Name  Obtained Marks
1         100   John              75
4          95  Kevin              67

Here, the rows with greater values in the Attendance column are more likely to be selected in the returned sample. The selection is still random, so rows with lower values can also appear.
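The effect of weights is easier to see over many draws. The sketch below samples with replacement a large number of times and compares how often each name appears. The draw count, the seed, and the custom_weights list are assumptions made for illustration.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
    }
)

# Draw many rows with replacement so the selection frequencies become visible.
draws = dataframe.sample(
    n=10_000, replace=True, weights="Attendance", random_state=0
)
frequencies = draws["Name"].value_counts(normalize=True)
print(frequencies)  # names with higher Attendance appear more often

# A weight of zero excludes a row entirely (hypothetical weights).
custom_weights = [0, 1, 1, 1, 1]
without_first = dataframe.sample(n=4, weights=custom_weights, random_state=0)
print("Olivia" in without_first["Name"].values)  # False
```

The weights do not need to sum to 1; pandas normalizes them before sampling.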
