# Pandas cut() vs qcut() Functions

Binning continuous numeric data into buckets is often useful for further analysis. Binning is also called bucketing, discrete binning, discretization, or quantization.

## Pandas `cut()` Function

The pandas `cut()` function sorts the elements of an array into bins. It is mainly useful for turning a continuous variable into a categorical one, for example converting ages into age groups.

Syntax:

``````pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
``````
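As a minimal sketch of how the `bins` and `labels` parameters work together, the snippet below groups a few ages into named categories. The ages, edges, and label names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical ages, used only for illustration
ages = pd.Series([5, 17, 18, 34, 65, 80])

# Four edges define three bins; with the default right=True each bin is (left, right]
age_groups = pd.cut(ages, bins=[0, 17, 64, 120], labels=["minor", "adult", "senior"])

print(age_groups.tolist())
```

Because the right edges are inclusive by default, the age `17` lands in the first bin rather than the second.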

## Pandas `qcut()` Function

According to the pandas documentation, `qcut()` is a quantile-based discretization function. In other words, `qcut()` tries to create bins that hold equal numbers of values. Rather than taking numeric bin edges from the user, it derives the edges from quantiles (percentiles) of the data's distribution.

Syntax:

``````pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
``````
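The following sketch illustrates the `q`, `labels`, and `retbins` parameters on a small, made-up series. The data and the label names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: the integers 1 through 12
values = pd.Series(range(1, 13))

# q=3 requests three quantile-based bins; retbins=True also returns the computed edges
binned, edges = pd.qcut(values, q=3, labels=["low", "mid", "high"], retbins=True)

print(binned.value_counts().sort_index())
print(edges)
```

Each label should end up with four of the twelve values, and `edges` holds the four boundaries that `qcut()` derived from the data.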

## Difference Between `cut()` and `qcut()` Functions

In short, the key distinction between `cut()` and `qcut()` is what you control. Use `qcut()` when you want the items to be distributed evenly across the bins, and use `cut()` when you want to define your own numeric bin ranges.

We are going to learn this difference in the example given below.

Code Example:

``````# import libraries
import numpy as np
import pandas as pd

# create a data frame of random values (no seed is set, so the output differs on every run)
df = pd.DataFrame({
    "column_x": np.random.randint(1, 50, size=50),
    "column_y": np.random.randint(20, 100, size=50),
    "column_z": np.random.random(size=50).round(2),
})

# show the first five rows
df.head()
``````

Output:

``````      column_x  column_y  column_z
0         6        68      0.70
1        30        83      0.50
2        35        64      0.41
3        28        98      0.73
4         5        24      0.79
``````

The first two columns hold integers from `1` to `49` and from `20` to `99`, respectively, because the upper bound of `np.random.randint` is exclusive. The third column holds floats between `0` and `1`. All values are generated randomly with `numpy` routines, so your output will differ from the one shown.

The `cut()` function divides the entire value range into bins that each cover the same width. The first column (`column_x`) holds integers between `1` and `49`. Let's check this column's lowest and highest values first.

Code Example:

``````df.column_x.max(), df.column_x.min()
``````

Output:

``````(49, 3)
``````

If we divide this column into 5 equal parts, for instance, we will get the size of each bin as `9.2`, like the following.

$$
(49 - 3) / 5 = 9.2
$$

This binning process is carried out by the `cut()` function, which places each value in the appropriate bin.
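To make the arithmetic concrete, here is a small sketch that computes the equal-width edges by hand with `np.linspace`. It only mirrors the calculation above, using the minimum `3` and maximum `49` from the earlier output; it is not a description of how `cut()` is implemented internally.

```python
import numpy as np

# Minimum, maximum, and bin count taken from the example above
lowest, highest, n_bins = 3, 49, 5

# Width of each bin and the six edges that delimit five equal-width bins
bin_width = (highest - lowest) / n_bins
edges = np.linspace(lowest, highest, n_bins + 1)

print(bin_width)
print(edges)
```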

Code Example:

``````df["column_x_binned"] = pd.cut(df.column_x, bins=5)
df.column_x_binned.value_counts()
``````

Output:

``````(21.4, 30.6]     16
(39.8, 49.0]     14
(12.2, 21.4]      8
(30.6, 39.8]      6
(2.954, 12.2]     6
``````

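As you can see, every bin is exactly `9.2` wide, except for the lowest one, which is slightly wider. Because the lower bound of each interval is exclusive, a first bin starting at exactly `3` would leave out the minimum value.

To include it, `cut()` sets the lowest bin's lower bound slightly below the minimum value `3` (here `2.954`).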
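The adjusted lower bound can be inspected with `retbins=True`. When `bins` is an integer and `right=True`, pandas lowers the first edge by 0.1% of the data range so the minimum is not excluded. The sketch below uses a small deterministic series (an assumption, chosen to share the minimum `3` and maximum `49` with the random data) so the result is reproducible.

```python
import pandas as pd

# Deterministic stand-in for column_x with the same minimum (3) and maximum (49)
values = pd.Series([3, 10, 20, 30, 40, 49])

# retbins=True returns the computed edges alongside the binned values
binned, edges = pd.cut(values, bins=5, retbins=True)

print(edges)                # the first edge sits slightly below 3
print(binned.isna().sum())  # number of values that fell outside every bin
```

Here 0.1% of the range is `(49 - 3) * 0.001 = 0.046`, which gives the lower bound `3 - 0.046 = 2.954` seen in the output above.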

You can also define custom bins by specifying the bin boundaries manually. The `bins` argument accepts the edge values as a list.

Code Example:

``````pd.cut(df.column_x, bins=[0, 10, 40, 50]).value_counts()
``````

Output:

``````(10, 40]    33
(40, 50]    13
(0, 10]      4
``````

By default, the right edges are inclusive. However, this can be modified.

Code Example:

``````pd.cut(df.column_x, bins=[0, 10, 40, 50], right=False).value_counts()
``````

Output:

``````[10, 40)    33
[40, 50)    13
[0, 10)      4
``````
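The counts above happen to be identical for both settings, so the effect of `right` is easier to see with values that sit exactly on an edge. The following sketch uses two hypothetical values, `10` and `40`, for that purpose.

```python
import pandas as pd

# Both values sit exactly on a bin edge
values = pd.Series([10, 40])
edges = [0, 10, 40, 50]

# Default: intervals are closed on the right, e.g. (0, 10]
right_closed = pd.cut(values, bins=edges)

# right=False: intervals are closed on the left, e.g. [10, 40)
left_closed = pd.cut(values, bins=edges, right=False)

print(right_closed.astype(str).tolist())
print(left_closed.astype(str).tolist())
```

With the default setting each value falls into the bin that ends at it; with `right=False` it moves to the bin that starts at it.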

With the `cut()` function, you have no direct control over how many values fall into each bin. You can only define the bin edges.

This is where the `qcut()` function comes in. It divides the values into buckets so that each bucket holds roughly the same number of values.

Code Example:

``````pd.qcut(df.column_x, q=4).value_counts()
``````

Output:

``````(40.75, 49.0]    13
(19.5, 25.0]     13
(2.999, 19.5]    13
(25.0, 40.75]    11
``````

Each of our `4` buckets holds approximately the same number of values. When there are four buckets, they are known as quartiles.

The first bucket contains about one-fourth of all values, the first two buckets together contain about fifty percent, and so on.

We do not control the bin edges with the `qcut()` function. They are automatically calculated.

Consider a column that contains `40` values (40 rows) that we want to split into `4` buckets. The upper edge of the first bucket is chosen so that the bucket contains the `10` smallest values (assuming no ties at the boundary).
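This scenario can be sketched with 40 distinct values. The series below (the integers 1 through 40) is an assumption chosen so that there are no ties and the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# 40 distinct values: 1, 2, ..., 40
values = pd.Series(np.arange(1, 41))

# Four quantile-based buckets; retbins=True also returns the computed edges
quartiles, edges = pd.qcut(values, q=4, retbins=True)

print(quartiles.value_counts().sort_index())
print(edges)
```

Each of the four buckets should contain 10 values, and the first interior edge falls between the 10th and 11th smallest values.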

## Conclusion

A set of continuous values can be transformed into a discrete or categorical variable using either the `cut()` or `qcut()` functions.

The `cut()` function is concerned with the value range of the bins. The whole range is determined by the difference between the smallest and largest values.

The entire range is then divided into the desired number of bins. By default, every bin has the same width (the distance between its lower and upper edges); what varies is the number of values in each bin.

The `qcut()` function focuses on the number of values in each bin. Conceptually, the values are sorted in ascending order and the bin edges are placed at quantiles, so each bin receives roughly the same number of values.
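The contrast is clearest on skewed data. The sketch below applies both functions to a small hypothetical series with one large outlier; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical skewed data: nine small values and one outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# cut(): two bins of equal width, so the outlier sits alone in the upper bin
equal_width = pd.cut(values, bins=2)

# qcut(): two bins of equal frequency, split at the median
equal_freq = pd.qcut(values, q=2)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

With `cut()`, the bins have the same width but very different counts; with `qcut()`, the counts are balanced but the bins have very different widths.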

Zeeshan is a detail-oriented software engineer who helps companies and individuals make their lives easier with software solutions.