Bin Data Using SciPy, NumPy and Pandas in Python

Zeeshan Afridi Oct 10, 2023
  1. Binning in Python
  2. Importance of Data Binning
  3. Different Ways to Bin Data in Python

As data volumes and use cases grow, data binning (grouping values into categories) becomes a practical way to make sense of the data.

Binning sits alongside other ways of summarizing data, such as clustering or more classical statistical techniques like regression analysis.

We will see why you need data binning and which technique is best suited for which context.

Binning in Python

Binning is a simple but powerful analytical technique for exploring the relationship between variables.

Binning is a non-parametric and highly flexible technique in which the values of a variable are sorted into groups to reveal patterns and trends. It applies to a wide range of data sets, including small samples.

Binning is a process of grouping data into bins. It can be done for various purposes, such as to group data points by range, group data points by density, or group data points by similarity.
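The difference between grouping by range and grouping by density can be sketched with NumPy alone. This is only an illustration on made-up numbers: the `values` array and the choice of four bins are assumptions, not part of the original article. Equal-width edges split the value range evenly, while quantile edges aim for a similar number of points per bin.

```python
import numpy as np

# Made-up sample with one large outlier (illustrative data only)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# Grouping by range: 4 equal-width bins need 5 evenly spaced edges
width_edges = np.linspace(values.min(), values.max(), 5)
width_counts, _ = np.histogram(values, bins=width_edges)

# Grouping by density: edges placed at the quartiles of the data
freq_edges = np.quantile(values, [0, 0.25, 0.5, 0.75, 1])
freq_counts, _ = np.histogram(values, bins=freq_edges)

print("equal-width counts:    ", width_counts)  # most points land in the first bin
print("equal-frequency counts:", freq_counts)   # points are spread across the bins
```

With an outlier in the data, equal-width bins leave most bins nearly empty, which is one reason to consider quantile-based edges.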

There are various ways to bin data in Python, such as the numpy.digitize() function, the pandas.cut() function, and the scipy.stats.binned_statistic() function.

Every method has pros and cons, so choosing a suitable method for the task is essential.

Importance of Data Binning

Data binning is a simple concept: classifying data for more straightforward analysis. For example, you might have several large data tables in a CSV, and you want to break the data into smaller chunks.

Data binning puts the data into groups so you can analyze it more easily, and the same groups can be used to build clearer visualizations.

So, why is data binning necessary? First, data binning is essential because it helps you analyze your data better. For example, you can split an entire data table into smaller chunks that are easier to understand or visualize.

Data binning can help you find patterns in the data and make it easier to identify outliers. It allows you to take a massive data set and make it more manageable to get to the meat of the problem.

Data binning is a process of subdividing a continuous variable into discrete bins. As a rough example, if you have a patient’s temperature variable, you can bin the temperature into five bins (say, < 36.5, 36.5–37.5, 37.5–38.5, 38.5–39.5 and > 39.5).
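The temperature example can be sketched with numpy.digitize(). The readings in `temps` are made up, and the decision to treat each bin as closed on the left (so 37.5 falls into the 37.5-38.5 bin) is an assumption, since the ranges above do not say which side owns a boundary value.

```python
import numpy as np

# Inner bin edges taken from the ranges in the text
edges = np.array([36.5, 37.5, 38.5, 39.5])

# One label per bin: four edges produce five bins (assumed left-closed)
labels = ["< 36.5", "36.5-37.5", "37.5-38.5", "38.5-39.5", ">= 39.5"]

# Hypothetical patient readings in degrees Celsius
temps = np.array([36.1, 36.8, 37.5, 38.9, 40.2])

# digitize returns, for each reading, the index of the bin it falls into
idx = np.digitize(temps, edges)
binned = [labels[i] for i in idx]

for t, b in zip(temps, binned):
    print(t, "->", b)
```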

The advantage is that you can then visualize the variable in a histogram or box plot using the bin ranges.

Different Ways to Bin Data in Python

There are several ways to bin data in Python, and the SciPy and NumPy libraries are among the most common and efficient choices.

Use SciPy and NumPy to Bin Data in Python

To start with SciPy and NumPy, let’s say you have an array of data points you want to bin. The first step is to import the libraries:

import numpy as np
from scipy import stats

Next, you’ll need to define the edges of the bins. It can be done using the linspace function. Note that num_bins bins require num_bins + 1 edges:

bin_edges = np.linspace(start, stop, num=num_bins + 1)

Here start and stop are the minimum and maximum values of the data, respectively, and num_bins is the number of bins you want to create. Finally, you can use NumPy’s histogram function to bin the data (older SciPy releases exposed the same function as a top-level alias, which has since been deprecated):

counts, bin_edges = np.histogram(data, bins=bin_edges)

The result is a tuple with two elements. The first element is an array with the number of data points in each bin, and the second is an array of the bin edges. SciPy’s stats.binned_statistic() accepts the same edge array through its bins argument when you need a statistic other than a count.
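A runnable version of these steps might look like the sketch below. The `data` array and the choice of five bins are made up for illustration, and the sketch calls NumPy's histogram function directly.

```python
import numpy as np

# Illustrative data; any 1-D numeric array works
data = np.array([0.5, 1.2, 2.7, 3.3, 3.9, 4.0, 7.5, 9.9, 10.0])
num_bins = 5

# num_bins bins need num_bins + 1 edges between the minimum and maximum
bin_edges = np.linspace(data.min(), data.max(), num=num_bins + 1)

# np.histogram returns (counts per bin, bin edges)
counts, edges = np.histogram(data, bins=bin_edges)

print("edges: ", edges)
print("counts:", counts)

# Every bin is half-open [left, right) except the last one, which also
# includes its right edge, so the maximum value is still counted
print("all points counted:", counts.sum() == data.size)
```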

Use Numpy to Bin Data in Python

Code Example:

# import the NumPy library
import numpy

# generate 100 random values in the range [0, 1)
data = numpy.random.random(100)

# define 10 evenly spaced bin edges, which gives 9 bins
bins = numpy.linspace(0, 1, 10)

# assign each value to a bin, then average the values in each bin
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]

# the exact values differ between runs because the data is random
print(bin_means)

Output:

[0.05308461260140375,
 0.16559348769870028,
 0.28950800899648155,
 0.3874228665181473,
 0.5046647094141071,
 0.6254841134474202,
 0.7216935463408317,
 0.8374773268113803,
 0.9421576008815353]
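One detail worth knowing about numpy.digitize(): with its default settings, a value equal to the last edge gets the index len(bins), which the loop above would skip. That cannot happen with numpy.random.random(), which never returns 1.0, but it can with real data. The sketch below uses made-up values to show the effect, and clipping the index is just one possible way to handle it.

```python
import numpy as np

bins = np.linspace(0, 1, 5)  # edges 0, 0.25, 0.5, 0.75, 1 -> 4 bins
data = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 1.0])  # illustrative values

# Default behaviour: bins[i-1] <= x < bins[i], so 1.0 falls past the last bin
digitized = np.digitize(data, bins)
print(digitized)

# Fold values on the last edge back into the final bin
clipped = np.clip(digitized, 1, len(bins) - 1)

# Vectorized per-bin means: sum of values per bin divided by count per bin
sums = np.bincount(clipped, weights=data, minlength=len(bins))
counts = np.bincount(clipped, minlength=len(bins))
means = sums[1:] / counts[1:]  # index 0 is unused because bin numbers start at 1
print(means)
```

If a bin could be empty, the division would produce NaN with a warning, so real code may want to mask bins whose count is zero.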

Use Pandas to Bin Data in Python

Code Example:

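# import libraries
import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100), "b": np.random.random(100) + 10})

# bin the data frame by column "a" using 10 edges (9 bins)
# note: pandas.cut uses right-closed intervals by default, so the row holding
# the minimum of "a" falls outside the first bin unless include_lowest=True is passed
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))

# print the mean of "b" within each bin of "a"
print(groups.mean().b)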

Output:

a
(0.00762, 0.117]    10.576639
(0.117, 0.226]      10.319629
(0.226, 0.335]      10.633805
(0.335, 0.444]      10.404979
(0.444, 0.553]      10.551616
(0.553, 0.662]      10.420306
(0.662, 0.771]      10.434091
(0.771, 0.88]       10.402038
(0.88, 0.989]       10.537547
Name: b, dtype: float64
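Notice that the first interval in the output is open on the left. By default pandas.cut() builds right-closed intervals, so a value equal to the lowest edge is left unbinned. The sketch below, on a small made-up data frame, shows the effect and how the include_lowest parameter changes it. The column values and edges are illustrative only.

```python
import pandas as pd

# Small illustrative frame; 0.0 sits exactly on the lowest edge
df = pd.DataFrame(
    {"a": [0.0, 0.1, 0.4, 0.5, 0.9, 1.0], "b": [10, 11, 12, 13, 14, 15]}
)
edges = [0.0, 0.5, 1.0]

# Default: intervals are (0.0, 0.5] and (0.5, 1.0], so 0.0 gets no bin
default_bins = pd.cut(df["a"], bins=edges)
n_missing_default = default_bins.isna().sum()

# include_lowest=True makes the first interval include its left edge
inclusive_bins = pd.cut(df["a"], bins=edges, include_lowest=True)
n_missing_inclusive = inclusive_bins.isna().sum()

# Mean of "b" per bin; observed=False keeps empty bins in the result
means = df.groupby(inclusive_bins, observed=False)["b"].mean()

print("unbinned rows (default):       ", n_missing_default)
print("unbinned rows (include_lowest):", n_missing_inclusive)
print(means)
```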

Use SciPy to Bin Data in Python

Code Example:

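# import libraries
import numpy as np
from scipy import stats

# x coordinates that decide which bin each point falls into
arr = [20, 2, 7, 1, 34]
print("\narr : \n", arr)

# split the range of arr into 4 equal-width bins and take the median per bin
# the values being summarized are np.arange(5), i.e. the position of each point
print(
    "\nbinned_statistic for median : \n",
    stats.binned_statistic(arr, np.arange(5), statistic="median", bins=4),
)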

Output:

arr :
 [20, 2, 7, 1, 34]

binned_statistic for median :
 BinnedStatisticResult(statistic=array([ 2., nan,  0.,  4.]), bin_edges=array([ 1.  ,  9.25, 17.5 , 25.75, 34.  ]), binnumber=array([3, 1, 1, 1, 4], dtype=int64))
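The result is easier to read when the two inputs play clearly different roles: the first argument decides which bin a point belongs to, and the second holds the values that get summarized. The sketch below uses made-up `x` and `values` arrays and explicit bin edges to show this, and cross-checks one bin by hand.

```python
import numpy as np
from scipy import stats

# Illustrative data: x decides the bin, values are what gets averaged
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 9.0])
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# Two bins with explicit edges: [0, 5) and [5, 10]
result = stats.binned_statistic(x, values, statistic="mean", bins=[0, 5, 10])

print(result.statistic)  # mean of values per bin
print(result.bin_edges)  # the edges that were used
print(result.binnumber)  # 1-based bin index of each point

# Cross-check the first bin by selecting its points manually
manual_first = values[result.binnumber == 1].mean()
print(manual_first)
```

A bin with no points yields nan for most statistics, which is why the median output above contains a nan entry for the empty second bin.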

Zeeshan is a detail-oriented software engineer who helps companies and individuals make their lives easier with software solutions.
