This tutorial explores using a scatter matrix in Pandas to create pair plots.

Scatter Matrix in Pandas

During data preprocessing, it is important to check for correlation among the independent variables used in a regression analysis. Scatter plots make the correlation between features easy to see.

Pandas provides the scatter_matrix() function to produce these plots. The plots also show whether a correlation is positive or negative.

For a dataset with n variables, this function draws a grid of n rows and n columns, that is, an n x n matrix of plots with one panel for each pair of variables.

Producing the scatter plots takes three simple steps, given below.

Load the necessary libraries.

Import the data into a DataFrame.

Use the scatter_matrix method to plot the graph.

Syntax:

pandas.plotting.scatter_matrix(dataframe)

This tutorial will teach us how to efficiently use scatter_matrix() as an analyst.
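The three steps above can be sketched as follows. This is a minimal example that assumes dummy data: the variables x1, x2, and x3 are generated randomly here for illustration, and the seed, sample size, and noise levels are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a plot window
import numpy as np
import pandas as pd

# Step 2: build illustrative dummy data (assumed values, not a real dataset).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 2 * x1 + rng.normal(scale=0.5, size=n)  # constructed to rise with x1
x3 = -x1 + rng.normal(scale=0.5, size=n)     # constructed to fall as x1 rises
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Step 3: draw the scatter matrix; the call returns an array of Axes objects.
axes = pd.plotting.scatter_matrix(df, figsize=(6, 6))
print(axes.shape)  # (3, 3): one panel per pair of variables
```

With three columns, the result is a 3 x 3 grid, with histograms on the diagonal by default.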

These plots take very little code to produce. But what makes them so interesting?

The panels on the diagonal portray the distribution of each variable (x1, x2, and x3 in our dummy data).

The off-diagonal panels show the correlation between each pair of variables.
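What the plots suggest visually can be cross-checked numerically. The sketch below uses the DataFrame.corr() method on randomly generated dummy data (an assumption for illustration), where x2 is built to rise with x1 and x3 to fall as x1 rises.

```python
import numpy as np
import pandas as pd

# Illustrative dummy data (assumed, not a real dataset).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.5, size=200),  # positive relationship with x1
    "x3": -x1 + rng.normal(scale=0.5, size=200),     # negative relationship with x1
})

# Pearson correlation coefficients for every pair of columns.
corr = df.corr()
print(corr.round(2))
```

A positive coefficient corresponds to an upward-sloping cloud of points in the scatter matrix, and a negative coefficient to a downward-sloping one.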

Use the scatter_matrix() Method With hist_kwds Parameter in Pandas

The next example uses the hist_kwds parameter. It accepts a Python dictionary of keyword arguments for the histograms on the diagonal, which lets us change the number of bins.

# Changing the number of bins of the scatter matrix in Python:
pd.plotting.scatter_matrix(df, hist_kwds={'bins': 30})
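A self-contained version of this call might look like the sketch below. The dummy data is an assumption for illustration. Counting the bars drawn in a diagonal panel is one way to confirm that the bins setting took effect.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a plot window
import numpy as np
import pandas as pd

# Illustrative dummy data (assumed, not a real dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])

# hist_kwds is forwarded to the histogram drawn in each diagonal panel.
axes = pd.plotting.scatter_matrix(df, hist_kwds={"bins": 30})

# Each histogram bar is a rectangle patch on the diagonal Axes.
n_bars = len(axes[0, 0].patches)
print(n_bars)
```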


Use the scatter_matrix() Method With diagonal = 'kde' Parameter in Pandas

In the final example, we will replace the histograms with KDE plots.

KDE stands for Kernel Density Estimation. It is a fundamental data-smoothing technique that estimates the underlying distribution of a variable, so that inferences can be made from a finite data sample.

Producing a scatter matrix with KDE plots is as easy as producing one with histograms. To do this, we just need to replace hist_kwds with diagonal='kde'.

The diagonal parameter accepts one of two values, 'hist' or 'kde', and only one of them can be used in a given call. The default is 'hist'.

The changes in the code to get kde are as follows.

# Scatter matrix with Pandas and density plots:
pd.plotting.scatter_matrix(df, diagonal='kde')
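For a runnable version, the sketch below again assumes randomly generated dummy data. Note that Pandas relies on SciPy to compute the density estimate, so SciPy needs to be installed for diagonal='kde' to work.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a plot window
import numpy as np
import pandas as pd

# Illustrative dummy data (assumed, not a real dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])

# diagonal='kde' draws a smooth density curve in each diagonal panel
# instead of a histogram (requires SciPy).
axes = pd.plotting.scatter_matrix(df, diagonal="kde")

# The density curve is drawn as a line, not as bars.
n_lines = len(axes[0, 0].get_lines())
n_bars = len(axes[0, 0].patches)
print(n_lines, n_bars)
```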

We can also plot charts on readily available data instead of using dummy data.

We only need to load the CSV file with the Pandas read_csv method.

csv_file = 'URL for the dataset'
# Reading the CSV file from the URL:
df_s = pd.read_csv(csv_file, index_col=0)
# Checking the data quickly (first 5 rows):
df_s.head()
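Because the dataset URL above is a placeholder, the sketch below substitutes a small inline CSV so that it runs without network access. The column names and values are hypothetical and only stand in for a real file. With a real dataset, the URL or file path would be passed to read_csv directly.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a plot window
import pandas as pd

# Hypothetical stand-in for a real CSV file (made-up columns and values).
csv_text = """id,height,weight,age
1,1.70,65,30
2,1.82,80,41
3,1.65,58,25
4,1.75,72,36
5,1.90,95,52
6,1.60,54,22
"""

# index_col=0 uses the first column ("id") as the row index.
df_s = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_s.head())

# The loaded DataFrame can be passed straight to scatter_matrix().
axes = pd.plotting.scatter_matrix(df_s)
```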

As an alternative to scatter_matrix() in Pandas, one can also use the pairplot() function from the seaborn package.

An in-depth understanding of these modules makes it easier to produce such scatter plots, and it also helps in creating more readable and attractive visualizations.

Preet writes his thoughts about programming in a simplified manner to help others learn better. With thorough research, his articles offer descriptive and easy to understand solutions.