How to Find K-Nearest Neighbors in MATLAB

Ammar Ali Feb 02, 2024

This tutorial will discuss finding the k-nearest neighbors using the knnsearch() function in MATLAB.

Find K-Nearest Neighbors Using knnsearch() in MATLAB

KNN, also known as k-nearest neighbors, is an algorithm that finds the k points in a data set closest to a given query point, and it is widely used for classification. For example, suppose we have a data set containing the data of hospital patients, and we want to find the patients most similar to a person whose age and weight we can only guess.

We can pass the age and weight of all the patients present in the hospital, together with our guessed age and weight of the unknown person, to the KNN algorithm, and it will return the patients whose data is closest to that person. We can use the knnsearch() function of MATLAB to do this task.

We pass the age and weight of the known patients as the first argument of the knnsearch() function and the age and weight of the unknown person as the second argument. The function returns the index (i.e., the row number) of the entry in the data set closest to our unknown person.

For example, let’s use the hospital data set that ships with MATLAB and search for an unknown person by age and weight. See the code below.

clc

load hospital;                             % built-in sample data set of 100 patients
X_data = [hospital.Age hospital.Weight];   % age and weight of the known patients
Y_data = [30 162];                         % guessed age and weight of the unknown person
Ind = knnsearch(X_data,Y_data);            % row index of the nearest patient
hospital(Ind,:)                            % display that patient's record

Output:

ans =

               LastName             Sex     Age    Weight    Smoker    BloodPressure      Trials
    HLE-603    {'HERNANDEZ'}        Male    36     166       false     120          83    {1×2 double}

In the above code, the hospital data set contains the name, gender, age, weight, blood pressure, and the smoking information of 100 patients. To see the content of the data set, we can open it by double-clicking it inside the workspace window.

In this example, we only used the age and weight parameters because we only know this information about the unknown person, but we can also use other parameters.

The KNN search returned only one nearest neighbor above, but we can request more neighbors by passing the desired number in the K argument.

We can also set the method used to search for the nearest neighbors using the NSMethod argument, which accepts either kdtree or exhaustive.

We can also change the metric used to measure the distance between points, which is set to euclidean by default, using the Distance argument, passing a metric name like seuclidean, cosine, cityblock, or chebychev.

By default, the kd-tree search stores up to 50 data points in each leaf node, but we can change this using the BucketSize argument and pass a different number of points. The kd-tree partitions the given data into buckets, and if we increase the bucket size, there will be fewer leaf nodes, each holding more points; this argument only takes effect when NSMethod is kdtree.
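A rough sketch of changing the bucket size, reusing the data from the first example (the value 100 is arbitrary):

clc

load hospital;
X_data = [hospital.Age hospital.Weight];
Y_data = [30 162];
Ind = knnsearch(X_data,Y_data,'NSMethod','kdtree','BucketSize',100);   % larger leaf nodes, fewer of them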

The indices returned by the knnsearch() function are sorted by default, but we can get them in their original, unsorted order by passing false to the SortIndices argument.

For example, let’s change the properties discussed above and see the result. See the code below.

clc

load hospital;
X_data = [hospital.Age hospital.Weight hospital.Smoker];   % add the Smoker flag as a third feature
Y_data = [30 162 true];                                    % we assume the unknown person is a smoker
Ind = knnsearch(X_data,Y_data,'K',2,'NSMethod','exhaustive','Distance','chebychev','SortIndices',false);
hospital(Ind,:)

Output:

ans =

               LastName             Sex
    HLE-603    {'HERNANDEZ'}        Male
    VRH-620    {'MITCHELL' }        Male


               Age    Weight    Smoker
    HLE-603    36     166       false
    VRH-620    39     164       true


               BloodPressure
    HLE-603    120          83
    VRH-620    128          92


               Trials
    HLE-603    {1×2 double}
    VRH-620    {1×0 double}

In the above code, we included another parameter, Smoker, from the data set, assuming we also know that the unknown person is a smoker. We can see in the output that there are now two patients whose data is close to the unknown person’s data.

In the above examples, we only searched for the nearest neighbors of one person, but we can find the nearest neighbors of multiple persons as well by passing one query point per row. Note that the properties discussed above might change the result depending on the data set.
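For example, here is a minimal sketch of searching for two unknown persons at once; the second person’s age and weight are made-up values for illustration:

clc

load hospital;
X_data = [hospital.Age hospital.Weight];
Y_data = [30 162; 45 170];        % one query point per row (second row assumed for illustration)
Ind = knnsearch(X_data,Y_data);   % one index per query point
hospital(Ind,:)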

The knnsearch() function finds the k nearest points, but if we want to find all the points within a specific distance of the given point, we can use the rangesearch() function in MATLAB. Check this link for more details about the rangesearch() function.
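As a minimal sketch, assuming an arbitrary search radius of 5, rangesearch() returns a cell array holding the indices of all neighbors within that distance of each query point:

clc

load hospital;
X_data = [hospital.Age hospital.Weight];
Y_data = [30 162];
Idx = rangesearch(X_data,Y_data,5);   % all neighbors within distance 5 of the query
hospital(Idx{1},:)                    % records for the first (and only) query point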

The problem with using the knnsearch() function is that it searches the entire data set for every query, which can take a long time on a large data set depending on the machine where the code is running. But in machine learning, we want our code to be really fast, so we split the process into training and testing.

In the training process, we train a model on the given data set, which takes some time. We save the trained model, and whenever we want to predict the output for a new input, the pre-trained model can produce the result in seconds.

To use the KNN classifier this way, we can train a model with the fitcknn() function and then use the predict() function to predict the output for new input.

For example, let’s use the Fisher iris data set to train a model using the KNN classifier and then use the predict() function to predict the flower class. See the code below.

clc

load fisheriris
X_data = meas;                  % sepal and petal measurements, one row per flower
Y_data = species;               % class name of each flower
MyModel = fitcknn(X_data,Y_data,'NumNeighbors',6,'Standardize',1)
X_new = [5.9 3 4.2 1.5];        % measurements of a new flower (one row, four columns)
class_name = predict(MyModel,X_new)

Output:

MyModel =

  ClassificationKNN
             ResponseName: 'Y'
    CategoricalPredictors: []
               ClassNames: {'setosa'  'versicolor'  'virginica'}
           ScoreTransform: 'none'
          NumObservations: 150
                 Distance: 'euclidean'
             NumNeighbors: 6


  Properties, Methods


class_name =

  1×1 cell array

    {'versicolor'}

In the above code, X_data contains the sepal and petal measurements (four per flower) of 150 irises, and Y_data holds the corresponding iris class name for each of the 150 flowers. As we can see in the output, the model has three class names and 150 observations in total (50 per class), the metric used to find the distance is euclidean, and the number of neighbors is 6.

We used the predict() function to predict the class name of a new observation, which must be a row vector with the same four measurements the model was trained on. We can also classify multiple observations at once by passing a matrix with one observation per row.
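A minimal sketch of classifying two new flowers at once; the measurement values are assumed for illustration:

X_new = [5.9 3 4.2 1.5; 6.5 3 5.2 2];   % one observation per row, four measurements each (assumed values)
class_names = predict(MyModel,X_new)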

We can also change the nearest neighbor search method, the distance metric, and the bucket size in the same way we changed them in the case of the knnsearch() function.

We can also get two other outputs from the predict() function: the prediction scores and the expected misclassification cost for each class. We can also save the model that we have trained using the save command and load it back anytime using the load command in MATLAB.
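A sketch of requesting all three outputs, reusing MyModel and X_new from the example above:

[class_name,score,cost] = predict(MyModel,X_new);   % score: per-class score, cost: per-class expected cost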

The save command creates a .mat file containing the trained model inside the current directory of MATLAB. If we want to load the model back, the .mat file must be present in the directory MATLAB is currently using.

The basic syntax for the save and load commands is below.

save model_name
load model_name
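For instance, here is a sketch of saving only the trained model under an arbitrary file name, MyKNNModel, and loading it back later:

save MyKNNModel MyModel   % creates MyKNNModel.mat in the current directory
clear MyModel             % remove the model from the workspace
load MyKNNModel           % restores MyModel from MyKNNModel.mat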

The first input argument of the fitcknn() function is a table or numeric matrix containing the observations, and the second argument contains the class labels that we want to predict. The labels can be a categorical, string, logical, or numeric array, a cell array of character vectors, or a character array.

Check this link for more details about the fitcknn() function.

Author: Ammar Ali

Hello! I am Ammar Ali, a programmer here to learn from experience, people, and docs, and create interesting and useful programming content. I mostly create content about Python, Matlab, and Microcontrollers like Arduino and PIC.

