- Need for Cross-Validation in Python
- K-Fold Cross-Validation in Python
- sklearn.model_selection.KFold Class to Implement K-Fold in Python
Python offers a lot of machine learning algorithms. Supervised and unsupervised learning models are trained and tested on a dataset before the algorithm is used on real-time data.
Need for Cross-Validation in Python
The traditional approach to training and testing a model involves splitting the dataset into the train and test datasets. Usually, this ratio of the train to test is 70:30.
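As a reference point, a 70:30 split can be performed with sklearn's train_test_split helper; the toy arrays below are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 10 samples with 2 features each (illustrative values)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(len(X_train), len(X_test))  # 7 3
```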
There are some drawbacks with directly splitting the dataset.
The main drawback is related to overfitting. To get the best performance from a model, we repeatedly tweak the hyperparameters based on the model's performance on the test data.
However, each of these adjustments leaks knowledge about the test set into the model, increasing the risk of overfitting the test data.
People started using training, test, and validation datasets to counter this.
We tune the hyperparameters using the training and validation datasets.
However, this significantly reduces the number of samples available for learning the model.
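One way to sketch this three-way split is with two calls to train_test_split; the split ratios here are only an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # 20 illustrative samples
y = np.arange(20)

# First carve off 20% of the samples as the final test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ...then split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 12 4 4
```

Note how the training set shrinks: only 12 of the 20 samples are left for actually fitting the model.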
In recent times, people started using the K-Fold cross-validation technique to counter the earlier problems. This is an essential technique as it can help tune the model to choose the model with the best performance.
K-Fold Cross-Validation in Python
The data is split without shuffling into K consecutive folds. Now, every fold is used once for validation, while the remaining folds form the training set (K - 1).
In short, one part of the training set is for validation.
Let us discuss this in detail.
We divide the dataset into two - training and testing, and out of this, the training dataset further divides into K-Folds. One part from this is for validating, while the rest are for training.
The hyperparameters and performance of this model are noted. These steps are repeated until each fold has been used once for validation.
The mean and standard deviation of the model's performance across the folds are noted. This is repeated for different hyperparameter values, and the best-performing model is selected.
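As a sketch of this selection loop, cross_val_score can evaluate a model once per fold for each candidate hyperparameter value, and we keep the value with the best mean score. The Ridge regressor, the synthetic data, and the candidate alpha values below are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative)
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

k_fold = KFold(n_splits=5)
results = {}
for alpha in [0.01, 1.0, 100.0]:  # candidate hyperparameter values
    # One R^2 score per fold; we track the mean across folds
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=k_fold)
    results[alpha] = scores.mean()
    print(f"alpha={alpha}: mean={scores.mean():.4f} std={scores.std():.4f}")

best_alpha = max(results, key=results.get)
print("best alpha:", best_alpha)
```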
sklearn.model_selection.KFold Class to Implement K-Fold in Python
We can use the sklearn module to implement different machine learning algorithms and techniques in Python. The model_selection.KFold class can implement the K-Fold cross-validation technique in Python.
In the KFold class, we specify the number of folds with the n_splits parameter, which is 5 by default.
We can also provide the shuffle parameter, which determines whether to shuffle the data before splitting. It is False by default.
The random_state parameter also controls each fold's randomness. For this, shuffle needs to be set to True.
We’ll use an instance of this class with a simple NumPy array. We will pass the array to the split() method, which returns the train and test indices for the dataset.
```python
from sklearn.model_selection import KFold
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120])
k_fold = KFold(n_splits=3)
for indices_train, indices_test in k_fold.split(x):
    print(indices_train, indices_test)
```
```
[ 4  5  6  7  8  9 10 11] [0 1 2 3]
[ 0  1  2  3  8  9 10 11] [4 5 6 7]
[0 1 2 3 4 5 6 7] [ 8  9 10 11]
```
In the above example, we set the number of folds to 3.
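To illustrate the shuffle and random_state parameters discussed earlier, the same array can be split with shuffling enabled; the indices in each fold are then no longer consecutive (the random_state value below is arbitrary):

```python
from sklearn.model_selection import KFold
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120])
# shuffle=True randomizes the sample order before splitting;
# random_state makes the shuffled folds reproducible
k_fold = KFold(n_splits=3, shuffle=True, random_state=42)
all_test = []
for indices_train, indices_test in k_fold.split(x):
    print(indices_train, indices_test)
    all_test.extend(indices_test)
```

Every index still appears in exactly one test fold; only the grouping changes.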
When working with large datasets, we usually set the value of K to 5. The value of K tends to increase as the dataset gets smaller.
One should note that before sklearn 0.20, the KFold class was a part of the sklearn.cross_validation module. Other techniques are also available for cross-validating data in Python.
For small datasets, we tend to use the LOOCV (Leave-One-Out Cross-Validation) technique. Improved versions of K-Fold, such as StratifiedKFold and RepeatedKFold, are also available, and ShuffleSplit is another common technique used for cross-validation.
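A quick sketch of two of these alternatives, LeaveOneOut and ShuffleSplit, on a tiny illustrative array:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, ShuffleSplit

x = np.arange(6)  # 6 illustrative samples

# LOOCV: one split per sample, each sample is the test set exactly once
loo = LeaveOneOut()
print(loo.get_n_splits(x))  # 6

# ShuffleSplit: a fixed number of independent random train/test splits
shuffle_split = ShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
for train_idx, test_idx in shuffle_split.split(x):
    print(train_idx, test_idx)
```

Unlike K-Fold, ShuffleSplit draws each split independently, so a sample may appear in several test sets or in none.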