Use Cross-Validation for a KNN Classification Model in R

Jesse John June 21, 2023


Different Approaches to Cross-Validation
Repeated K-Fold Cross-Validation for a K-Nearest Neighbors Classification Model

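Cross-validation lets us estimate how a model will perform on new data even when we only have a training dataset. It is a general technique that applies to both regression and classification models.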

This article shows how to perform repeated k-fold cross-validation for a K-nearest neighbors (KNN) classification model. We will use the caret package for this purpose.

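The K in KNN is the number of nearest neighbors consulted for each observation. The k in k-fold, by contrast, is the number of subsets into which the training data is split.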

Different Approaches to Cross-Validation

There are several approaches to cross-validation.

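The most basic version holds out a single subset of the training data to validate the model. This is called the validation set approach. The model is fitted only once, on the remaining data, and tested on the held-out subset.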
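As a rough illustration of the validation set approach, the sketch below makes one random split, fits once, and tests once. The toy data frame, the 25% hold-out fraction, and the use of knn() from the class package are assumptions made for this example; they are not part of the article's own workflow.

```r
# Validation set approach: one random split, one fit, one test.
library(class)  # provides knn()

set.seed(1)
# Hypothetical toy data: two well-separated groups of 50 points each.
toy <- data.frame(
  X1 = c(rnorm(50, 2), rnorm(50, 8)),
  X2 = c(rnorm(50, 8), rnorm(50, 2)),
  Y  = factor(rep(c("Blue", "Red"), each = 50))
)

# Hold out 25% of the rows as the validation set.
val_idx <- sample(nrow(toy), size = 0.25 * nrow(toy))

# Classify the held-out rows using the remaining rows as training data.
pred <- knn(train = toy[-val_idx, c("X1", "X2")],
            test  = toy[val_idx, c("X1", "X2")],
            cl    = toy$Y[-val_idx],
            k     = 5)

# Proportion of validation rows classified correctly.
val_accuracy <- mean(as.character(pred) == as.character(toy$Y[val_idx]))
val_accuracy
```

Because the estimate rests on a single split, it can change noticeably with a different seed.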

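Another approach fits as many models as there are observations and averages their error rates. Each model is fitted with one observation left out and is then tested on that one observation.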

This is called Leave One Out Cross Validation (LOOCV).
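LOOCV can be sketched with an explicit loop, as below. The toy data and the choice of K = 5 are assumptions for illustration only. The class package also offers knn.cv(), which performs leave-one-out classification directly, but the loop makes the procedure visible.

```r
# Leave-one-out cross-validation written out as a loop.
library(class)  # provides knn()

set.seed(2)
# Hypothetical toy data: two groups of 30 points each.
toy <- data.frame(
  X1 = c(rnorm(30, 2), rnorm(30, 8)),
  X2 = c(rnorm(30, 8), rnorm(30, 2)),
  Y  = factor(rep(c("Blue", "Red"), each = 30))
)

n <- nrow(toy)
correct <- logical(n)

for (i in seq_len(n)) {
  # Fit on every row except row i, then predict row i.
  pred_i <- knn(train = toy[-i, c("X1", "X2")],
                test  = toy[i, c("X1", "X2")],
                cl    = toy$Y[-i],
                k     = 5)
  correct[i] <- as.character(pred_i) == as.character(toy$Y[i])
}

# Average error rate over the n single-observation tests.
loocv_error <- 1 - mean(correct)
loocv_error
```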

The most useful approach is to:

split the training dataset into k folds (groups),
fit the model k times,
leave out a different fold each time, and
test the model on the fold that was left out.

This is called k-fold cross-validation. Values of k of 5 or 10 usually give good results.
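The steps above can be sketched in a few lines of R. This is a minimal hand-rolled version under assumed toy data. It assigns folds purely at random, whereas caret stratifies its folds by class, so the details differ from what train() does internally.

```r
# Plain k-fold cross-validation for a KNN classifier.
library(class)  # provides knn()

set.seed(3)
# Hypothetical toy data: two groups of 50 points each.
toy <- data.frame(
  X1 = c(rnorm(50, 2), rnorm(50, 8)),
  X2 = c(rnorm(50, 8), rnorm(50, 2)),
  Y  = factor(rep(c("Blue", "Red"), each = 50))
)

k_folds <- 10
n <- nrow(toy)

# Randomly assign each row to one of the k folds (equal sizes here).
fold_id <- sample(rep(seq_len(k_folds), length.out = n))

fold_acc <- numeric(k_folds)
for (f in seq_len(k_folds)) {
  held_out <- which(fold_id == f)
  # Fit on the other k - 1 folds and test on the held-out fold.
  pred <- knn(train = toy[-held_out, c("X1", "X2")],
              test  = toy[held_out, c("X1", "X2")],
              cl    = toy$Y[-held_out],
              k     = 5)
  fold_acc[f] <- mean(as.character(pred) == as.character(toy$Y[held_out]))
}

# The cross-validated accuracy is the mean over the k folds.
cv_accuracy <- mean(fold_acc)
cv_accuracy
```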

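An enhancement of k-fold cross-validation is to repeat the whole procedure several times, each time with a different split of the data into folds. This is called repeated k-fold cross-validation, and it is the approach we will use.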
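One way to picture the repetition is to wrap a single k-fold pass in a function and call it several times, so that each call draws a fresh split. The helper name kfold_accuracy() and the toy data are hypothetical and exist only for this sketch.

```r
# Repeated k-fold cross-validation: several k-fold passes, each with a new split.
library(class)  # provides knn()

# Hypothetical helper: mean accuracy from one k-fold pass with a fresh random split.
kfold_accuracy <- function(data, k_folds = 10, k_neighbors = 5) {
  fold_id <- sample(rep(seq_len(k_folds), length.out = nrow(data)))
  acc <- sapply(seq_len(k_folds), function(f) {
    held_out <- fold_id == f
    pred <- knn(train = data[!held_out, c("X1", "X2")],
                test  = data[held_out, c("X1", "X2")],
                cl    = data$Y[!held_out],
                k     = k_neighbors)
    mean(as.character(pred) == as.character(data$Y[held_out]))
  })
  mean(acc)
}

set.seed(4)
# Hypothetical toy data: two groups of 50 points each.
toy <- data.frame(
  X1 = c(rnorm(50, 2), rnorm(50, 8)),
  X2 = c(rnorm(50, 8), rnorm(50, 2)),
  Y  = factor(rep(c("Blue", "Red"), each = 50))
)

# Three repeats of 10-fold cross-validation, then the overall average.
rep_acc <- replicate(3, kfold_accuracy(toy))
mean(rep_acc)
```

Averaging over repeats reduces the dependence of the estimate on any one random split.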

Repeated K-Fold Cross-Validation for a K-Nearest Neighbors Classification Model

We will now see how to fit a K-nearest neighbors (KNN) classification model using repeated k-fold cross-validation. We will use the caret package.

The caret package is versatile and can be used to build many types of models. For details, see its documentation on CRAN.

Usually, k-fold cross-validation only tells us how accurate a model can be expected to be on new data.

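For example, suppose we fit a KNN model with K = 5 using repeated k-fold cross-validation with 10 folds repeated 3 times. The model is fitted 10 times for each of the 3 different splits of the data, and we get a performance metric for just one model (the one with K = 5 neighbors).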

Beyond this, the caret package lets us fit KNN models for a range of values of K. The train() function then reports the value of K that gives the best model and builds that model.

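The createDataPartition() function creates a stratified random split of a factor vector. We will use it to separate the data into training and test subsets so that we can validate the model's accuracy.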
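To see what stratification means in practice, the following sketch splits a deliberately unbalanced factor and compares class proportions before and after. The 60/40 outcome vector is an assumption made for this example; the article's own data is balanced.

```r
library(caret)

set.seed(5)
# Hypothetical, deliberately unbalanced outcome: 60 "Blue" and 40 "Red".
y <- factor(rep(c("Blue", "Red"), times = c(60, 40)))

# list = FALSE returns a one-column matrix of training row numbers;
# as.vector() turns it into a plain index vector.
train_idx <- as.vector(createDataPartition(y, p = 0.75, list = FALSE))

# Class proportions in each subset should be close to the original 60/40.
prop.table(table(y))
prop.table(table(y[train_idx]))
prop.table(table(y[-train_idx]))
```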

The train() function is the main function that builds the model.

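x is a data frame containing the predictors.
y is the vector of outcomes (a factor for classification).
method takes the type of model to build; we specify knn.
preProcess takes the pre-processing steps; we specify center and scale.
trControl lets us specify the details of the cross-validation procedure.
tuneGrid helps create and compare multiple models. It takes a data frame whose column names are the tuning parameters.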

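Since we are building a KNN model, we give tuneGrid a column named k (lowercase), which is the tuning parameter. We supply a vector of K values from 1 to 12, and the function fits and tests a model for each of them.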
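The tuning grid itself is just a data frame, as the short sketch below shows. The modelLookup() call lists the tuning parameters caret exposes for a method. The odd-values grid is only an illustration of an alternative and is not used in the article.

```r
library(caret)

# List the tuning parameters caret exposes for the "knn" method.
modelLookup("knn")

# One candidate model per row; the column name must match the parameter name.
knn_grid <- data.frame(k = 1:12)

# Alternative (illustration only): odd values of K, which avoid tied
# votes when there are exactly two classes.
knn_grid_odd <- data.frame(k = seq(1, 11, by = 2))
knn_grid_odd
```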

The cross-validation details are passed to the trControl argument using the trainControl() function.

method is set to repeatedcv because we want repeated cross-validation.
number specifies the number of folds, k, when method is cv or repeatedcv; we use 10.
repeats specifies how many times the k-fold procedure is repeated; we use 3.
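The control object can be built and inspected on its own before it is handed to train(). The sketch below also works out how many resamples these settings imply. The variable names are chosen for this example, and the count of 12 candidates assumes the K values from 1 to 12 described above.

```r
library(caret)

# Repeated 10-fold cross-validation, repeated 3 times.
cv_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# The settings are stored as elements of the returned list.
cv_ctrl$method
cv_ctrl$number
cv_ctrl$repeats

# Each candidate K is evaluated on number * repeats resamples.
n_resamples <- cv_ctrl$number * cv_ctrl$repeats

# With 12 candidate values of K, that is this many resampled fits in total.
n_fits <- n_resamples * length(1:12)
n_fits
```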

Example code:

# Create the data vectors for the demonstration.
# We will create two numeric vectors as predictors.
# Each vector will have two distinct groups to suit our model.
# We will create a factor with two levels.
# The factor levels correspond to the groups in the predictor vectors.

set.seed(564)
vX1a = round(rnorm(100, 2,2))+4
set.seed(574)
vX2a = round(rnorm(100, 15,4))

set.seed(584)
vX1b = round(rnorm(100, 10,3))+5
set.seed(594)
vX2b = round(rnorm(100, 5,4))

vYa = rep("Blue", 100)
vYb = rep("Red", 100)

vX1 = c(vX1a, vX1b)
vX2 = c(vX2a, vX2b)
vY = c(vYa, vYb)

# Dummy column for ordering rows.
set.seed(528)
R = sample(1:200,200)

# Temporary data frame.
temp_df = data.frame(X1 = vX1, X2 = vX2, Y = as.factor(vY), R)

# Packages that we will use.
library(ggplot2)
library(dplyr)

# See the sample data.
temp_df %>% ggplot(aes(x=X1, y = X2, colour = Y)) + geom_point()

# Re-order the rows, just to see that the KNN model works with the rows jumbled up.
# Final data frame.
# Notice that the outputs are a factor vector.
fin_df = temp_df %>% arrange(R) %>% select(X1, X2, Y)
head(fin_df)
str(fin_df)

# Install the caret package if it is not already installed.
# To install, uncomment the next line and run it.
# install.packages("caret")

# Load the caret package.
library(caret)

# Split the data frame into a training set and test set.
# Create a list of row numbers in the training set.
# This function creates a stratified random sample of all the outcome classes.
set.seed(365)
training_row_index = createDataPartition(fin_df[,3], p=0.75, list=FALSE)

# Create training sets of the predictors and the corresponding outcomes.
trg_data = fin_df[training_row_index,1:2]
trg_class = fin_df[training_row_index,3]

# Create the test set of predictors and the outcomes that we will later use.
tst_data = fin_df[-training_row_index,1:2]
tst_class = fin_df[-training_row_index,3]

# Let us check if the sample is stratified:
table(tst_class)
# The training set holds the remaining observations of each class.

# We will build a K-Nearest neighbors model using repeated k-fold cross-validation.
# The arguments are described in the article.
mod_knn = train(x = trg_data,
                y = trg_class,
                method = "knn",
                preProcess = c("center", "scale"),
                tuneGrid = data.frame(k = c(1:12)),
                trControl = trainControl(method = "repeatedcv",
                                         number = 10,
                                         repeats = 3)
                )

# View the fitted model.
mod_knn

Output:

> head(fin_df)
  X1 X2    Y
1 15  6  Red
2 15  2  Red
3 14  3  Red
4  4 22 Blue
5 20 -3  Red
6  4 22 Blue

> str(fin_df)
'data.frame':	200 obs. of  3 variables:
 $ X1: num  15 15 14 4 20 4 15 2 20 13 ...
 $ X2: num  6 2 3 22 -3 22 7 16 9 -6 ...
 $ Y : Factor w/ 2 levels "Blue","Red": 2 2 2 1 2 1 2 1 2 2 ...

> # Let us check if the sample is really stratified:
> table(tst_class)
tst_class
Blue  Red
  25   25

> # View the fitted model.
> mod_knn
k-Nearest Neighbors

150 samples
  2 predictor
  2 classes: 'Blue', 'Red'

Pre-processing: centered (2), scaled (2)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 135, 136, 135, 135, 135, 134, ...
Resampling results across tuning parameters:

  k   Accuracy   Kappa
   1  0.9710317  0.9420024
   2  0.9736905  0.9473602
   3  0.9753373  0.9505141
   4  0.9842460  0.9683719
   5  0.9864683  0.9728764
   6  0.9843849  0.9687098
   7  0.9843849  0.9687098
   8  0.9800794  0.9600386
   9  0.9800794  0.9600386
  10  0.9800794  0.9600386
  11  0.9800794  0.9600386
  12  0.9800794  0.9600386

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 5.

We find that the best model uses K = 5.
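The selected value and the per-K results can also be read from the fitted object instead of the printed summary. The sketch below refits a smaller model on assumed toy data so that it runs on its own. With the article's mod_knn, the same bestTune and results elements apply.

```r
library(caret)

set.seed(6)
# Hypothetical toy data: two groups of 50 points each.
toy <- data.frame(
  X1 = c(rnorm(50, 2), rnorm(50, 8)),
  X2 = c(rnorm(50, 8), rnorm(50, 2)),
  Y  = factor(rep(c("Blue", "Red"), each = 50))
)

# A smaller search than in the article, to keep the example quick.
fit <- train(x = toy[, c("X1", "X2")],
             y = toy$Y,
             method = "knn",
             preProcess = c("center", "scale"),
             tuneGrid = data.frame(k = 1:5),
             trControl = trainControl(method = "repeatedcv",
                                      number = 5,
                                      repeats = 2))

# One-row data frame holding the selected value of k.
fit$bestTune

# One row per candidate k, with the resampled Accuracy and Kappa.
fit$results
```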

Let us check the model's accuracy on the test data using the predict() function and a confusion matrix created with the base R table() function.

Example code:

# Use model to predict classes for the test set.
pred_cls = predict(mod_knn, tst_data)

# Check the accuracy of the predictions by computing the confusion matrix.
table(Actl = tst_class, Pred = pred_cls)

Output:

> table(Actl = tst_class, Pred = pred_cls)
      Pred
Actl   Blue Red
  Blue   25   0
  Red     0  25

We see that the model predicted the test data classes with perfect accuracy. This was possible because the two classes are well separated in our sample data frame.
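When the predictions are not perfect, the overall accuracy can be computed from the confusion matrix as the sum of its diagonal divided by the total count. The labels below are hypothetical and chosen so that one prediction is wrong; they are not results from the model above.

```r
# Hypothetical actual and predicted labels with one misclassification.
actual    <- factor(c("Blue", "Blue", "Red", "Red", "Red", "Blue"))
predicted <- factor(c("Blue", "Red", "Red", "Red", "Red", "Blue"),
                    levels = levels(actual))

# Confusion matrix: rows are actual classes, columns are predicted classes.
cm <- table(Actl = actual, Pred = predicted)
cm

# Correct predictions lie on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 5 of 6 correct
```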

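In practice, accuracy will be lower. Still, repeating the k-fold cross-validation procedure for each candidate model gives us an idea of the accuracy we can expect on new data that resembles the training data.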

Author: Jesse John

Jesse is passionate about data analysis and visualization. He uses the R statistical programming language for all aspects of his work.