11 How clustering improves the accuracy of supervised models (kmeans + randomForest)

Course: Mathematical Modeling in R

Caution. Before relying on this approach to improve prediction accuracy, verify that it actually works on your data.

# install.packages("carData")

library(randomForest)
library(cluster)
library(modelr)
library(dplyr)

randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.


Attaching package: 'dplyr'


The following object is masked from 'package:randomForest':

    combine


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Sys.setlocale("LC_CTYPE", "ukrainian")
options(warn = -1)

'Ukrainian_Ukraine.1251'

11.1 Dataset

The dataset is described in detail in the material “Decision Trees. Regression. Credit Card Balance” (Note: add link)

data <- read.csv("data/credit_card_balance.csv")

Let's look at the structure of the data:

str(data)

'data.frame':   400 obs. of  12 variables:
 $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Income   : num  14.9 106 104.6 148.9 55.9 ...
 $ Limit    : int  3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
 $ Rating   : int  283 483 514 681 357 569 259 512 266 491 ...
 $ Cards    : int  2 3 4 3 2 4 2 2 5 3 ...
 $ Age      : int  34 82 71 36 68 77 37 87 66 41 ...
 $ Education: int  11 15 11 11 16 10 12 9 13 19 ...
 $ Gender   : chr  "Male" "Female" "Male" "Female" ...
 $ Student  : chr  "No" "Yes" "No" "No" ...
 $ Married  : chr  "Yes" "Yes" "No" "No" ...
 $ Ethnicity: chr  "Caucasian" "Asian" "Asian" "Asian" ...
 $ Balance  : int  333 903 580 964 331 1151 203 872 279 1350 ...

Prepare the data for modeling: drop the row-index column X and convert the categorical variables to factors:

data$X <- NULL
data$Gender <-  factor(data$Gender)
data$Student <- factor(data$Student)
data$Married <- factor(data$Married)
data$Ethnicity <- factor(data$Ethnicity)
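
The same conversion can be written in a single step with dplyr. This is a minimal sketch on a small made-up data frame; the `toy` object and its values are illustrative and are not taken from the credit card dataset.

```r
library(dplyr)

# Hypothetical toy data that mimics the column types of the credit data
toy <- data.frame(
  Income  = c(14.9, 106.0, 104.6),
  Gender  = c("Male", "Female", "Male"),
  Student = c("No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; numeric columns are left as-is
toy <- toy |>
  mutate(across(where(is.character), factor))

str(toy)
```

This avoids listing each column by name, which helps when a dataset has many categorical variables.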

11.2 Training and test sets

Split the data into training and test sets:

set.seed(2) 
train_index <- sample(nrow(data), size = 0.5*nrow(data))
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
nrow(train_data)
nrow(test_data)

200

200
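
A quick sanity check for an index-based split is to confirm that the two index sets are disjoint and together cover all rows. The sketch below works on row indices only and assumes 400 rows, as reported by `str(data)`; `test_index` is a helper name introduced here for illustration.

```r
set.seed(2)
n <- 400  # number of rows, as reported by str(data)

# Draw half of the row indices for training
train_index <- sample(n, size = 0.5 * n)

# Everything that was not drawn goes to the test set
test_index <- setdiff(seq_len(n), train_index)

length(train_index)
length(test_index)
length(intersect(train_index, test_index))
```

Negative indexing with `data[-train_index, ]` selects the same rows as `test_index`.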

11.3 Building a randomForest model

rf_m1 <- randomForest(Balance ~ ., data = train_data)
summary(rf_m1)

                Length Class  Mode     
call              3    -none- call     
type              1    -none- character
predicted       200    -none- numeric  
mse             500    -none- numeric  
rsq             500    -none- numeric  
oob.times       200    -none- numeric  
importance       10    -none- numeric  
importanceSD      0    -none- NULL     
localImportance   0    -none- NULL     
proximity         0    -none- NULL     
ntree             1    -none- numeric  
mtry              1    -none- numeric  
forest           11    -none- list     
coefs             0    -none- NULL     
y               200    -none- numeric  
test              0    -none- NULL     
inbag             0    -none- NULL     
terms             3    terms  call

test_prediction1 <- round(predict(object=rf_m1, test_data))
head(test_prediction1)

2: 1038
4: 1331
5: 433
6: 1010
7: 182
10: 1224

Check the coefficient of determination (R²) and the prediction error (RMSE) on the training and test sets:

rsquare(rf_m1, data = train_data)
rsquare(rf_m1, data = test_data)

0.973839956466162

0.883183673751305

rmse(rf_m1, data = train_data)
rmse(rf_m1, data = test_data)

76.9630101327789

150.63563844498
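
To make these two metrics concrete, here is a minimal sketch of the textbook definitions computed by hand. The `actual` and `predicted` vectors are made-up numbers, and the formulas are the common definitions; modelr's helpers may define R² differently, so its documentation is the place to check before comparing values directly.

```r
# Hypothetical observed and predicted values (not from the model above)
actual    <- c(333, 903, 580, 964, 331)
predicted <- c(350, 880, 600, 940, 300)

# Root mean squared error: typical size of a prediction error
rmse_manual <- sqrt(mean((actual - predicted)^2))

# R squared as 1 - (residual sum of squares / total sum of squares)
ss_res <- sum((actual - predicted)^2)
ss_tot <- sum((actual - mean(actual))^2)
r2_manual <- 1 - ss_res / ss_tot

rmse_manual
r2_manual
```

RMSE is expressed in the units of Balance, so a lower value on the test set means predictions are closer to the observed balances.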

11.4 Clustering the numeric data

First, combine all the data back into a single dataset. Keep in mind that the split is 200/200 rows, with the training rows first:

data <- train_data |> bind_rows(test_data)

Generate, for example, 4 clusters using only the numeric variables and excluding Balance (why without Balance?):

data_k <- data |>
    select(-Balance) |>
    select_if(is.numeric)
head(data_k)

A data.frame: 6 × 6
	Income	Limit	Rating	Cards	Age	Education
	<dbl>	<int>	<int>	<int>	<int>	<int>
1	27.794	3807	301	4	35	8
2	50.699	3977	304	2	84	17
3	180.379	9310	665	3	67	8
4	73.327	6555	472	2	43	15
5	30.413	3690	299	2	25	15
6	13.433	1134	112	3	70	14

set.seed(2)
clusters <- kmeans(data_k, 4)
clusters$cluster
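
k-means uses Euclidean distance, so a variable measured in thousands (such as Limit) outweighs one measured in single digits (such as Cards). A common remedy is to standardize the columns first. The sketch below uses made-up data, and `nstart = 25` is an illustrative choice; it is not the configuration used in this chapter.

```r
set.seed(2)

# Hypothetical numeric data with very different scales
toy <- data.frame(
  Limit = round(runif(60, 1000, 10000)),
  Cards = sample(1:6, 60, replace = TRUE)
)

# Standardize each column to mean 0 and standard deviation 1
toy_scaled <- scale(toy)

# nstart > 1 reruns k-means from several random starts and keeps the best fit
km <- kmeans(toy_scaled, centers = 4, nstart = 25)

table(km$cluster)
```

Whether scaling helps the downstream model is an empirical question, so it is worth comparing both variants on the test set.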

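Add the cluster labels to the data as a new column. Note that clusters$cluster is an integer vector, so as written the model treats it as a numeric predictor; wrap it in factor() if it should be categorical: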

data <- data |>
    mutate(cluster = clusters$cluster)
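
Cluster numbers are arbitrary labels, so treating them as a categorical variable is often the more natural encoding. A minimal sketch with made-up data, assuming the label should become a factor (this changes the model input, so the results printed in this chapter would not be reproduced exactly):

```r
library(dplyr)

# Hypothetical one-column data with two obvious groups
toy <- data.frame(x = c(1.0, 1.2, 8.0, 8.3))

set.seed(2)
km <- kmeans(toy, centers = 2)

# Store the cluster label as a factor rather than an integer
toy <- toy |>
  mutate(cluster = factor(km$cluster))

str(toy)
```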

Split the data into training and test sets again, using the row positions:

train_data <- data[1:200,]
test_data <- data[201:400,]
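
Because the clusters here are fitted on the combined data, the test rows influence the cluster centers. One way to avoid that is to fit k-means on the training rows only and then assign each test row to its nearest center. The sketch below uses simulated features; `nearest_centroid` is a hypothetical helper written for this example, not a function from any package.

```r
set.seed(2)

# Hypothetical numeric features, already split into train and test
train_x <- matrix(rnorm(200 * 2), ncol = 2)
test_x  <- matrix(rnorm(50 * 2),  ncol = 2)

# Fit the clusters on the training rows only
km <- kmeans(train_x, centers = 4, nstart = 25)

# Assign each row of new_x to the closest center (squared Euclidean distance)
nearest_centroid <- function(new_x, centers) {
  unname(apply(new_x, 1, function(row) {
    which.min(colSums((t(centers) - row)^2))
  }))
}

train_cluster <- km$cluster
test_cluster  <- nearest_centroid(test_x, km$centers)

table(test_cluster)
```

With this variant the test set stays unseen until prediction time, which makes the comparison between the two models stricter.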

rf_m2 <- randomForest(Balance ~ ., data = train_data)
summary(rf_m2)

                Length Class  Mode     
call              3    -none- call     
type              1    -none- character
predicted       200    -none- numeric  
mse             500    -none- numeric  
rsq             500    -none- numeric  
oob.times       200    -none- numeric  
importance       11    -none- numeric  
importanceSD      0    -none- NULL     
localImportance   0    -none- NULL     
proximity         0    -none- NULL     
ntree             1    -none- numeric  
mtry              1    -none- numeric  
forest           11    -none- list     
coefs             0    -none- NULL     
y               200    -none- numeric  
test              0    -none- NULL     
inbag             0    -none- NULL     
terms             3    terms  call

test_prediction2 <- round(predict(object=rf_m2, test_data))
head(test_prediction2)

201: 1026
202: 1352
203: 423
204: 1036
205: 173
206: 1211

rsquare(rf_m1, data = train_data)
rsquare(rf_m1, data = test_data)

0.973839956466162

0.883183673751305

rsquare(rf_m2, data = train_data)
rsquare(rf_m2, data = test_data)

0.974638302729307

0.892714294123845

rmse(rf_m1, data = train_data)
rmse(rf_m1, data = test_data)

76.9630101327789

150.63563844498

rmse(rf_m2, data = train_data)
rmse(rf_m2, data = test_data)

75.7779193630034

144.501399436682
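
To put the two test RMSE values side by side, the relative change can be computed directly from the numbers printed above. This is plain arithmetic on a single split with a single seed; a difference of this size could plausibly come from the randomness of the forest and of the split, so repeating the experiment with several seeds is a reasonable check before drawing conclusions.

```r
# Test-set RMSE values copied from the output above
rmse_m1_test <- 150.63563844498   # model without the cluster feature
rmse_m2_test <- 144.501399436682  # model with the cluster feature

# Relative reduction in test RMSE
improvement <- (rmse_m1_test - rmse_m2_test) / rmse_m1_test

round(100 * improvement, 2)  # percentage reduction
```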