3  Лінійна регресія. Прокат велосипедів

Курс: “Математичне моделювання в R”


Перевірити матеріали лекції!!!

Розділ про прогноз з кроком 10-30. Втрачено окремі комірки коду.

# install.packages("mice")

3.1 Dataset overview

Source: https://www.kaggle.com/c/bike-sharing-demand/data

Dataset description:

You are provided hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.

Data Fields:

  • datetime - hourly date + timestamp
  • season -
    • 1 = spring
    • 2 = summer
    • 3 = fall
    • 4 = winter
  • holiday - whether the day is considered a holiday
  • workingday - whether the day is neither a weekend nor holiday
  • weather
    • 1: Clear, Few clouds, Partly cloudy, Partly cloudy
    • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    • 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
  • temp - temperature in Celsius
  • atemp - “feels like” temperature in Celsius
  • humidity - relative humidity
  • windspeed - wind speed
  • casual - number of non-registered user rentals initiated
  • registered - number of registered user rentals initiated
  • count - number of total rentals (It is target variable. We will predict it!)
data <- read.csv("data/bikes.csv")
head(data)
A data.frame: 6 × 12
datetimeseasonholidayworkingdayweathertempatemphumiditywindspeedcasualregisteredcount
<chr><int><int><int><int><dbl><dbl><int><dbl><int><int><int>
12011-01-01 00:00:0010019.8414.395810.000031316
22011-01-01 01:00:0010019.0213.635800.000083240
32011-01-01 02:00:0010019.0213.635800.000052732
42011-01-01 03:00:0010019.8414.395750.000031013
52011-01-01 04:00:0010019.8414.395750.00000 1 1
62011-01-01 05:00:0010029.8412.880756.00320 1 1

Описова статистика факторів:

summary(data)
   datetime             season         holiday          workingday    
 Length:10886       Min.   :1.000   Min.   :0.00000   Min.   :0.0000  
 Class :character   1st Qu.:2.000   1st Qu.:0.00000   1st Qu.:0.0000  
 Mode  :character   Median :3.000   Median :0.00000   Median :1.0000  
                    Mean   :2.507   Mean   :0.02857   Mean   :0.6809  
                    3rd Qu.:4.000   3rd Qu.:0.00000   3rd Qu.:1.0000  
                    Max.   :4.000   Max.   :1.00000   Max.   :1.0000  
    weather           temp           atemp          humidity     
 Min.   :1.000   Min.   : 0.82   Min.   : 0.76   Min.   :  0.00  
 1st Qu.:1.000   1st Qu.:13.94   1st Qu.:16.66   1st Qu.: 47.00  
 Median :1.000   Median :20.50   Median :24.24   Median : 62.00  
 Mean   :1.418   Mean   :20.23   Mean   :23.66   Mean   : 61.89  
 3rd Qu.:2.000   3rd Qu.:26.24   3rd Qu.:31.06   3rd Qu.: 77.00  
 Max.   :4.000   Max.   :41.00   Max.   :45.45   Max.   :100.00  
   windspeed          casual         registered        count      
 Min.   : 0.000   Min.   :  0.00   Min.   :  0.0   Min.   :  1.0  
 1st Qu.: 7.002   1st Qu.:  4.00   1st Qu.: 36.0   1st Qu.: 42.0  
 Median :12.998   Median : 17.00   Median :118.0   Median :145.0  
 Mean   :12.799   Mean   : 36.02   Mean   :155.6   Mean   :191.6  
 3rd Qu.:16.998   3rd Qu.: 49.00   3rd Qu.:222.0   3rd Qu.:284.0  
 Max.   :56.997   Max.   :367.00   Max.   :886.0   Max.   :977.0  

3.2 Minimal data preprocessing

Перевіримо вибірку на наявність пропусків за допомогою функції md.pattern() з пакету mice:

anyNA(data)
FALSE
suppressMessages(library(mice))
md.pattern(data, F)
 /\     /\
{  `---'  }
{  O   O  }
==>  V <==  No need for mice. This data set is completely observed.
 \  \|/  /
  `-----'
A matrix: 2 × 13 of type dbl
datetimeseasonholidayworkingdayweathertempatemphumiditywindspeedcasualregisteredcount
108861111111111110
0000000000000

There are no missing data! Great for us!

Перетворимо категоріальні змінні до факторів:

data$season <- factor(data$season)
data$holiday <- factor(data$holiday)
data$workingday <- factor(data$workingday)
data$weather <- factor(data$weather)
data$holiday <- factor(data$holiday)

head(data) # preview data
A data.frame: 6 × 12
datetimeseasonholidayworkingdayweathertempatemphumiditywindspeedcasualregisteredcount
<chr><fct><fct><fct><fct><dbl><dbl><int><dbl><int><int><int>
12011-01-01 00:00:0010019.8414.395810.000031316
22011-01-01 01:00:0010019.0213.635800.000083240
32011-01-01 02:00:0010019.0213.635800.000052732
42011-01-01 03:00:0010019.8414.395750.000031013
52011-01-01 04:00:0010019.8414.395750.00000 1 1
62011-01-01 05:00:0010029.8412.880756.00320 1 1

У наборі даних наявні 3 вихідних параметра: casual,registered,count, де $casual + registered = count$. Здійснимо прогнозcount`. Видалимо змінні, які використовуватися нами не будуть:

#data$casual <- NULL
#data$registered <- NULL

# or with dplyr
suppressMessages(library(dplyr))
data <- data %>% select(-c(casual, registered))

head(data)
A data.frame: 6 × 10
datetimeseasonholidayworkingdayweathertempatemphumiditywindspeedcount
<chr><fct><fct><fct><fct><dbl><dbl><int><dbl><int>
12011-01-01 00:00:0010019.8414.395810.000016
22011-01-01 01:00:0010019.0213.635800.000040
32011-01-01 02:00:0010019.0213.635800.000032
42011-01-01 03:00:0010019.8414.395750.000013
52011-01-01 04:00:0010019.8414.395750.0000 1
62011-01-01 05:00:0010029.8412.880756.0032 1

3.3 Exploratory data analysis

Оцінимо звязки між числовими змінними та вихідним показником через кореляцію:

data %>%
 select(temp:count) %>% 
 cor() %>%
 as.data.frame()
A data.frame: 5 × 5
tempatemphumiditywindspeedcount
<dbl><dbl><dbl><dbl><dbl>
temp 1.00000000 0.98494811-0.06494877-0.01785201 0.3944536
atemp 0.98494811 1.00000000-0.04353571-0.05747300 0.3897844
humidity-0.06494877-0.04353571 1.00000000-0.31860699-0.3173715
windspeed-0.01785201-0.05747300-0.31860699 1.00000000 0.1013695
count 0.39445364 0.38978444-0.31737148 0.10136947 1.0000000

Щоб оглянути інформацію про частоту спостежень по сезонах варто скористатися записом:

data %>%
 group_by(season) %>%
 summarize(n = n())
A tibble: 4 × 2
seasonn
<fct><int>
12686
22733
32733
42734
library(ggplot2)
# llok like we have almost the same count by each season
ggplot(data, aes(season)) + 
    geom_bar(aes(fill = season)) + 
    theme_bw()

# Holidays plot
data %>%
 group_by(holiday) %>%
 summarize(n = n())

ggplot(data, aes(holiday)) + geom_bar(aes(fill = holiday)) + theme_bw()
A tibble: 2 × 2
holidayn
<fct><int>
010575
1 311

data %>%
 group_by(workingday) %>%
 summarize(n = n())

# Working dats plot
ggplot(data, aes(workingday)) + geom_bar(aes(fill = workingday)) + theme_bw()
A tibble: 2 × 2
workingdayn
<fct><int>
03474
17412

  • weather
    • 1: Clear, Few clouds, Partly cloudy, Partly cloudy
    • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    • 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
# weather plot
ggplot(data, aes(weather)) + geom_bar(aes(fill = weather)) + theme_bw()

Переглянемо також залежність між числовими факторами та кількістю прокатів байків. Для температури це матиме запис:

# temp
plot(x=data$temp, y=data$count)

ggplot(data, aes(temp, count)) + geom_point() + theme_bw()

# atemp
ggplot(data, aes(atemp, count)) + geom_point() + theme_bw()

# humidity
ggplot(data, aes(humidity, count)) + geom_point() + theme_bw()

# windspeed
ggplot(data, aes(windspeed, count)) + geom_point() + theme_bw()

Переглянемо також загальний розподіл частоти замовлень байків на гістограмі:

ggplot(data, aes(count)) + geom_histogram(bins = 25, alpha = 0.5, fill = 'blue', color='black')  + theme_bw()

При спробі побудувати графік залежності між датою та кількістю замовлень байків ми отримаємо помилку або графік, що не відповідає порядку дат, оскільки на даний момент R не сприймає поле datetime як дату. Перетворимо дату за допомгою функції as.POSIXct():

data$datetime <- as.POSIXct(data$datetime)
ggplot(data, aes(datetime, count)) + geom_point() + theme_bw()

Чи можна прослідкувати зміну кількості замовлень в залежності від пори року з цього графіку?

Покращити вигляд графіка можна додавання параметрів у geom_point():

ggplot(data, aes(datetime, count)) + geom_point(aes(color=temp)) + theme_bw()

Переглянемо як час оренди байка впливає на кількість замовлень:

ggplot(data, aes(format(datetime, "%H"), count)) + 
        geom_point(aes(color=temp), alpha = 0.5) + 
        theme_bw()

Also, we can add new column to our dataset and use it:

data <- data %>%
    mutate(hour_rent = as.numeric(format(datetime, "%H")))
head(data)
A data.frame: 6 × 11
datetimeseasonholidayworkingdayweathertempatemphumiditywindspeedcounthour_rent
<dttm><fct><fct><fct><fct><dbl><dbl><int><dbl><int><dbl>
12011-01-01 00:00:0010019.8414.395810.0000160
22011-01-01 01:00:0010019.0213.635800.0000401
32011-01-01 02:00:0010019.0213.635800.0000322
42011-01-01 03:00:0010019.8414.395750.0000133
52011-01-01 04:00:0010019.8414.395750.0000 14
62011-01-01 05:00:0010029.8412.880756.0032 15
# the same chart as before
ggplot(data, aes(hour_rent, count)) + 
        geom_point(aes(color=temp), alpha = 0.5) + 
        theme_bw()

3.4 Train test split

One more splitting method is based on date. So, we can use first 70% of dataset as train and 30% as test. This method often used for timeseries forecasting. We will use it for linear regression for now.

set.seed(10) 
nrow(data)

train_count = ceiling(0.7 * nrow(data))
train_count

test_count = nrow(data) - train_count
test_count
10886
7621
3265
data <- data %>% arrange(datetime) # sort by datetime
train_data <- data %>% slice_head(n = train_count)
nrow(train_data)
test_data <- data %>% slice_tail(n = test_count)
nrow(test_data)
7621
3265
# lets compare last from train and first from test
tail(train_data)
head(test_data) # ok!
A data.frame: 6 × 11
datetimeseasonholidayworkingdayweathertempatemphumiditywindspeedcounthour_rent
<dttm><fct><fct><fct><fct><dbl><dbl><int><dbl><int><dbl>
76162012-05-16 16:00:00201129.5232.57545 7.001544616
76172012-05-16 17:00:00201129.5233.3355112.998087317
76182012-05-16 18:00:00201129.5233.3355115.001384618
76192012-05-16 19:00:00201128.7032.5755419.001259019
76202012-05-16 20:00:00201127.0631.0606516.997945920
76212012-05-16 21:00:00201126.2430.3057312.998039321
A data.frame: 6 × 11
datetimeseasonholidayworkingdayweathertempatemphumiditywindspeedcounthour_rent
<dttm><fct><fct><fct><fct><dbl><dbl><int><dbl><int><dbl>
12012-05-16 22:00:00201126.2430.3057311.001428622
22012-05-16 23:00:00201125.4229.54578 7.001513323
32012-05-17 00:00:00201124.6028.79078 8.9981 79 0
42012-05-17 01:00:00201124.6028.0308311.0014 28 1
52012-05-17 02:00:00201124.6028.7907811.0014 16 2
62012-05-17 03:00:00201124.6029.5457316.9979 3 3

3.5 Model building

Побудуємо напройстішу модель залежності середньої температури за день та кількості орендованих байків:

lm_bike <- lm(count ~ atemp, train_data)
summary(lm_bike)

Call:
lm(formula = count ~ atemp, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-254.42  -93.91  -27.43   65.57  648.09 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -2.0774     4.4973  -0.462    0.644    
atemp         7.2602     0.1895  38.311   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 138.4 on 7619 degrees of freedom
Multiple R-squared:  0.1615,    Adjusted R-squared:  0.1614 
F-statistic:  1468 on 1 and 7619 DF,  p-value: < 2.2e-16

Спробуємо порахувати кількість орендованих байків для середніх темпертур від 10 до 30 градусів.

Створимо датафрейм з послідовністю елементів 10-30 з кроком 1 та порахуємо за формулою з інформації про модель прогнозовані показники:

Здійснимо прогноз за допомогою функції predict():

#predict_1030$count_model <- predict(lm_bike, newdata = predict_1030)
#predict_1030 %>% head()

# our results are very close, minimal difference is a result of rounding numbers in our formula
# how it looks like at chart
#ggplot(predict_1030, aes(atemp, count_model)) + 
 #       geom_point() + 
    #    theme_bw()

Побудуємо модель для усіх параметрів моделі окрім дати:

lm_bike <- lm(count~.-datetime, train_data)
summary(lm_bike)

Call:
lm(formula = count ~ . - datetime, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-283.67  -78.72  -25.68   46.27  587.17 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  15.67996    8.85332   1.771   0.0766 .  
season2       9.39674    4.59288   2.046   0.0408 *  
season3     -40.29356    6.46891  -6.229 4.95e-10 ***
season4      21.14562    4.40422   4.801 1.61e-06 ***
holiday1     -8.46508    8.79408  -0.963   0.3358    
workingday1  -4.89327    3.12276  -1.567   0.1172    
weather2      2.76675    3.47283   0.797   0.4257    
weather3    -31.30849    5.61763  -5.573 2.59e-08 ***
weather4     85.56066  122.62585   0.698   0.4854    
temp          1.34075    1.67470   0.801   0.4234    
atemp         6.30809    1.48446   4.249 2.17e-05 ***
humidity     -1.49856    0.08884 -16.868  < 2e-16 ***
windspeed     0.35086    0.19131   1.834   0.0667 .  
hour_rent     6.20760    0.21640  28.685  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 122.5 on 7607 degrees of freedom
Multiple R-squared:  0.3438,    Adjusted R-squared:  0.3427 
F-statistic: 306.6 on 13 and 7607 DF,  p-value: < 2.2e-16

Перевіримо модель на мультиколінеарність:

suppressMessages(library(car))
vif(lm_bike)
A matrix: 9 × 3 of type dbl
GVIFDfGVIF^(1/(2*Df))
season 2.94174431.197018
holiday 1.07616211.037382
workingday 1.07216311.035453
weather 1.29120731.043516
temp83.07373919.114480
atemp78.28136218.847676
humidity 1.59412411.262586
windspeed 1.33172311.154003
hour_rent 1.13543711.065569

Виключимо корельовані показники з моделі за принципом - залишити той, що має вищу кореляцію із залежним фактором. У нашому випадку це temp:

lm_bike <- lm(count~.-datetime-atemp, train_data)
summary(lm_bike)

Call:
lm(formula = count ~ . - datetime - atemp, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-297.02  -79.11  -25.60   46.35  585.12 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26.97221    8.45456   3.190  0.00143 ** 
season2       9.25257    4.59790   2.012  0.04422 *  
season3     -45.22396    6.37114  -7.098 1.38e-12 ***
season4      22.00064    4.40455   4.995 6.02e-07 ***
holiday1    -11.34365    8.77778  -1.292  0.19629    
workingday1  -4.86349    3.12625  -1.556  0.11982    
weather2      3.02977    3.47616   0.872  0.38346    
weather3    -32.19037    5.62008  -5.728 1.06e-08 ***
weather4     88.33390  122.76150   0.720  0.47182    
temp          8.34447    0.29736  28.062  < 2e-16 ***
humidity     -1.47674    0.08879 -16.632  < 2e-16 ***
windspeed     0.06683    0.17945   0.372  0.70959    
hour_rent     6.20731    0.21665  28.652  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 122.7 on 7608 degrees of freedom
Multiple R-squared:  0.3422,    Adjusted R-squared:  0.3412 
F-statistic: 329.9 on 12 and 7608 DF,  p-value: < 2.2e-16

Далі побудуємо модель на основі алгоритму stepwise та порівняємо з поточними характеристиками моделі.

Створимо моделі для проходів у два боки:

start_mod <- lm(count ~ 1, data = train_data)
end_mod <- lm(count ~ . - datetime - temp, data = train_data)

Запустимо алгоритм функцією step():

lm_bike_stepwise <- step(start_mod,
                     scope = list(lower = start_mod, upper = end_mod),
                     direction = "both", trace = 1, steps = 1000)
Start:  AIC=76488.12
count ~ 1

             Df Sum of Sq       RSS   AIC
+ atemp       1  28114587 145942663 75148
+ hour_rent   1  27600736 146456514 75174
+ humidity    1  15503642 158553609 75779
+ season      3   9131936 164925314 76083
+ weather     3   3537078 170520172 76338
+ windspeed   1   2112392 171944858 76397
<none>                    174057250 76488
+ holiday     1     12170 174045081 76490
+ workingday  1      3511 174053739 76490

Step:  AIC=75147.53
count ~ atemp

             Df Sum of Sq       RSS   AIC
+ hour_rent   1  20603748 125338915 73990
+ humidity    1  15576942 130365721 74289
+ season      3   5178271 140764392 74878
+ weather     3   2989985 142952678 74996
+ windspeed   1   2889374 143053289 74997
<none>                    145942663 75148
+ workingday  1     34765 145907898 75148
+ holiday     1        44 145942619 75150
- atemp       1  28114587 174057250 76488

Step:  AIC=73989.67
count ~ atemp + hour_rent

             Df Sum of Sq       RSS   AIC
+ humidity    1   8093815 117245100 73483
+ season      3   2905720 122433195 73817
+ weather     3   2861080 122477834 73820
+ windspeed   1   1038961 124299954 73928
<none>                    125338915 73990
+ workingday  1     29661 125309254 73990
+ holiday     1       313 125338602 73992
- hour_rent   1  20603748 145942663 75148
- atemp       1  21117599 146456514 75174

Step:  AIC=73482.93
count ~ atemp + hour_rent + humidity

             Df Sum of Sq       RSS   AIC
+ season      3   2375796 114869304 73333
+ weather     3    534048 116711052 73454
+ workingday  1     32372 117212728 73483
<none>                    117245100 73483
+ windspeed   1     22790 117222310 73483
+ holiday     1      6750 117238350 73484
- humidity    1   8093815 125338915 73990
- hour_rent   1  13120621 130365721 74289
- atemp       1  22173444 139418544 74801

Step:  AIC=73332.92
count ~ atemp + hour_rent + humidity + season

             Df Sum of Sq       RSS   AIC
+ weather     3    525600 114343704 73304
+ workingday  1     42319 114826985 73332
+ windspeed   1     31234 114838071 73333
<none>                    114869304 73333
+ holiday     1      2238 114867066 73335
- season      3   2375796 117245100 73483
- humidity    1   7563890 122433195 73817
- atemp       1  12150238 127019542 74097
- hour_rent   1  12266423 127135727 74104

Step:  AIC=73303.97
count ~ atemp + hour_rent + humidity + season + weather

             Df Sum of Sq       RSS   AIC
+ windspeed   1     77129 114266576 73301
+ workingday  1     30192 114313512 73304
<none>                    114343704 73304
+ holiday     1      3786 114339918 73306
- weather     3    525600 114869304 73333
- season      3   2367348 116711052 73454
- humidity    1   5224113 119567817 73642
- atemp       1  12055477 126399182 74066
- hour_rent   1  12568342 126912046 74097

Step:  AIC=73300.82
count ~ atemp + hour_rent + humidity + season + weather + windspeed

             Df Sum of Sq       RSS   AIC
<none>                    114266576 73301
+ workingday  1     28599 114237977 73301
+ holiday     1      4334 114262242 73303
- windspeed   1     77129 114343704 73304
- weather     3    571495 114838071 73333
- season      3   2382626 116649202 73452
- humidity    1   4315759 118582335 73581
- atemp       1  12116310 126382885 74067
- hour_rent   1  12411506 126678081 74085
# check the summary
summary(lm_bike_stepwise)

Call:
lm(formula = count ~ atemp + hour_rent + humidity + season + 
    weather + windspeed, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-285.56  -78.70  -25.63   47.20  586.14 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.56655    8.35898   1.264   0.2062    
atemp         7.46224    0.26269  28.407  < 2e-16 ***
hour_rent     6.21766    0.21626  28.750  < 2e-16 ***
humidity     -1.50051    0.08851 -16.954  < 2e-16 ***
season2       9.95637    4.55753   2.185   0.0289 *  
season3     -38.55100    6.14342  -6.275 3.68e-10 ***
season4      21.22676    4.40003   4.824 1.43e-06 ***
weather2      2.68041    3.47136   0.772   0.4400    
weather3    -31.47986    5.60817  -5.613 2.06e-08 ***
weather4     83.31112  122.62122   0.679   0.4969    
windspeed     0.40641    0.17932   2.266   0.0235 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 122.5 on 7610 degrees of freedom
Multiple R-squared:  0.3435,    Adjusted R-squared:  0.3426 
F-statistic: 398.2 on 10 and 7610 DF,  p-value: < 2.2e-16

\(R^2\) increses!

Об’єднаємо для наочності коефіцієнти у один датафрейм:

lm_coefs_def <- data.frame(lm_name = names(lm_bike$coefficients),
                           name = lm_bike$coefficients, 
                           row.names = c() # cleaning rownames
                          )
lm_coefs_def
A data.frame: 13 × 2
lm_namename
<chr><dbl>
(Intercept) 26.97221255
season2 9.25256711
season3 -45.22395805
season4 22.00064078
holiday1 -11.34364536
workingday1 -4.86348665
weather2 3.02977126
weather3 -32.19036573
weather4 88.33390060
temp 8.34447012
humidity -1.47674448
windspeed 0.06683245
hour_rent 6.20730753
lm_coefs_sw <- data.frame(lm_name = names(lm_bike_stepwise$coefficients), name = lm_bike_stepwise$coefficients, row.names = c())
lm_coefs_sw
A data.frame: 11 × 2
lm_namename
<chr><dbl>
(Intercept) 10.5665521
atemp 7.4622378
hour_rent 6.2176579
humidity -1.5005074
season2 9.9563690
season3 -38.5510028
season4 21.2267554
weather2 2.6804117
weather3 -31.4798561
weather4 83.3111193
windspeed 0.4064147
# lets combine coefficients from both models
lm_coefs <- merge(lm_coefs_def, lm_coefs_sw, by = "lm_name", X.all = T, X.all = T)
colnames(lm_coefs) <- c("variable", "lm", "lm_sw")
lm_coefs
A data.frame: 10 × 3
variablelmlm_sw
<chr><dbl><dbl>
(Intercept) 26.97221255 10.5665521
hour_rent 6.20730753 6.2176579
humidity -1.47674448 -1.5005074
season2 9.25256711 9.9563690
season3 -45.22395805-38.5510028
season4 22.00064078 21.2267554
weather2 3.02977126 2.6804117
weather3 -32.19036573-31.4798561
weather4 88.33390060 83.3111193
windspeed 0.06683245 0.4064147

Як бачимо моделі мають незначні відмінності.


3.6 Model errors analysis

Запишемо у train набір даних модельовані значення count та похибки:

train_data$predicted <- lm_bike$fitted.values
train_data$residuals <- lm_bike$residuals
head(train_data)
A data.frame: 6 × 13
datetimeseasonholidayworkingdayweathertempatemphumiditywindspeedcounthour_rentpredictedresiduals
<dttm><fct><fct><fct><fct><dbl><dbl><int><dbl><int><dbl><dbl><dbl>
12011-01-01 00:00:0010019.8414.395810.0000160-10.534505 26.534505
22011-01-01 01:00:0010019.0213.635800.0000401 -9.692918 49.692918
32011-01-01 02:00:0010019.0213.635800.0000322 -3.485611 35.485611
42011-01-01 03:00:0010019.8414.395750.0000133 16.947885 -3.947885
52011-01-01 04:00:0010019.8414.395750.0000 14 23.155192-22.155192
62011-01-01 05:00:0010029.8412.880756.0032 15 32.793480-31.793480

Візуалізуємо відхилення моделі:

plot_data <- train_data[1:100,]
ggplot(plot_data, aes(x = c(1:nrow(plot_data)), y = as.numeric(count))) +
  geom_segment(aes(xend = c(1:nrow(plot_data)), yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals), size = 2) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1, size = 2) +
  geom_hline(yintercept=0.5, linetype="dashed", color = "red") +
  theme_bw()
Warning message:
"`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead."

Lets check the same for test data

First we need to predict values:

test_data$predicted <- predict(lm_bike, newdata = test_data)
head(test_data)
A data.frame: 6 × 12
datetimeseasonholidayworkingdayweathertempatemphumiditywindspeedcounthour_rentpredicted
<dttm><fct><fct><fct><fct><dbl><dbl><int><dbl><int><dbl><dbl>
12012-05-16 22:00:00201126.2430.3057311.001428622279.8139
22012-05-16 23:00:00201125.4229.54578 7.001513323271.5277
32012-05-17 00:00:00201124.6028.79078 8.9981 79 0122.0506
42012-05-17 01:00:00201124.6028.0308311.0014 28 1121.0080
52012-05-17 02:00:00201124.6028.7907811.0014 16 2134.5991
62012-05-17 03:00:00201124.6029.5457316.9979 3 3148.5908
# resulduals calculation
test_data$residuals <- test_data$count - test_data$predicted
head(test_data)
A data.frame: 6 × 13
datetimeseasonholidayworkingdayweathertempatemphumiditywindspeedcounthour_rentpredictedresiduals
<dttm><fct><fct><fct><fct><dbl><dbl><int><dbl><int><dbl><dbl><dbl>
12012-05-16 22:00:00201126.2430.3057311.001428622279.8139 6.186142
22012-05-16 23:00:00201125.4229.54578 7.001513323271.5277-138.527654
32012-05-17 00:00:00201124.6028.79078 8.9981 79 0122.0506 -43.050553
42012-05-17 01:00:00201124.6028.0308311.0014 28 1121.0080 -93.008024
52012-05-17 02:00:00201124.6028.7907811.0014 16 2134.5991-118.599054
62012-05-17 03:00:00201124.6029.5457316.9979 3 3148.5908-145.590844
plot_data <- test_data[1:100,]
ggplot(plot_data, aes(x = c(1:nrow(plot_data)), y = as.numeric(count))) +
  geom_segment(aes(xend = c(1:nrow(plot_data)), yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals), size = 2) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1, size = 2) +
  geom_hline(yintercept=0.5, linetype="dashed", color = "red") +
  theme_bw()
Warning message:
"`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead."

library(modelr)
resids <- data.frame(
  R2 = c(rsquare(lm_bike, data = train_data), rsquare(lm_bike, data = test_data)),
  MSE = c(mse(lm_bike, data = train_data),mse(lm_bike, data = test_data)),
  RMSE = c(rmse(lm_bike, data = train_data), rmse(lm_bike, data = test_data))#,
  #MAE = c(mae(lm_bike, data = train_data), mae(lm_bike, data = test_data)), # if needed 
  #MAPE = c(mape(lm_bike, data = train_data), mape(lm_bike, data = test_data)) # if needed
)
          
rownames(resids) <- c("train", "test") # give names for rows

resids
A data.frame: 2 × 3
R2MSERMSE
<dbl><dbl><dbl>
train0.342244315022.59122.5667
test0.285474140401.94201.0023

Conclusion: Test set has less RSquared and Bigger Errors variation. Train dataset is closer, because it built on training data and model include information about data.