25  Feature engineering in R


You need these packages for code execution:

# install.packages("ggplot2")
install.packages("gridExtra")
# install.packages("scorecard")
# install.packages("caret")
# install.packages("gmodels")
# install.packages("superml") # you also need the R6 package
package 'gridExtra' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    D:\Temp\Rtmpum9OKD\downloaded_packages
invisible(Sys.setlocale("LC_ALL", "Ukrainian"))
invisible(options(warn=-1))

25.1 What’s Feature Engineering?

Feature engineering is the most important technique used in creating machine learning models.

Feature engineering is a broad term covering the many operations performed on variables (features) to fit them to an algorithm. It helps increase the accuracy of the model and thereby improves the quality of the predictions. Feature-engineered machine learning models perform better on data than basic machine learning models. The main aspects of feature engineering are the following [1]:

  1. Feature Scaling: performed to bring the features onto the same scale (e.g. for Euclidean distance).
  2. Feature Transformation: performed to normalize the data (feature) by applying a function.
  3. Feature Construction: performed to create new features based on the original descriptors, to improve the accuracy of the predictive model.

A "feature" in the context of predictive modeling is just another name for a predictor variable. Feature engineering is the general term for creating and manipulating predictors so that a good predictive model can be created.


25.2 Feature Scaling

Feature Scaling refers to putting the values in the same range or same scale so that no variable is dominated by the other.

Most of the time, your dataset will contain features that vary widely in magnitude, units and range. Since most machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem.

If left alone, these algorithms take in only the magnitude of features, neglecting the units. The results would vary greatly between different units, e.g. 5 kg and 5000 g. Features with high magnitudes will weigh far more in the distance calculations than features with low magnitudes. To suppress this effect, we need to bring all features to the same level of magnitude. This can be achieved by scaling.
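A quick sketch of this effect on a toy pair of points (hypothetical weight/height values, not from the churn dataset used below):

```r
# Two observations described by weight in grams and height in metres
a <- c(weight_g = 5000, height_m = 1.8)
b <- c(weight_g = 5100, height_m = 1.2)

# Euclidean distance: the 100 g weight gap dominates,
# the 0.6 m height gap is almost invisible
d <- sqrt(sum((a - b)^2))
d  # about 100.0018
```

Without scaling, the distance is essentially determined by the gram-scale feature alone.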

Here’s the curious thing about feature scaling: it significantly improves the performance of some machine learning algorithms and does nothing at all for others.

Also, what’s the difference between normalization and standardization? These are two of the most commonly used feature scaling techniques in machine learning, but a level of ambiguity exists in their understanding.


25.2.1 Normalization

25.2.1.1 Theory

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.

Here’s the formula for normalization:

\(X' = \frac{X-X_{min}}{X_{max} - X_{min}}\)

Here, \(X_{max}\) and \(X_{min}\) are the maximum and the minimum values of the feature respectively.

When the value of \(X\) is the minimum value in the column, the numerator will be \(0\), and hence \(X'\) is \(0\).

On the other hand, when the value of \(X\) is the maximum value in the column, the numerator is equal to the denominator and thus the value of \(X'\) is \(1\).

If the value of \(X\) is between the minimum and the maximum value, then the value of \(X'\) is between \(0\) and \(1\).
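The formula above can be checked on a small toy vector (values chosen purely for illustration):

```r
# Min-max normalization of a toy vector
x <- c(2, 4, 6, 10)
x_norm <- (x - min(x)) / (max(x) - min(x))
x_norm  # 0.00 0.25 0.50 1.00: the minimum maps to 0, the maximum to 1
```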


25.2.1.2 Practice

So, let’s implement our own normalization function.

# Let's use the client churn dataset from Telco: https://www.kaggle.com/blastchar/telco-customer-churn
churn_data <- read.csv("data/telecom_users.csv")
head(churn_data)
A data.frame: 6 × 22
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
3 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
4 6739 6994-KERXL Male 0 No No 4 Yes No DSL No No No Yes Month-to-month Yes Electronic check 55.9 238.50 No
5 432 2181-UAESM Male 0 No No 2 Yes No DSL Yes No No No Month-to-month No Electronic check 53.45 119.50 No
6 2215 4312-GVYNH Female 0 Yes No 70 No No phone service DSL Yes Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No
str(churn_data)
'data.frame':   5986 obs. of  22 variables:
 $ X               : int  1869 4528 6344 6739 432 2215 5260 6001 1480 5137 ...
 $ customerID      : chr  "7010-BRBUU" "9688-YGXVR" "9286-DOJGF" "6994-KERXL" ...
 $ gender          : chr  "Male" "Female" "Female" "Male" ...
 $ SeniorCitizen   : int  0 0 1 0 0 0 0 0 0 1 ...
 $ Partner         : chr  "Yes" "No" "Yes" "No" ...
 $ Dependents      : chr  "Yes" "No" "No" "No" ...
 $ tenure          : int  72 44 38 4 2 70 33 1 39 55 ...
 $ PhoneService    : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ MultipleLines   : chr  "Yes" "No" "Yes" "No" ...
 $ InternetService : chr  "No" "Fiber optic" "Fiber optic" "DSL" ...
 $ OnlineSecurity  : chr  "No internet service" "No" "No" "No" ...
 $ OnlineBackup    : chr  "No internet service" "Yes" "No" "No" ...
 $ DeviceProtection: chr  "No internet service" "Yes" "No" "No" ...
 $ TechSupport     : chr  "No internet service" "No" "No" "No" ...
 $ StreamingTV     : chr  "No internet service" "Yes" "No" "No" ...
 $ StreamingMovies : chr  "No internet service" "No" "No" "Yes" ...
 $ Contract        : chr  "Two year" "Month-to-month" "Month-to-month" "Month-to-month" ...
 $ PaperlessBilling: chr  "No" "Yes" "Yes" "Yes" ...
 $ PaymentMethod   : chr  "Credit card (automatic)" "Credit card (automatic)" "Bank transfer (automatic)" "Electronic check" ...
 $ MonthlyCharges  : chr  "24.1" "88.15" "74.95" "55.9" ...
 $ TotalCharges    : num  1735 3973 2870 238 120 ...
 $ Churn           : chr  "No" "No" "Yes" "No" ...
# next check summary of values
summary(churn_data)

# check TotalCharges field
       X         customerID           gender          SeniorCitizen   
 Min.   :   0   Length:5986        Length:5986        Min.   :0.0000  
 1st Qu.:1777   Class :character   Class :character   1st Qu.:0.0000  
 Median :3546   Mode  :character   Mode  :character   Median :0.0000  
 Mean   :3534                                         Mean   :0.1614  
 3rd Qu.:5292                                         3rd Qu.:0.0000  
 Max.   :7042                                         Max.   :1.0000  
                                                                      
   Partner           Dependents            tenure      PhoneService      
 Length:5986        Length:5986        Min.   : 0.00   Length:5986       
 Class :character   Class :character   1st Qu.: 9.00   Class :character  
 Mode  :character   Mode  :character   Median :29.00   Mode  :character  
                                       Mean   :32.47                     
                                       3rd Qu.:56.00                     
                                       Max.   :72.00                     
                                                                         
 MultipleLines      InternetService    OnlineSecurity     OnlineBackup      
 Length:5986        Length:5986        Length:5986        Length:5986       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 DeviceProtection   TechSupport        StreamingTV        StreamingMovies   
 Length:5986        Length:5986        Length:5986        Length:5986       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   Contract         PaperlessBilling   PaymentMethod      MonthlyCharges    
 Length:5986        Length:5986        Length:5986        Length:5986       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  TotalCharges       Churn          
 Min.   :  18.8   Length:5986       
 1st Qu.: 404.3   Class :character  
 Median :1412.2   Mode  :character  
 Mean   :2298.1                     
 3rd Qu.:3847.0                     
 Max.   :8684.8                     
 NA's   :10                         

Next, let’s build a histogram of TotalCharges with ggplot2 and check how the values are distributed:

#install.packages("ggplot2")
library(ggplot2)

ggplot() - function for building charts

data - first parameter - the dataset

aes() - aesthetics - construct aesthetic mappings (axes)

geom_CHART_TYPE() - sets the chart type

geom_histogram() - histograms and frequency polygons

theme_set() - theme configuration

library(ggplot2)
ggplot(data = churn_data, aes(x=TotalCharges)) + geom_histogram(bins = 15)

Warning message:
"Removed 10 rows containing non-finite values (stat_bin)."

# Let's replace missing values in TotalCharges with 0
library(magrittr) # for the %>% pipe, if not already loaded
library(dplyr) # for the mutate function
churn_data <- churn_data %>%
            mutate(TotalCharges = ifelse(is.na(TotalCharges), 0 , TotalCharges))

ggplot(churn_data, aes(x=TotalCharges)) + geom_histogram(bins = 15)

Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

# Let's implement our own normalization function using the formula explained earlier
normalizeData <- function(x) {
    return ((x - min(x)) / (max(x) - min(x)))
}
library(dplyr)
# Normalize TotalCharges
churn_data <- churn_data |>
    filter(!is.na(TotalCharges)) |>
    mutate(TotalChargesNorm = normalizeData(TotalCharges))

churn_data %>% head() # check the last columns
A data.frame: 6 × 23
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn TotalChargesNorm
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No 0.19799792
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No 0.45631202
3 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes 0.32899261
4 6739 6994-KERXL Male 0 No No 4 Yes No DSL No No Yes Month-to-month Yes Electronic check 55.9 238.50 No 0.02535195
5 432 2181-UAESM Male 0 No No 2 Yes No DSL No No No Month-to-month No Electronic check 53.45 119.50 No 0.01162012
6 2215 4312-GVYNH Female 0 Yes No 70 No No phone service DSL Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No 0.38672975
#summary for the last field
summary(churn_data$TotalChargesNorm)

# values range from 0 to 1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.04449 0.16078 0.26301 0.44174 1.00000 
# And lets make a histogram
ggplot(churn_data, aes(x=TotalChargesNorm)) + geom_histogram(bins = 15)

We observe identical histogram shapes; only the TotalCharges / TotalChargesNorm axis is rescaled.

This shows that normalization did not affect the shape of the distribution of the rescaled data.


25.2.2 Standardization

25.2.2.1 Theory

Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

Here’s the formula for standardization:

\(X' = \frac{X-\mu}{\sigma}\)

Here \(\mu\) is the mean of the feature values and \(\sigma\) is the standard deviation of the feature values. Note that in this case the values are not restricted to a particular range.
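A minimal check of the formula on a toy vector (illustrative values): after standardization the mean is 0 and the standard deviation is 1.

```r
x <- c(10, 20, 30)
x_stand <- (x - mean(x)) / sd(x)
x_stand        # -1 0 1
mean(x_stand)  # 0
sd(x_stand)    # 1
```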

Now, the big question in your mind must be when should we use normalization and when should we use standardization?

Normalization vs. standardization is an eternal question among machine learning newcomers. Let me elaborate on the answer in this section.

Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful in algorithms that do not assume any distribution of the data like K-Nearest Neighbors and Neural Networks.

Standardization, on the other hand, can be helpful in cases where the data follows a Gaussian distribution. However, this does not have to be necessarily true. Also, unlike normalization, standardization does not have a bounding range. So even if you have outliers in your data, standardization will not compress them into a fixed interval (though they will still influence the estimated mean and standard deviation). However, at the end of the day, the choice between normalization and standardization depends on your problem and the machine learning algorithm you are using. There is no hard and fast rule to tell you when to normalize or standardize your data. You can always start by fitting your model to raw, normalized and standardized data and compare the performance for best results.

It is a good practice to fit the scaler on the training data and then use it to transform the testing data. This would avoid any data leakage during the model testing process. Also, the scaling of target values is generally not required.


25.2.2.2 Practice

Let’s write our own function for standardization:

# compute the mean and standard deviation from the data itself

standartize <- function(data) {
    sdev <- sd(data, na.rm = TRUE)
    (data - mean(data, na.rm = TRUE)) / sdev
}
# Standardize TotalCharges
churn_data <- churn_data %>%
    mutate(TotalChargesStand = standartize(TotalCharges))

churn_data %>% head() # check the last columns
A data.frame: 6 × 24
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn TotalChargesNorm TotalChargesStand
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No 0.19799792 -0.2477481
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No 0.45631202 0.7366076
3 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes 0.32899261 0.2514325
4 6739 6994-KERXL Male 0 No No 4 Yes No DSL No Yes Month-to-month Yes Electronic check 55.9 238.50 No 0.02535195 -0.9056488
5 432 2181-UAESM Male 0 No No 2 Yes No DSL No No Month-to-month No Electronic check 53.45 119.50 No 0.01162012 -0.9579766
6 2215 4312-GVYNH Female 0 Yes No 70 No No phone service DSL No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No 0.38672975 0.4714509

Let’s compare the data distributions for normalization and standardization.

#install.packages("gridExtra") # to view 2+ ggplots 
library(gridExtra)
n_plot <- ggplot(churn_data, aes(x=TotalChargesNorm)) + geom_histogram(bins = 15)
s_plot <- ggplot(churn_data, aes(x=TotalChargesStand)) + geom_histogram(bins = 15)
init_plot <- ggplot(churn_data, aes(x=TotalCharges)) + geom_histogram(bins = 15)
grid.arrange(n_plot, init_plot, s_plot, ncol=3) # from gridExtra

# the scale changed after standardization, but the shape of the distribution did not

Attaching package: 'gridExtra'


The following object is masked from 'package:dplyr':

    combine

So, let’s use the standard R scale() function and compare results:

# scale() centers by the mean and divides by the standard deviation by default
churn_data <- churn_data %>%
        mutate(TotalChargesScaled = as.numeric(scale(TotalCharges))) 

s1_plot <- ggplot(churn_data, aes(x=TotalChargesStand)) + geom_histogram(bins = 15)
s2_plot <- ggplot(churn_data, aes(x=TotalChargesScaled)) + geom_histogram(bins = 15)
grid.arrange(s1_plot, s2_plot, ncol=2) # it looks like we created the same function 

mean(churn_data$TotalCharges)
2298.06061746988

Looks like the default function gives the same result as ours.

head(churn_data) # check the last two columns, they are the same
A data.frame: 6 × 25
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService ... StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn TotalChargesNorm TotalChargesStand TotalChargesScaled
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> ... <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No ... No internet service Two year No Credit card (automatic) 24.1 1734.65 No 0.19973402 -0.2460559 -0.2460559
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic ... No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No 0.45748895 0.7382838 0.7382838
3 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic ... No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes 0.33044515 0.2531165 0.2531165
4 6739 6994-KERXL Male 0 No No 4 Yes No DSL ... Yes Month-to-month Yes Electronic check 55.9 238.50 No 0.02746177 -0.9039460 -0.9039460
5 432 2181-UAESM Male 0 No No 2 Yes No DSL ... No Month-to-month No Electronic check 53.45 119.50 No 0.01375967 -0.9562729 -0.9562729
6 2215 4312-GVYNH Female 0 Yes No 70 No No phone service DSL ... Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No 0.38805730 0.4731314 0.4731314

25.2.3 Scaling for train/test/validation/prediction

If you use scaling, the initial scaling parameters should be stored somewhere for future prediction data and re-applied to each new/test/validation/prediction dataset.

As an experiment, let’s split our dataset into train and test:

library(caret)
set.seed(2021)
 
index = createDataPartition(churn_data$Churn, p = 0.70, list = FALSE)
train = churn_data[index, ]
test = churn_data[-index, ]

nrow(churn_data)
nrow(train)
nrow(test)
Loading required package: lattice
5976
4184
1792

Let’s rescale the TotalCharges data for the training set:

train <- train %>% mutate(TotalChargesScaled = scale(TotalCharges))
head(train) # you can see that TotalChargesStand and TotalChargesScaled differ, because the mean and standard deviation of the training subset are different
A data.frame: 6 × 25
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn TotalChargesNorm TotalChargesStand TotalChargesScaled
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl[,1]>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service Two year No Credit card (automatic) 24.1 1734.65 No 0.19799792 -0.2477481 -0.2500210
5 432 2181-UAESM Male 0 No No 2 Yes No DSL No Month-to-month No Electronic check 53.45 119.50 No 0.01162012 -0.9579766 -0.9563465
9 1480 8898-KASCD Male 0 No No 39 No No phone service DSL No One year No Mailed check 35.55 1309.15 No 0.14889799 -0.4348528 -0.4360975
10 5137 8016-NCFVO Male 1 No No 55 Yes Yes Fiber optic Yes Month-to-month Yes Electronic check 116.5 6382.55 No 0.73433533 1.7960690 1.7825644
11 3169 4578-PHJYZ Male 0 Yes Yes 52 Yes No DSL No One year Yes Electronic check 68.75 3482.85 No 0.39972883 0.5209864 0.5144889
12 4653 2091-MJTFX Female 0 Yes Yes 30 No No phone service DSL Yes Month-to-month No Credit card (automatic) 51.2 1561.50 Yes 0.17801754 -0.3238872 -0.3257417

So, for the train, test and prediction data we should use the same scaling base, in this case the same mean and standard deviation.

Correct data scaling code should be like this:

# fix the mean and sd computed on the training set
meanTotalCharges = mean(train$TotalCharges, na.rm = T)
sdTotalCharges = sd(train$TotalCharges, na.rm = T)

# wrap scale() in as.numeric() to get a plain vector instead of a one-column matrix
train <- train %>% mutate(TotalChargesScaled = as.numeric(scale(TotalCharges, center = meanTotalCharges, scale = sdTotalCharges)))
test <- test %>% mutate(TotalChargesScaled = as.numeric(scale(TotalCharges, center = meanTotalCharges, scale = sdTotalCharges))) # use the parameters of the train set

sd(train$TotalChargesScaled)
sd(test$TotalChargesScaled)

# check the same value TotalCharges == 0 in the train and test sets
head(train %>% filter(TotalCharges == 0))
head(test %>% filter(TotalCharges == 0))
1
0.981778490083862
# compare with the whole dataset where TotalCharges == 0
filter(churn_data, TotalCharges == 0)
# TotalChargesScaled is the same in train/test, but in churn_data it is different, because of different scaling bases
A data.frame: 10 × 25
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService ... StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn TotalChargesNorm TotalChargesStand TotalChargesScaled
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> ... <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
6754 2775-SEFEE Male 0 No Yes 0 Yes Yes DSL ... No Two year Yes Bank transfer (automatic) 61.9 0 No 0 -1.00882 -1.00882
1340 1371-DWPAZ Female 0 Yes Yes 0 No No phone service DSL ... No Two year No Credit card (automatic) 56.05 0 No 0 -1.00882 -1.00882
3826 3213-VVOLG Male 0 Yes Yes 0 Yes Yes No ... No internet service Two year No Mailed check 25.35 0 No 0 -1.00882 -1.00882
5218 2923-ARZLG Male 0 Yes Yes 0 Yes No No ... No internet service One year Yes Mailed check 19.7 0 No 0 -1.00882 -1.00882
3331 7644-OMVMY Male 0 Yes Yes 0 Yes No No ... No internet service Two year No Mailed check 19.85 0 No 0 -1.00882 -1.00882
936 5709-LVOEQ Female 0 Yes Yes 0 Yes No DSL ... Yes Two year No Mailed check 80.85 0 No 0 -1.00882 -1.00882
753 3115-CZMZD Male 0 No Yes 0 Yes No No ... No internet service Two year No Mailed check 20.25 0 No 0 -1.00882 -1.00882
4380 2520-SGTTA Female 0 Yes Yes 0 Yes No No ... No internet service Two year No Mailed check 20.0 0 No 0 -1.00882 -1.00882
488 4472-LVYGI Female 0 Yes Yes 0 No No phone service DSL ... No Two year Yes Bank transfer (automatic) 52.55 0 No 0 -1.00882 -1.00882
1082 4367-NUYAO Male 0 Yes Yes 0 Yes Yes No ... No internet service Two year No Mailed check 25.75 0 No 0 -1.00882 -1.00882

25.3 Feature Transformation

Feature transformation involves manipulating a predictor variable in some way so as to improve its performance in the predictive model. A variety of considerations come into play when transforming features:

25.3.1 Scaling based on calculations

Sometimes, to change the data distribution before modeling, or to change the correlation between input and output variables, data scientists transform a feature with standard mathematical functions. Let’s transform TotalCharges with the logarithm, the square root, and the power of 2.

library(gridExtra)
churn_data_tmp <- churn_data %>%
        mutate(TotalChargesLog = log(TotalCharges),
              TotalChargesSqrt = sqrt(TotalCharges),
              TotalChargesPow2 = TotalCharges^2)

plot1 <- ggplot(churn_data_tmp, aes(x=TotalChargesLog)) + geom_histogram(bins = 15)
plot2 <- ggplot(churn_data_tmp, aes(x=TotalChargesSqrt)) + geom_histogram(bins = 15)
plot3 <- ggplot(churn_data_tmp, aes(x=TotalChargesPow2)) + geom_histogram(bins = 15)
grid.arrange(plot1, plot2, plot3, ncol=3) 

Let’s measure the skewness of each transformed variable with the moments package:

#install.packages("moments")
package 'moments' successfully unpacked and MD5 sums checked

library(moments)

skewness(churn_data_tmp$TotalCharges)
skewness(churn_data_tmp$TotalChargesLog)
skewness(churn_data_tmp$TotalChargesSqrt)
skewness(churn_data_tmp$TotalChargesPow2)
0.949325175892764
-0.75144252264496
0.302580803403444
1.80376407354267

Conclusion: different transformations give different data distributions and may improve model performance if you have found the correct form of the dependence between input and output parameters.


25.4 Feature Construction

The feature construction method helps create new features in the data, thereby increasing model accuracy and the overall quality of predictions. It is of two types:

(Added by me, since I have not found this technique described anywhere; everyone goes straight to modeling.) - [x] Evaluation - construction of new features on raw data from the data source.

25.4.1 Binning

Binning is done to create bins for continuous variables where they are converted to categorical variables.

Binning is the term used in scoring modeling for what is also known in machine learning as discretization: the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a scorecard.

Why Binning?

There are two types of binning: Unsupervised and Supervised.

Unsupervised binning includes automatic and manual binning. In automatic binning, bins are created without human involvement. In manual binning, we specify where the bins should be created.
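A minimal sketch of manual binning with base R’s cut(), using hypothetical tenure values and breakpoints chosen by hand:

```r
# Manual (unsupervised) binning: we choose the breakpoints ourselves
tenure <- c(1, 5, 12, 30, 48, 72)
bins <- cut(tenure, breaks = c(0, 12, 36, 72),
            labels = c("0-12", "13-36", "37-72"))
table(bins)
#  0-12 13-36 37-72
#     3     1     2
```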

Supervised binning creates bins for the continuous variable while also taking the target variable into consideration. Supervised discretization, or binning, divides a continuous feature into groups (bins) mapped to a target variable. The central idea is to find the cutpoints that maximize the difference between the groups.

In the past, analysts used to iteratively move from Fine Binning to Coarse Binning, a very time consuming process of finding manually and visually the right cutpoints (if ever). Nowadays with algorithms like ChiMerge or Recursive Partitioning, two out of several techniques available, analysts can quickly find the optimal cutpoints in seconds and evaluate the relationship with the target variable using metrics such as Weight of Evidence and Information Value.

There are many packages for creating new variables: smbinning, scorecard, rbin, InformationValue and other.

25.4.1.1 WOE binning: theory

Weight of evidence (WOE)

This is basically a technique that can be applied if we have a binary response variable and any kind of predictor variable. First we perform a reasonable binning on the response variable and then decide which form of the binary response we count as positive and which as negative. Then we calculate the percentage positive cases in each bin of the total of all positive cases. For example 20 positive cases in bin A out of 100 total positive cases in all bins equals 20 %. Next we calculate the percentage of negative cases in each bin of the total of all negative cases, for example 5 negative cases in bin A out of a total of 50 negative cases in all bins equals 10%. Then we calculate the WOE by dividing the bin percentages of positive cases by the bin percentage of negative cases, and take the logarithm. For the described example log(20/10).
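The worked example from the paragraph above can be reproduced directly:

```r
# Bin A: 20 of 100 total positive cases, 5 of 50 total negative cases
pct_pos <- 20 / 100   # 20% of all positive cases fall in bin A
pct_neg <- 5 / 50     # 10% of all negative cases fall in bin A
woe_A <- log(pct_pos / pct_neg)
woe_A  # log(2), about 0.693: positives supersede negatives in bin A
```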

Rule of thumb: if the WOE value is negative, negative cases supersede positive cases; if the WOE value is positive, positive cases supersede negative cases.

This serves the following purposes:

  • We eliminate non-linear relationships
  • We automatically scale all variables to some extent
  • We convert categorical variables to continuous variables
  • Missing data can be handled as just another factor value
  • We can build a standalone scorecard that could be applied manually by a person with a pen and a printout of all relevant variables.

It has the following disadvantages:

  • We always lose information through binning
  • The score along a single variable does not develop continuously but in steps
  • Binning requires manual revision
  • Calculating variable importance is not as straightforward as with classical logistic regression on regularly scaled variables

Information Value (IV)

By doing another sequence of calculations similar to the WOE calculation we can compute the IV. Classically this serves as a variable ranking method and allows us to perform feature selection, which is less computationally demanding than other methods.

|Information Value |Predictive Power                |
|:-----------------|:-------------------------------|
|< 0.02            |useless for prediction          |
|0.02 - 0.1        |weak predictor                  |
|0.1 - 0.3         |medium predictor                |
|0.3 - 0.5         |strong predictor                |
|> 0.5             |suspicious, too good to be true |

After the WOE values are calculated, they replace the original values in the dataset.
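The WOE and IV calculations can be reproduced by hand. The sketch below uses toy bin counts (not the churn data), chosen to match the 20%/10% example for bin A above:

```r
# Toy counts: positive (event) and negative (non-event) cases per bin
pos <- c(A = 20, B = 50, C = 30)  # 100 positive cases in total
neg <- c(A = 5,  B = 25, C = 20)  # 50 negative cases in total

dist_pos <- pos / sum(pos)  # each bin's share of all positives
dist_neg <- neg / sum(neg)  # each bin's share of all negatives

woe <- log(dist_pos / dist_neg)          # WOE per bin; bin A: log(0.20/0.10)
iv  <- sum((dist_pos - dist_neg) * woe)  # Information Value of the variable

round(woe, 3)  # A: 0.693, B: 0, C: -0.288
round(iv, 3)   # 0.098 -- a weak predictor by the table above
```

Note that this is the ln(Pos/Neg) convention, which is also the default in the scorecard package used below.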

25.4.1.2 scorecard package and woebin()-function

woebin() generates optimal binning for numerical, factor and categorical variables using methods such as tree-like segmentation or chi-square merge. woebin() can also apply custom breakpoints if the breaks_list argument is provided. The default WOE is defined as ln(Pos_i/Neg_i). If you prefer ln(Neg_i/Pos_i), set the positive argument to the negative class value, such as '0' or 'good'. If there is a zero-frequency class when calculating WOE, the zero is replaced by 0.99 to keep the WOE computable.
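The breaks_list customization can be sketched on synthetic data (the data frame, column names and cutpoints here are made up for illustration, not taken from the churn dataset):

```r
library(scorecard)
set.seed(1)

# toy data: one numeric predictor and a 0/1 target
df <- data.frame(x = runif(500, 0, 100))
df$y <- rbinom(500, 1, plogis((df$x - 50) / 20))

# force our own cutpoints instead of the automatic tree / chi-merge search
bins <- woebin(df, y = "y", breaks_list = list(x = c(25, 50, 75)))
bins$x$bin  # four bins: [-Inf,25), [25,50), [50,75), [75, Inf)
```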

# let's try to bin InternetService and TotalCharges from churn_data_tmp
churn_data_tmp <- churn_data %>%
        mutate(Churn = ifelse(Churn == "Yes", 1, 0))

bin_data <- churn_data_tmp %>% select(customerID, InternetService, TotalCharges, Churn)
head(bin_data)

#churn_data_tmp%>%select(Churn) %>% distinct()
A data.frame: 6 × 4
customerID InternetService TotalCharges Churn
<chr> <chr> <dbl> <dbl>
1 7010-BRBUU No 1734.65 0
2 9688-YGXVR Fiber optic 3973.20 0
3 9286-DOJGF Fiber optic 2869.85 1
4 6994-KERXL DSL 238.50 0
5 2181-UAESM DSL 119.50 0
6 4312-GVYNH DSL 3370.20 0
# install.packages("scorecard")
library(scorecard)

bins = woebin(bin_data, # dataset
              y = 'Churn', # target variable
              x = c("InternetService", "TotalCharges")) # variables for binning
[INFO] creating woe binning ... 
# let's view our bins
bins
$InternetService
A data.table: 3 × 12
variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
<chr> <chr> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
InternetService No 1291 0.2156699 1192 99 0.07668474 -1.4687362 0.30636195 0.5897126 No FALSE
InternetService DSL 2068 0.3454728 1671 397 0.19197292 -0.4177094 0.05417755 0.5897126 DSL FALSE
InternetService Fiber optic 2627 0.4388573 1536 1091 0.41530263 0.6774449 0.22917306 0.5897126 Fiber optic FALSE
$TotalCharges
A data.table: 4 × 12
variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
<chr> <chr> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
TotalCharges [-Inf,200) 1004 0.16772469 517 487 0.4850598 0.9597530 0.181721173 0.3097399 200 FALSE
TotalCharges [200,400) 488 0.08152355 327 161 0.3299180 0.3109760 0.008431865 0.3097399 400 FALSE
TotalCharges [400,3800) 2981 0.49799532 2266 715 0.2398524 -0.1339571 0.008651146 0.3097399 3800 FALSE
TotalCharges [3800, Inf) 1513 0.25275643 1289 224 0.1480502 -0.7304442 0.110935711 0.3097399 Inf FALSE
churn_data_tmp %>% filter(TotalCharges > 3800) %>% select(Churn) %>% group_by(Churn) %>% summarize(n())
A tibble: 2 × 2
Churn n()
<dbl> <int>
0 1289
1 224
bins$TotalCharges %>% knitr::kable() # better view for RStudio, need knitr to be installed


|variable     |bin         | count| count_distr|  neg| pos|   posprob|        woe|    bin_iv|  total_iv|breaks |is_special_values |
|:------------|:-----------|-----:|-----------:|----:|---:|---------:|----------:|---------:|---------:|:------|:-----------------|
|TotalCharges |[-Inf,200)  |  1004|   0.1677247|  517| 487| 0.4850598|  0.9597530| 0.1817212| 0.3097399|200    |FALSE             |
|TotalCharges |[200,400)   |   488|   0.0815236|  327| 161| 0.3299180|  0.3109760| 0.0084319| 0.3097399|400    |FALSE             |
|TotalCharges |[400,3800)  |  2981|   0.4979953| 2266| 715| 0.2398524| -0.1339571| 0.0086511| 0.3097399|3800   |FALSE             |
|TotalCharges |[3800, Inf) |  1513|   0.2527564| 1289| 224| 0.1480502| -0.7304442| 0.1109357| 0.3097399|Inf    |FALSE             |
# preview the plot
woebin_plot(bins$TotalCharges)
$TotalCharges

If TotalCharges is less than 200, it is the most risky group of customers: 48.5% of them are potential churners.

bins$InternetService %>% knitr::kable() 
woebin_plot(bins$InternetService)


|variable        |bin         | count| count_distr|  neg|  pos|   posprob|        woe|    bin_iv|  total_iv|breaks      |is_special_values |
|:---------------|:-----------|-----:|-----------:|----:|----:|---------:|----------:|---------:|---------:|:-----------|:-----------------|
|InternetService |No          |  1291|   0.2156699| 1192|   99| 0.0766847| -1.4687362| 0.3063620| 0.5897126|No          |FALSE             |
|InternetService |DSL         |  2068|   0.3454728| 1671|  397| 0.1919729| -0.4177094| 0.0541776| 0.5897126|DSL         |FALSE             |
|InternetService |Fiber optic |  2627|   0.4388573| 1536| 1091| 0.4153026|  0.6774449| 0.2291731| 0.5897126|Fiber optic |FALSE             |
$InternetService

Fiber optic InternetService customers are the most risky: 41.5% of them are potential churners.

The next stage is applying the bins to the variables:

bin_data_woe = woebin_ply(bin_data, bins) 
head(bin_data)
head(bin_data_woe) # compare how it changed
[INFO] converting into woe values ... 
A data.frame: 6 × 4
customerID InternetService TotalCharges Churn
<chr> <chr> <dbl> <dbl>
1 7010-BRBUU No 1734.65 0
2 9688-YGXVR Fiber optic 3973.20 0
3 9286-DOJGF Fiber optic 2869.85 1
4 6994-KERXL DSL 238.50 0
5 2181-UAESM DSL 119.50 0
6 4312-GVYNH DSL 3370.20 0
A data.table: 6 × 4
customerID Churn InternetService_woe TotalCharges_woe
<chr> <dbl> <dbl> <dbl>
7010-BRBUU 0 -1.4687362 -0.1339571
9688-YGXVR 0 0.6774449 -0.7304442
9286-DOJGF 1 0.6774449 -0.1339571
6994-KERXL 0 -0.4177094 0.3109760
2181-UAESM 0 -0.4177094 0.9597530
4312-GVYNH 0 -0.4177094 -0.1339571
bins$InternetService # "No" is replaced by WOE -1.4687362 (row 1)
A data.table: 3 × 12
variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
<chr> <chr> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
InternetService No 1291 0.2156699 1192 99 0.07668474 -1.4687362 0.30636195 0.5897126 No FALSE
InternetService DSL 2068 0.3454728 1671 397 0.19197292 -0.4177094 0.05417755 0.5897126 DSL FALSE
InternetService Fiber optic 2627 0.4388573 1536 1091 0.41530263 0.6774449 0.22917306 0.5897126 Fiber optic FALSE

So, both numerical and categorical variables can be binned by the woebin() function.

In a real-life project you should do binning in the following steps:

  1. Clean the data
  2. Split the data into train/test (+ prediction)
  3. Create bins on the train set (and save them if your model is not single-use)
  4. Apply the bins to all the sets you have

Let's see how to do it with our churn_data dataset.

library(caret)
library(gmodels)
set.seed(2021) # to fix split options

churn_data <- read.csv("../../data/telecom_users.csv") # read data
churn_data <- churn_data %>% 
    select(customerID, gender, PaymentMethod, TotalCharges, Churn) %>% # select some columns for test + target
    mutate(Churn = ifelse(Churn == "Yes", 1, 0)) # replace Churn Yes/No with 1/0 - Event/NonEvent
head(churn_data)
A data.frame: 6 × 5
customerID gender PaymentMethod TotalCharges Churn
<chr> <chr> <chr> <dbl> <dbl>
1 7010-BRBUU Male Credit card (automatic) 1734.65 0
2 9688-YGXVR Female Credit card (automatic) 3973.20 0
3 9286-DOJGF Female Bank transfer (automatic) 2869.85 1
4 6994-KERXL Male Electronic check 238.50 0
5 2181-UAESM Male Electronic check 119.50 0
6 4312-GVYNH Female Bank transfer (automatic) 3370.20 0
index = createDataPartition(churn_data$Churn, p = 0.60, list = FALSE) # select randomly indexes of the rows for train
train = churn_data[index, ]
test = churn_data[-index, ]
data_bins = woebin(train, # dataset
                   y = "Churn", # target variable
                   var_skip = "customerID") # skip the ID column
[INFO] creating woe binning ... 
data_bins
$gender
A data.table: 2 × 12
variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
<chr> <chr> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
gender Female 1770 0.4927617 1314 456 0.2576271 -0.02546718 0.0003176558 0.0006226087 Female FALSE
gender Male 1822 0.5072383 1335 487 0.2672887 0.02444876 0.0003049529 0.0006226087 Male FALSE
$PaymentMethod
A data.table: 3 × 12
variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
<chr> <chr> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
PaymentMethod Credit card (automatic)%,%Bank transfer (automatic) 1562 0.4348552 1312 250 0.1600512 -0.6249758 0.14385062 0.4324699 Credit card (automatic)%,%Bank transfer (automatic) FALSE
PaymentMethod Mailed check 822 0.2288419 665 157 0.1909976 -0.4106700 0.03472141 0.4324699 Mailed check FALSE
PaymentMethod Electronic check 1208 0.3363029 672 536 0.4437086 0.8067470 0.25389789 0.4324699 Electronic check FALSE
$TotalCharges
A data.table: 5 × 12
variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
<chr> <chr> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
TotalCharges missing 6 0.001670379 6 0 0.00000000 -0.7699879 0.0009365099 0.3222676 missing TRUE
TotalCharges [-Inf,400) 893 0.248608018 505 388 0.43449048 0.7682688 0.1693136398 0.3222676 400 FALSE
TotalCharges [400,3200) 1601 0.445712695 1207 394 0.24609619 -0.0877204 0.0033567382 0.3222676 3200 FALSE
TotalCharges [3200,6600) 850 0.236636971 711 139 0.16352941 -0.6003766 0.0727391340 0.3222676 6600 FALSE
TotalCharges [6600, Inf) 242 0.067371938 220 22 0.09090909 -1.2707632 0.0759215884 0.3222676 Inf FALSE
train_woe = woebin_ply(train, data_bins) 
head(train_woe)
test_woe = woebin_ply(test, data_bins) 
head(test_woe)
[INFO] converting into woe values ... 
A data.table: 6 × 5
customerID Churn gender_woe PaymentMethod_woe TotalCharges_woe
<chr> <dbl> <dbl> <dbl> <dbl>
7010-BRBUU 0 0.02444876 -0.6249758 -0.0877204
4367-NHWMM 0 -0.02546718 -0.4106700 0.7682688
8016-NCFVO 0 0.02444876 0.8067470 -0.6003766
4578-PHJYZ 0 0.02444876 0.8067470 -0.6003766
2091-MJTFX 1 -0.02546718 -0.6249758 -0.0877204
2277-DJJDL 0 0.02444876 0.8067470 -0.6003766
[INFO] converting into woe values ... 
A data.table: 6 × 5
customerID Churn gender_woe PaymentMethod_woe TotalCharges_woe
<chr> <dbl> <dbl> <dbl> <dbl>
9688-YGXVR 0 -0.02546718 -0.6249758 -0.6003766
9286-DOJGF 1 -0.02546718 -0.6249758 -0.0877204
6994-KERXL 0 0.02444876 0.8067470 0.7682688
2181-UAESM 0 0.02444876 0.8067470 0.7682688
4312-GVYNH 0 -0.02546718 -0.6249758 -0.6003766
2495-KZNFB 0 -0.02546718 0.8067470 -0.0877204

Now our data is ready for modeling.


25.4.1.3 Encoding

Encoding is the process in which numerical variables or features are created from categorical variables. It is a widely used method in the industry and in every model building process. It is of two types: Label Encoding and One-hot Encoding.

Label Encoding involves assigning each label a unique integer or value based on alphabetical ordering. It is the most popular and widely used encoding.

One-hot Encoding involves creating additional features or variables on the basis of unique values in categorical variables i.e. every unique value in the category will be added as a new feature.
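The difference between the two is easy to see in base R on a toy vector (not the churn data):

```r
x <- c("red", "green", "blue", "green")

# label encoding: one integer per category, in alphabetical level order
as.integer(factor(x))
# [1] 3 2 1 2

# one-hot encoding: one 0/1 column per category ("- 1" drops the intercept)
model.matrix(~ x - 1)
# columns xblue, xgreen, xred; exactly one 1 per row
```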

25.4.1.3.1 Label encoding
# Lets use client churn dataset from telco: https://www.kaggle.com/blastchar/telco-customer-churn
churn_data <- read.csv("data/telecom_users.csv")
head(churn_data)
A data.frame: 6 × 22
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
3 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
4 6739 6994-KERXL Male 0 No No 4 Yes No DSL No No No Yes Month-to-month Yes Electronic check 55.9 238.50 No
5 432 2181-UAESM Male 0 No No 2 Yes No DSL Yes No No No Month-to-month No Electronic check 53.45 119.50 No
6 2215 4312-GVYNH Female 0 Yes No 70 No No phone service DSL Yes Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No

25.4.1.3.2 Encoding with factors

A good way to encode labels in R is as factors. This approach is used for categorical data encoding. The easiest way is to convert the character values to a factor and then convert the factor to numeric:

churn_data <- churn_data %>% mutate(Partner = as.factor(Partner),
                                   Dependents = as.factor(Dependents))
head(churn_data) # data types changed for selected fields
A data.frame: 6 × 22
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<int> <chr> <chr> <int> <fct> <fct> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
3 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
4 6739 6994-KERXL Male 0 No No 4 Yes No DSL No No No Yes Month-to-month Yes Electronic check 55.9 238.50 No
5 432 2181-UAESM Male 0 No No 2 Yes No DSL Yes No No No Month-to-month No Electronic check 53.45 119.50 No
6 2215 4312-GVYNH Female 0 Yes No 70 No No phone service DSL Yes Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No
str(churn_data$Partner) # check levels
 Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 1 1 1 ...
# convert to integers
churn_data <- churn_data %>% mutate(Partner = as.integer(Partner),
                                   Dependents = as.integer(Dependents))
head(churn_data) # data types changed for selected fields
A data.frame: 6 × 22
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<int> <chr> <chr> <int> <int> <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1869 7010-BRBUU Male 0 2 2 72 Yes Yes No No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No
2 4528 9688-YGXVR Female 0 1 1 44 Yes No Fiber optic Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
3 6344 9286-DOJGF Female 1 2 1 38 Yes Yes Fiber optic No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
4 6739 6994-KERXL Male 0 1 1 4 Yes No DSL No No No Yes Month-to-month Yes Electronic check 55.9 238.50 No
5 432 2181-UAESM Male 0 1 1 2 Yes No DSL Yes No No No Month-to-month No Electronic check 53.45 119.50 No
6 2215 4312-GVYNH Female 0 2 1 70 No No phone service DSL Yes Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No

But you should check the factor-to-number conversion to keep the numbering synchronized across the train/test/prediction sets. In this case you should save the factor levels somewhere. A better way is to use encoders that are fitted once and then reused.
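A minimal base-R sketch of saving the train levels and reusing them on the test set (toy values; the unseen "Satellite" category is hypothetical):

```r
train_vals <- c("DSL", "Fiber optic", "No", "DSL")
test_vals  <- c("No", "DSL", "Satellite")  # "Satellite" never appeared in train

train_levels <- sort(unique(train_vals))   # save these levels with the model
train_enc <- as.integer(factor(train_vals, levels = train_levels))
test_enc  <- as.integer(factor(test_vals,  levels = train_levels))

train_enc  # 1 2 3 1
test_enc   # 3 1 NA -- the unseen category becomes NA instead of a wrong code
```

An NA here is a signal to handle the new category explicitly, rather than letting a refitted factor silently renumber all the other categories.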


25.4.1.3.3 Encoding with LabelEncoder from superml package
# install.packages("superml") # you also need the R6 package
also installing the dependency 'BH'

package 'BH' successfully unpacked and MD5 sums checked
package 'superml' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    D:\Temp\Rtmpum9OKD\downloaded_packages
library(superml)

label <- LabelEncoder$new() # create new encoder
print(label$fit(churn_data$InternetService)) # fit the encoder
Loading required package: R6
[1] TRUE
# encode data
churn_data$InternetService <- label$fit_transform(churn_data$InternetService) 
# do not re-run this, because InternetService is now an integer
head(churn_data)
A data.frame: 6 × 22
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<int> <chr> <chr> <int> <int> <int> <int> <chr> <chr> <dbl> ... <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1869 7010-BRBUU Male 0 2 2 72 Yes Yes 0 ... No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No
2 4528 9688-YGXVR Female 0 1 1 44 Yes No 1 ... Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
3 6344 9286-DOJGF Female 1 2 1 38 Yes Yes 1 ... No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
4 6739 6994-KERXL Male 0 1 1 4 Yes No 2 ... No No No Yes Month-to-month Yes Electronic check 55.9 238.50 No
5 432 2181-UAESM Male 0 1 1 2 Yes No 2 ... Yes No No No Month-to-month No Electronic check 53.45 119.50 No
6 2215 4312-GVYNH Female 0 2 1 70 No No phone service 2 ... Yes Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No

Now you can save your encoding configuration and use it for test/prediction sets later.


25.4.1.4 One-hot encoding

# Lets use client churn dataset from telco: https://www.kaggle.com/blastchar/telco-customer-churn
churn_data <- read.csv("data/telecom_users.csv")
head(churn_data)
A data.frame: 6 × 22
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
3 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
4 6739 6994-KERXL Male 0 No No 4 Yes No DSL No No No Yes Month-to-month Yes Electronic check 55.9 238.50 No
5 432 2181-UAESM Male 0 No No 2 Yes No DSL Yes No No No Month-to-month No Electronic check 53.45 119.50 No
6 2215 4312-GVYNH Female 0 Yes No 70 No No phone service DSL Yes Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No
# check possible Gender values
gmodels::CrossTable(churn_data$gender)

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  5986 

 
          |    Female |      Male | 
          |-----------|-----------|
          |      2936 |      3050 | 
          |     0.490 |     0.510 | 
          |-----------|-----------|



 

Let's create two additional variables, Male and Female, encoded as 1/0:

# if gender is factor
#churn_data <- churn_data %>% 
   #     mutate(gender = as.character(gender)) # convert Gender to character from Factor

churn_data <- churn_data %>% 
        mutate(
            # create Male column
            Male = ifelse(gender == "Male", 1, 0),
            # create Female column
            Female = ifelse(gender == "Female", 1, 0))

churn_data |> select(gender, Male, Female) |> head()
A data.frame: 6 × 3
gender Male Female
<chr> <dbl> <dbl>
1 Male 1 0
2 Female 0 1
3 Female 0 1
4 Male 1 0
5 Male 1 0
6 Female 0 1

Let's create dummy variables for the InternetService column. So, what values does it contain?

gmodels::CrossTable(churn_data$InternetService)

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  5986 

 
            |         DSL | Fiber optic |          No | 
            |-------------|-------------|-------------|
            |        2068 |        2627 |        1291 | 
            |       0.345 |       0.439 |       0.216 | 
            |-------------|-------------|-------------|



 

The dummyVars() function from the caret package can be used to generate a complete (less than full rank parameterized) set of dummy variables from one or more factors. The function takes a formula and a data set and outputs an object that can be used to create the dummy variables via the predict method.

library(caret)

churn_data <- read.csv("data/telecom_users.csv")

dummy <- dummyVars(" ~ InternetService", data = churn_data)
new_df <- data.frame(predict(dummy, newdata = churn_data)) # predict dummy variables into a data.frame
new_df %>% head()
A data.frame: 6 × 3
InternetServiceDSL InternetServiceFiber.optic InternetServiceNo
<dbl> <dbl> <dbl>
1 0 0 1
2 0 1 0
3 0 1 0
4 1 0 0
5 1 0 0
6 1 0 0

If you want to create dummy variables for all categorical columns, just use the " ~ ." formula. Note that every character column is expanded, including customerID, which produces one column per customer.

dummy <- dummyVars(" ~ .", data = churn_data)
new_df <- data.frame(predict(dummy, newdata = churn_data)) # predict dummy variables into a data.frame
new_df %>% head()
A data.frame: 6 × 7560
X customerID0002.ORFBO customerID0003.MKNFE customerID0004.TLHLJ customerID0011.IGKFF customerID0013.EXCHZ customerID0013.MHZWF customerID0013.SMEOE customerID0014.BMAQU customerID0015.UOCOJ MonthlyCharges99.7 MonthlyCharges99.75 MonthlyCharges99.8 MonthlyCharges99.85 MonthlyCharges99.9 MonthlyCharges99.95 MonthlyChargesNULL TotalCharges ChurnNo ChurnYes
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1869 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1734.65 1 0
2 4528 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3973.20 1 0
3 6344 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2869.85 0 1
4 6739 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 238.50 1 0
5 432 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 119.50 1 0
6 2215 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3370.20 1 0

You can exclude some variables, for example Churn, because it is the target variable:

dummy <- dummyVars(" ~. -Churn", data = churn_data)
new_df <- data.frame(predict(dummy, newdata = churn_data)) # predict dummy variables into a data.frame
new_df %>% head()
A data.frame: 6 × 7558
X customerID0002.ORFBO customerID0003.MKNFE customerID0004.TLHLJ customerID0011.IGKFF customerID0013.EXCHZ customerID0013.MHZWF customerID0013.SMEOE customerID0014.BMAQU customerID0015.UOCOJ ... MonthlyCharges99.6 MonthlyCharges99.65 MonthlyCharges99.7 MonthlyCharges99.75 MonthlyCharges99.8 MonthlyCharges99.85 MonthlyCharges99.9 MonthlyCharges99.95 MonthlyChargesNULL TotalCharges
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ... <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1869 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1734.65
2 4528 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 3973.20
3 6344 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2869.85
4 6739 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 238.50
5 432 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 119.50
6 2215 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 3370.20

If you need only a few variables, combine them with the + operator:

dummy <- dummyVars(" ~ InternetService + PhoneService", data = churn_data)
new_df <- data.frame(predict(dummy, newdata = churn_data)) # predict dummy variables into a data.frame
new_df %>% head()
A data.frame: 6 × 5
InternetServiceDSL InternetServiceFiber.optic InternetServiceNo PhoneServiceNo PhoneServiceYes
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 1 0 1
2 0 1 0 0 1
3 0 1 0 0 1
4 1 0 0 0 1
5 1 0 0 0 1
6 1 0 0 1 0

25.4.2 Evaluate data on raw dataset

The previous data transformation approaches used prepared datasets. But sometimes you need to create new variables from raw data that cannot be attached to your existing dataset and used for modeling as-is.

Let's look at an example with customer transaction data.

Preview the data file in Excel before reading it into R!
# read transaction data from dataset
# explore file in Excel before reading to check sheet numbers and tables structure
library(xlsx)
demographics <- read.xlsx("data/transactions.xlsx", sheetIndex = 1)
head(demographics)
java.home option: 

JAVA_HOME environment variable: C:\Program Files\Java\jdk-19\bin;C:\Program Files\Common Files\Oracle\Java\javapath\java.exe;C:\Program Files\Common Files\Oracle\Java\javapath\;

Warning message in fun(libname, pkgname):
"Java home setting is INVALID, it will be ignored.
Please do NOT set it unless you want to override system settings."
A data.frame: 4 × 4
CustomerID Gender Email VisitsLastYear
<dbl> <chr> <chr> <dbl>
1 787987456 Male Yes 12
2 456415151 Male NULL 0
3 215454555 Female No 16
4 985121122 Female No 4
# read transactions
transactions <- read.xlsx("data/transactions.xlsx", sheetIndex = 2)
head(transactions)
A data.frame: 6 × 7
CustomerID ContractID Date Time ContractorID UsdEquiv_sum Type
<dbl> <dbl> <date> <dbl> <chr> <dbl> <chr>
1 215454555 19065798 2019-03-15 114537 NULL 6293.7111 credet
2 215454555 19065798 2019-04-05 102525 NULL 914.9459 credet
3 215454555 19065798 2019-05-11 80833 NULL -4655.9111 debet
4 215454555 19065798 2019-05-16 74606 NULL -13900.6889 debet
5 215454555 19065798 2019-05-30 104506 NULL 2102.0570 credet
6 215454555 19065798 2019-08-23 122656 NULL -16244.2333 debet

Our next task is to construct new features based on the transaction history:

# let's find the maximum transaction date for each client
library(tidyverse) # includes magrittr, dplyr
max_dates <- transactions %>%
        group_by(CustomerID) %>%
        summarise(MaxDate = max(Date))
max_dates
-- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.2 --
v tibble  3.1.8     v purrr   0.3.5
v tidyr   1.2.1     v stringr 1.4.1
v readr   2.1.3     v forcats 0.5.2
-- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
x gridExtra::combine() masks dplyr::combine()
x dplyr::filter()      masks stats::filter()
x dplyr::lag()         masks stats::lag()
x purrr::lift()        masks caret::lift()
A tibble: 4 × 2
CustomerID MaxDate
<dbl> <date>
215454555 2021-04-30
456415151 2021-01-22
787987456 2019-12-06
985121122 2020-09-09
library(lubridate) #for datetime manipulation
current_date <- ymd("2021-04-30")
max_dates <- max_dates %>%
        mutate(DaysDiff = as.period(current_date - MaxDate) %>% day()) # find period and convert it to days
max_dates

Attaching package: 'lubridate'


The following objects are masked from 'package:base':

    date, intersect, setdiff, union

A tibble: 4 × 3
CustomerID MaxDate DaysDiff
<dbl> <date> <dbl>
215454555 2021-04-30 0
456415151 2021-01-22 98
787987456 2019-12-06 511
985121122 2020-09-09 233
# let's calculate the Churn feature
max_dates <- max_dates %>%
        mutate(Churn = ifelse(DaysDiff > 60, 1, 0))
max_dates
A tibble: 4 × 4
CustomerID MaxDate DaysDiff Churn
<dbl> <date> <dbl> <dbl>
215454555 2021-04-30 0 0
456415151 2021-01-22 98 1
787987456 2019-12-06 511 1
985121122 2020-09-09 233 1

Finally, you can merge the code into one query:

churn_eval <- transactions %>%
        group_by(CustomerID) %>%
        summarise(MaxDate = max(Date)) %>%
        mutate(DaysDiff = as.period(current_date - MaxDate) %>% day(),
               Churn = ifelse(DaysDiff > 60, 1, 0)) %>%
        select(CustomerID, Churn) # Select only CustomerID and new feature / target
churn_eval # we will merge it with demographics later
A tibble: 4 × 2
CustomerID Churn
<dbl> <dbl>
215454555 0
456415151 1
787987456 1
985121122 1

25.4.2.1 Task. AverageContractSum3M_Credet

TASK: calculate the average contract sum for each customer by credet type for the last 3 months before the Churn/NotChurn event.

To calculate this feature we should follow these steps:

  1. Find the max transaction date
  2. Find the transaction date 3 months ago (max transaction date - 3 months)
  3. Filter only the records in the range (-3 months, max transaction date)
  4. Group the data by contract and find each contract's sum
  5. Group the data by customer and find each customer's average
avgContractSum3m <- transactions %>%
        filter(Type == "credet") %>% # only credet
        group_by(CustomerID) %>% 
        mutate(MaxDate = max(Date), # get the max transaction date
               Month3Date = MaxDate %m+% months(-3)) %>% # get the date 3 months before the max transaction date
       filter(Date >= Month3Date) %>% # keep only the transactions from the last 3 months
       group_by(CustomerID, ContractID) %>% # group by customer and contract
       summarize(ContractSum = sum(UsdEquiv_sum), .groups = 'drop') %>% # find sum by each contract
       group_by(CustomerID) %>% # group by customer
       summarise(AverageContractSum3M_Credet = mean(ContractSum)) # find mean by each customer

avgContractSum3m
A tibble: 4 × 2
CustomerID AverageContractSum3M_Credet
<dbl> <dbl>
215454555 45541.3335
456415151 1271.5930
787987456 116.6667
985121122 4444.4444

25.4.2.2 Task. Find the average credet sum increase over the last 3 months

  1. Get the date ranges for months -1, -2, -3 by credet:
monthByMonth <- transactions %>%
        filter(Type == "credet") %>%
        select(CustomerID, Date, UsdEquiv_sum) %>%
        group_by(CustomerID) %>%
        mutate(MaxDate = max(Date),
               Month1Before = MaxDate %m+% months(-1),
               Month2Before = MaxDate %m+% months(-2),
               Month3Before = MaxDate %m+% months(-3)) 
head(monthByMonth)
A grouped_df: 6 × 7
CustomerID Date UsdEquiv_sum MaxDate Month1Before Month2Before Month3Before
<dbl> <date> <dbl> <date> <date> <date> <date>
215454555 2019-03-15 6293.7111 2021-04-30 2021-03-30 2021-02-28 2021-01-30
215454555 2019-04-05 914.9459 2021-04-30 2021-03-30 2021-02-28 2021-01-30
215454555 2019-05-30 2102.0570 2021-04-30 2021-03-30 2021-02-28 2021-01-30
215454555 2019-10-09 10854.7630 2021-04-30 2021-03-30 2021-02-28 2021-01-30
215454555 2019-11-27 639.5389 2021-04-30 2021-03-30 2021-02-28 2021-01-30
215454555 2019-12-20 2654.9459 2021-04-30 2021-03-30 2021-02-28 2021-01-30
  2. A big query calculates the sums for each month and finds the increases:
credetInc <- monthByMonth %>%
    filter(Date >= Month1Before & Date <= MaxDate) %>% # data in range from Month1Before to MaxDate
    group_by(CustomerID) %>%
    summarise(M1Sum = sum(UsdEquiv_sum)) %>% # sum for month -1
    left_join(monthByMonth %>% # join with data about month -2
              filter(Date >= Month2Before & Date <= Month1Before) %>% # data in range from Month2Before to Month1Before
              group_by(CustomerID) %>%
              summarise(M2Sum = sum(UsdEquiv_sum)), by = "CustomerID") %>%
            left_join(monthByMonth %>% # join with data about month -3
              filter(Date >= Month3Before & Date <= Month2Before) %>% # data in range from Month3Before to Month2Before
              group_by(CustomerID) %>%
              summarise(M3Sum = sum(UsdEquiv_sum)), by = "CustomerID") %>%
    rowwise %>% # calculate separately for each row, not for columns
    mutate(AverageIncrease3M_Credet = mean(c(M1Sum/M2Sum, M2Sum/M3Sum), na.rm = T)) %>% # the increase is the average of the month-to-month sum ratios
    select(CustomerID, AverageIncrease3M_Credet)
        
head(credetInc)
A rowwise_df: 4 × 2
CustomerID AverageIncrease3M_Credet
<dbl> <dbl>
215454555 8.3372521811
456415151 NaN
787987456 0.0003175611
985121122 NaN

Join all the tables:

final_set <- demographics %>% 
        left_join(avgContractSum3m , by = "CustomerID") %>%
        left_join(credetInc , by = "CustomerID") %>%
        left_join(churn_eval, by = "CustomerID")
final_set
A data.frame: 4 × 7
CustomerID Gender Email VisitsLastYear AverageContractSum3M_Credet AverageIncrease3M_Credet Churn
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
787987456 Male Yes 12 116.6667 0.0003175611 1
456415151 Male NULL 0 1271.5930 NaN 1
215454555 Female No 16 45541.3335 8.3372521811 0
985121122 Female No 4 4444.4444 NaN 1

In the next stages this data can be transformed with scaling, encoding, and binning.


25.5 References

  1. Feature Engineering in R Programming by (dhruv5819?)
  2. What is Feature Engineering? by Tim Bok
  3. Feature Scaling-Why it is required? by Rahul Saini
  4. Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization by ANIRUDDHA BHANDARI
  5. R Package ‘smbinning’: Optimal Binning for Scoring Modeling by Herman Jopia
  6. Garcia, S. et al (2013) A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 4, April 2013.
  7. An Overview on the Landscape of R Packages for Credit Scoring by Gero Szepannek