25  Feature engineering in R


You need these packages for code execution:

# install.packages("ggplot2")
install.packages("gridExtra")
# install.packages("scorecard")
# install.packages("caret")
# install.packages("gmodels")
# install.packages("superml") # you also need the R6 package
package 'gridExtra' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    D:\Temp\Rtmpum9OKD\downloaded_packages
invisible(Sys.setlocale("LC_ALL", "Ukrainian"))
invisible(options(warn=-1))

25.1 What’s Feature Engineering?

Feature engineering is the most important technique used in creating machine learning models.

Feature engineering is a broad term covering the many operations performed on variables (features) to fit them to an algorithm. It helps increase the accuracy of the model and thereby improves the quality of the predictions. Feature-engineered machine learning models perform better on data than basic machine learning models. The main aspects of feature engineering are the following [1]:

  1. Feature Scaling: performed to bring the features onto the same scale (e.g. for Euclidean distance).
  2. Feature Transformation: performed to normalize the data (feature) by applying a function.
  3. Feature Construction: performed to create new features based on the original descriptors, to improve the accuracy of the predictive model.

A "feature" in the context of predictive modeling is just another name for a predictor variable. Feature engineering is the general term for creating and manipulating predictors so that a good predictive model can be created.


25.2 Feature Scaling

Feature Scaling refers to putting the values in the same range or same scale so that no variable is dominated by the other.

Most of the time, your dataset will contain features that vary widely in magnitude, units and range. Since most machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem.

If left alone, these algorithms take in only the magnitude of features, neglecting the units. The results would vary greatly between different units, e.g. 5 kg and 5000 g. Features with high magnitudes will weigh far more in the distance calculations than features with low magnitudes. To suppress this effect, we need to bring all features to the same level of magnitude. This can be achieved by scaling.
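A quick sketch of this effect on a toy pair of points (hypothetical weight/height values, not from the churn dataset used below):

```r
# Two observations described by weight in grams and height in metres
a <- c(weight_g = 5000, height_m = 1.8)
b <- c(weight_g = 5100, height_m = 1.2)

# Euclidean distance: the 100 g weight gap dominates,
# the 0.6 m height gap is almost invisible
d <- sqrt(sum((a - b)^2))
d  # about 100.0018
```

Without scaling, the distance is essentially determined by the gram-scale feature alone.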

Here’s the curious thing about feature scaling: it significantly improves the performance of some machine learning algorithms and does nothing at all for others.

Also, what’s the difference between normalization and standardization? These are two of the most commonly used feature scaling techniques in machine learning, but a level of ambiguity exists in their understanding.


25.2.1 Normalization

25.2.1.1 Theory

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.

Here’s the formula for normalization:

\(X' = \frac{X-X_{min}}{X_{max} - X_{min}}\)

Here, \(X_{max}\) and \(X_{min}\) are the maximum and the minimum values of the feature respectively.

When the value of \(X\) is the minimum value in the column, the numerator will be \(0\), and hence \(X'\) is \(0\).

On the other hand, when the value of \(X\) is the maximum value in the column, the numerator is equal to the denominator and thus the value of \(X'\) is \(1\).

If the value of \(X\) is between the minimum and the maximum value, then the value of \(X'\) is between \(0\) and \(1\).
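The formula above can be checked on a small toy vector (values chosen purely for illustration):

```r
# Min-max normalization of a toy vector
x <- c(2, 4, 6, 10)
x_norm <- (x - min(x)) / (max(x) - min(x))
x_norm  # 0.00 0.25 0.50 1.00: the minimum maps to 0, the maximum to 1
```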


25.2.1.2 Practice

So, let’s implement our own normalization function.

# Let's use the client churn dataset from Telco: https://www.kaggle.com/blastchar/telco-customer-churn
churn_data <- read.csv("data/telecom_users.csv")
head(churn_data)
A data.frame: 6 × 22
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
3 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
4 6739 6994-KERXL Male 0 No No 4 Yes No DSL No No No Yes Month-to-month Yes Electronic check 55.9 238.50 No
5 432 2181-UAESM Male 0 No No 2 Yes No DSL Yes No No No Month-to-month No Electronic check 53.45 119.50 No
6 2215 4312-GVYNH Female 0 Yes No 70 No No phone service DSL Yes Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No
str(churn_data)
'data.frame':   5986 obs. of  22 variables:
 $ X               : int  1869 4528 6344 6739 432 2215 5260 6001 1480 5137 ...
 $ customerID      : chr  "7010-BRBUU" "9688-YGXVR" "9286-DOJGF" "6994-KERXL" ...
 $ gender          : chr  "Male" "Female" "Female" "Male" ...
 $ SeniorCitizen   : int  0 0 1 0 0 0 0 0 0 1 ...
 $ Partner         : chr  "Yes" "No" "Yes" "No" ...
 $ Dependents      : chr  "Yes" "No" "No" "No" ...
 $ tenure          : int  72 44 38 4 2 70 33 1 39 55 ...
 $ PhoneService    : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ MultipleLines   : chr  "Yes" "No" "Yes" "No" ...
 $ InternetService : chr  "No" "Fiber optic" "Fiber optic" "DSL" ...
 $ OnlineSecurity  : chr  "No internet service" "No" "No" "No" ...
 $ OnlineBackup    : chr  "No internet service" "Yes" "No" "No" ...
 $ DeviceProtection: chr  "No internet service" "Yes" "No" "No" ...
 $ TechSupport     : chr  "No internet service" "No" "No" "No" ...
 $ StreamingTV     : chr  "No internet service" "Yes" "No" "No" ...
 $ StreamingMovies : chr  "No internet service" "No" "No" "Yes" ...
 $ Contract        : chr  "Two year" "Month-to-month" "Month-to-month" "Month-to-month" ...
 $ PaperlessBilling: chr  "No" "Yes" "Yes" "Yes" ...
 $ PaymentMethod   : chr  "Credit card (automatic)" "Credit card (automatic)" "Bank transfer (automatic)" "Electronic check" ...
 $ MonthlyCharges  : chr  "24.1" "88.15" "74.95" "55.9" ...
 $ TotalCharges    : num  1735 3973 2870 238 120 ...
 $ Churn           : chr  "No" "No" "Yes" "No" ...
# next check summary of values
summary(churn_data)

# check TotalCharges field
       X         customerID           gender          SeniorCitizen   
 Min.   :   0   Length:5986        Length:5986        Min.   :0.0000  
 1st Qu.:1777   Class :character   Class :character   1st Qu.:0.0000  
 Median :3546   Mode  :character   Mode  :character   Median :0.0000  
 Mean   :3534                                         Mean   :0.1614  
 3rd Qu.:5292                                         3rd Qu.:0.0000  
 Max.   :7042                                         Max.   :1.0000  
                                                                      
   Partner           Dependents            tenure      PhoneService      
 Length:5986        Length:5986        Min.   : 0.00   Length:5986       
 Class :character   Class :character   1st Qu.: 9.00   Class :character  
 Mode  :character   Mode  :character   Median :29.00   Mode  :character  
                                       Mean   :32.47                     
                                       3rd Qu.:56.00                     
                                       Max.   :72.00                     
                                                                         
 MultipleLines      InternetService    OnlineSecurity     OnlineBackup      
 Length:5986        Length:5986        Length:5986        Length:5986       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 DeviceProtection   TechSupport        StreamingTV        StreamingMovies   
 Length:5986        Length:5986        Length:5986        Length:5986       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   Contract         PaperlessBilling   PaymentMethod      MonthlyCharges    
 Length:5986        Length:5986        Length:5986        Length:5986       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  TotalCharges       Churn          
 Min.   :  18.8   Length:5986       
 1st Qu.: 404.3   Class :character  
 Median :1412.2   Mode  :character  
 Mean   :2298.1                     
 3rd Qu.:3847.0                     
 Max.   :8684.8                     
 NA's   :10                         

Next, let’s build a histogram of TotalCharges with ggplot2 and check how the values are distributed:

#install.packages("ggplot2")
library(ggplot2)

ggplot() - function for building charts

data - first parameter - the dataset

aes() - aesthetics - construct aesthetic mappings (axes)

geom_CHART_TYPE() - sets the chart type

geom_histogram() - histograms and frequency polygons

theme_set() - theme configuration

library(ggplot2)
ggplot(data = churn_data, aes(x=TotalCharges)) + geom_histogram(bins = 15)

Warning message:
"Removed 10 rows containing non-finite values (stat_bin)."

# Let's replace missing values in TotalCharges with 0
library(magrittr) # for the %>% pipe, if not already loaded
library(dplyr) # for the mutate function
churn_data <- churn_data %>%
            mutate(TotalCharges = ifelse(is.na(TotalCharges), 0 , TotalCharges))

ggplot(churn_data, aes(x=TotalCharges)) + geom_histogram(bins = 15)

Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

# Let's implement our own normalization function using the formula explained earlier
normalizeData <- function(x) {
    return ((x - min(x)) / (max(x) - min(x)))
}
library(dplyr)
# Normalize TotalCharges
churn_data <- churn_data |>
    filter(!is.na(TotalCharges)) |>
    mutate(TotalChargesNorm = normalizeData(TotalCharges))

churn_data %>% head() # check the last columns
A data.frame: 6 × 23
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn TotalChargesNorm
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No 0.19799792
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No 0.45631202
3 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes 0.32899261
4 6739 6994-KERXL Male 0 No No 4 Yes No DSL No No Yes Month-to-month Yes Electronic check 55.9 238.50 No 0.02535195
5 432 2181-UAESM Male 0 No No 2 Yes No DSL No No No Month-to-month No Electronic check 53.45 119.50 No 0.01162012
6 2215 4312-GVYNH Female 0 Yes No 70 No No phone service DSL Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No 0.38672975
#summary for the last field
summary(churn_data$TotalChargesNorm)

# values range from 0 to 1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.04449 0.16078 0.26301 0.44174 1.00000 
# And lets make a histogram
ggplot(churn_data, aes(x=TotalChargesNorm)) + geom_histogram(bins = 15)

We observe identical histogram shapes; only the TotalCharges / TotalChargesNorm axis is rescaled.

This shows that normalization did not affect the shape of the distribution of the rescaled data.


25.2.2 Standardization

25.2.2.1 Theory

Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

Here’s the formula for standardization:

\(X' = \frac{X-\mu}{\sigma}\)

Here \(\mu\) is the mean of the feature values and \(\sigma\) is the standard deviation of the feature values. Note that in this case the values are not restricted to a particular range.
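A minimal check of the formula on a toy vector (illustrative values): after standardization the mean is 0 and the standard deviation is 1.

```r
x <- c(10, 20, 30)
x_stand <- (x - mean(x)) / sd(x)
x_stand        # -1 0 1
mean(x_stand)  # 0
sd(x_stand)    # 1
```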

Now, the big question in your mind must be when should we use normalization and when should we use standardization?

Normalization vs. standardization is an eternal question among machine learning newcomers. Let me elaborate on the answer in this section.

Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful in algorithms that do not assume any distribution of the data like K-Nearest Neighbors and Neural Networks.

Standardization, on the other hand, can be helpful in cases where the data follows a Gaussian distribution. However, this does not have to be necessarily true. Also, unlike normalization, standardization does not have a bounding range. So even if you have outliers in your data, standardization will not compress them into a fixed interval (though they will still influence the estimated mean and standard deviation). However, at the end of the day, the choice between normalization and standardization depends on your problem and the machine learning algorithm you are using. There is no hard and fast rule to tell you when to normalize or standardize your data. You can always start by fitting your model to raw, normalized and standardized data and compare the performance for best results.

It is a good practice to fit the scaler on the training data and then use it to transform the testing data. This would avoid any data leakage during the model testing process. Also, the scaling of target values is generally not required.


25.2.2.2 Practice

Let’s write our own function for standardization:

# compute the mean and standard deviation from the data itself

standartize <- function(data) {
    sdev <- sd(data, na.rm = TRUE)
    (data - mean(data, na.rm = TRUE)) / sdev
}
# Standardize TotalCharges
churn_data <- churn_data %>%
    mutate(TotalChargesStand = standartize(TotalCharges))

churn_data %>% head() # check the last columns
A data.frame: 6 × 24
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn TotalChargesNorm TotalChargesStand
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No 0.19799792 -0.2477481
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No 0.45631202 0.7366076
3 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes 0.32899261 0.2514325
4 6739 6994-KERXL Male 0 No No 4 Yes No DSL No Yes Month-to-month Yes Electronic check 55.9 238.50 No 0.02535195 -0.9056488
5 432 2181-UAESM Male 0 No No 2 Yes No DSL No No Month-to-month No Electronic check 53.45 119.50 No 0.01162012 -0.9579766
6 2215 4312-GVYNH Female 0 Yes No 70 No No phone service DSL No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No 0.38672975 0.4714509

Let’s compare the data distributions for normalization and standardization.

#install.packages("gridExtra") # to view 2+ ggplots 
library(gridExtra)
n_plot <- ggplot(churn_data, aes(x=TotalChargesNorm)) + geom_histogram(bins = 15)
s_plot <- ggplot(churn_data, aes(x=TotalChargesStand)) + geom_histogram(bins = 15)
init_plot <- ggplot(churn_data, aes(x=TotalCharges)) + geom_histogram(bins = 15)
grid.arrange(n_plot, init_plot, s_plot, ncol=3) # from gridExtra

# the scale changed after standardization, but the shape of the distribution did not

Attaching package: 'gridExtra'


The following object is masked from 'package:dplyr':

    combine

So, let’s use the standard R scale() function and compare results:

# scale() centers by the mean and divides by the standard deviation by default
churn_data <- churn_data %>%
        mutate(TotalChargesScaled = as.numeric(scale(TotalCharges))) 

s1_plot <- ggplot(churn_data, aes(x=TotalChargesStand)) + geom_histogram(bins = 15)
s2_plot <- ggplot(churn_data, aes(x=TotalChargesScaled)) + geom_histogram(bins = 15)
grid.arrange(s1_plot, s2_plot, ncol=2) # it looks like we created the same function 

mean(churn_data$TotalCharges)
2298.06061746988

Looks like the default function gives the same result as ours.

head(churn_data) # check the last two columns, they are the same
A data.frame: 6 × 25
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService ... StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn TotalChargesNorm TotalChargesStand TotalChargesScaled
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> ... <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No ... No internet service Two year No Credit card (automatic) 24.1 1734.65 No 0.19973402 -0.2460559 -0.2460559
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic ... No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No 0.45748895 0.7382838 0.7382838
3 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic ... No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes 0.33044515 0.2531165 0.2531165
4 6739 6994-KERXL Male 0 No No 4 Yes No DSL ... Yes Month-to-month Yes Electronic check 55.9 238.50 No 0.02746177 -0.9039460 -0.9039460
5 432 2181-UAESM Male 0 No No 2 Yes No DSL ... No Month-to-month No Electronic check 53.45 119.50 No 0.01375967 -0.9562729 -0.9562729
6 2215 4312-GVYNH Female 0 Yes No 70 No No phone service DSL ... Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No 0.38805730 0.4731314 0.4731314

25.2.3 Scaling for train/test/validation/prediction

If you use scaling, the initial scaling parameters should be stored somewhere for future prediction data and re-applied to each new/test/validation/prediction dataset.

As an experiment, let’s split our dataset into train and test:

library(caret)
set.seed(2021)
 
index = createDataPartition(churn_data$Churn, p = 0.70, list = FALSE)
train = churn_data[index, ]
test = churn_data[-index, ]

nrow(churn_data)
nrow(train)
nrow(test)
Loading required package: lattice
5976
4184
1792

Let’s rescale the TotalCharges data for the training set:

train <- train %>% mutate(TotalChargesScaled = scale(TotalCharges))
head(train) # you can see that TotalChargesStand and TotalChargesScaled differ, because the mean and standard deviation of the training subset are different
A data.frame: 6 × 25
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn TotalChargesNorm TotalChargesStand TotalChargesScaled
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl[,1]>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service Two year No Credit card (automatic) 24.1 1734.65 No 0.19799792 -0.2477481 -0.2500210
5 432 2181-UAESM Male 0 No No 2 Yes No DSL No Month-to-month No Electronic check 53.45 119.50 No 0.01162012 -0.9579766 -0.9563465
9 1480 8898-KASCD Male 0 No No 39 No No phone service DSL No One year No Mailed check 35.55 1309.15 No 0.14889799 -0.4348528 -0.4360975
10 5137 8016-NCFVO Male 1 No No 55 Yes Yes Fiber optic Yes Month-to-month Yes Electronic check 116.5 6382.55 No 0.73433533 1.7960690 1.7825644
11 3169 4578-PHJYZ Male 0 Yes Yes 52 Yes No DSL No One year Yes Electronic check 68.75 3482.85 No 0.39972883 0.5209864 0.5144889
12 4653 2091-MJTFX Female 0 Yes Yes 30 No No phone service DSL Yes Month-to-month No Credit card (automatic) 51.2 1561.50 Yes 0.17801754 -0.3238872 -0.3257417

So, for the train, test and prediction data we should use the same scaling base, in this case the same mean and standard deviation.

Correct data scaling code should be like this:

# fix the mean and sd computed on the training set
meanTotalCharges = mean(train$TotalCharges, na.rm = T)
sdTotalCharges = sd(train$TotalCharges, na.rm = T)

# wrap scale() in as.numeric() to get a plain vector instead of a one-column matrix
train <- train %>% mutate(TotalChargesScaled = as.numeric(scale(TotalCharges, center = meanTotalCharges, scale = sdTotalCharges)))
test <- test %>% mutate(TotalChargesScaled = as.numeric(scale(TotalCharges, center = meanTotalCharges, scale = sdTotalCharges))) # use the parameters of the train set

sd(train$TotalChargesScaled)
sd(test$TotalChargesScaled)

# check the same value TotalCharges == 0 in the train and test sets
head(train %>% filter(TotalCharges == 0))
head(test %>% filter(TotalCharges == 0))
1
0.981778490083862
# compare with the whole dataset where TotalCharges == 0
filter(churn_data, TotalCharges == 0)
# TotalChargesScaled is the same in train/test, but in churn_data it is different, because of different scaling bases
A data.frame: 10 × 25
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService ... StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn TotalChargesNorm TotalChargesStand TotalChargesScaled
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> ... <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
6754 2775-SEFEE Male 0 No Yes 0 Yes Yes DSL ... No Two year Yes Bank transfer (automatic) 61.9 0 No 0 -1.00882 -1.00882
1340 1371-DWPAZ Female 0 Yes Yes 0 No No phone service DSL ... No Two year No Credit card (automatic) 56.05 0 No 0 -1.00882 -1.00882
3826 3213-VVOLG Male 0 Yes Yes 0 Yes Yes No ... No internet service Two year No Mailed check 25.35 0 No 0 -1.00882 -1.00882
5218 2923-ARZLG Male 0 Yes Yes 0 Yes No No ... No internet service One year Yes Mailed check 19.7 0 No 0 -1.00882 -1.00882
3331 7644-OMVMY Male 0 Yes Yes 0 Yes No No ... No internet service Two year No Mailed check 19.85 0 No 0 -1.00882 -1.00882
936 5709-LVOEQ Female 0 Yes Yes 0 Yes No DSL ... Yes Two year No Mailed check 80.85 0 No 0 -1.00882 -1.00882
753 3115-CZMZD Male 0 No Yes 0 Yes No No ... No internet service Two year No Mailed check 20.25 0 No 0 -1.00882 -1.00882
4380 2520-SGTTA Female 0 Yes Yes 0 Yes No No ... No internet service Two year No Mailed check 20.0 0 No 0 -1.00882 -1.00882
488 4472-LVYGI Female 0 Yes Yes 0 No No phone service DSL ... No Two year Yes Bank transfer (automatic) 52.55 0 No 0 -1.00882 -1.00882
1082 4367-NUYAO Male 0 Yes Yes 0 Yes Yes No ... No internet service Two year No Mailed check 25.75 0 No 0 -1.00882 -1.00882

25.3 Feature Transformation

Feature transformation involves manipulating a predictor variable in some way so as to improve its performance in the predictive model. A variety of considerations come into play when transforming features:

25.3.1 Scaling based on calculations

Sometimes, to change the data distribution before modeling, or to change the correlation between input and output variables, data scientists transform a feature with standard mathematical functions. Let’s transform TotalCharges with the logarithm, the square root, and the power of 2.

library(gridExtra)
churn_data_tmp <- churn_data %>%
        mutate(TotalChargesLog = log(TotalCharges),
              TotalChargesSqrt = sqrt(TotalCharges),
              TotalChargesPow2 = TotalCharges^2)

plot1 <- ggplot(churn_data_tmp, aes(x=TotalChargesLog)) + geom_histogram(bins = 15)
plot2 <- ggplot(churn_data_tmp, aes(x=TotalChargesSqrt)) + geom_histogram(bins = 15)
plot3 <- ggplot(churn_data_tmp, aes(x=TotalChargesPow2)) + geom_histogram(bins = 15)
grid.arrange(plot1, plot2, plot3, ncol=3) 

Let’s measure the skewness of each transformed variable with the moments package:

#install.packages("moments")
package 'moments' successfully unpacked and MD5 sums checked

library(moments)

skewness(churn_data_tmp$TotalCharges)
skewness(churn_data_tmp$TotalChargesLog)
skewness(churn_data_tmp$TotalChargesSqrt)
skewness(churn_data_tmp$TotalChargesPow2)
0.949325175892764
-0.75144252264496
0.302580803403444
1.80376407354267

Conclusion: different transformations give different data distributions and may improve model performance if you have found the correct form of the dependence between input and output parameters.


25.4 Feature Construction

The feature construction method helps create new features in the data, thereby increasing model accuracy and the overall quality of predictions. It is of two types:

(Added by me, since I have not found this technique described anywhere; everyone goes straight to modeling.) - [x] Evaluation - construction of new features on raw data from the data source.

25.4.1 Binning

Binning is done to create bins for continuous variables where they are converted to categorical variables.

Binning is the term used in scoring modeling for what is also known in machine learning as discretization: the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a scorecard.

Why Binning?

There are two types of binning: Unsupervised and Supervised.

Unsupervised binning includes automatic and manual binning. In automatic binning, bins are created without human involvement. In manual binning, we specify where the bins should be created.
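A minimal sketch of manual binning with base R’s cut(), using hypothetical tenure values and breakpoints chosen by hand:

```r
# Manual (unsupervised) binning: we choose the breakpoints ourselves
tenure <- c(1, 5, 12, 30, 48, 72)
bins <- cut(tenure, breaks = c(0, 12, 36, 72),
            labels = c("0-12", "13-36", "37-72"))
table(bins)
#  0-12 13-36 37-72
#     3     1     2
```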

Supervised binning creates bins for the continuous variable while also taking the target variable into consideration. Supervised discretization, or binning, divides a continuous feature into groups (bins) mapped to a target variable. The central idea is to find the cutpoints that maximize the difference between the groups.

In the past, analysts used to iteratively move from Fine Binning to Coarse Binning, a very time consuming process of finding manually and visually the right cutpoints (if ever). Nowadays with algorithms like ChiMerge or Recursive Partitioning, two out of several techniques available, analysts can quickly find the optimal cutpoints in seconds and evaluate the relationship with the target variable using metrics such as Weight of Evidence and Information Value.

There are many packages for creating new variables: smbinning, scorecard, rbin, InformationValue and other.

25.4.1.1 WOE binning: theory

Weight of evidence (WOE)

This is basically a technique that can be applied if we have a binary response variable and any kind of predictor variable. First we perform a reasonable binning on the response variable and then decide which form of the binary response we count as positive and which as negative. Then we calculate the percentage positive cases in each bin of the total of all positive cases. For example 20 positive cases in bin A out of 100 total positive cases in all bins equals 20 %. Next we calculate the percentage of negative cases in each bin of the total of all negative cases, for example 5 negative cases in bin A out of a total of 50 negative cases in all bins equals 10%. Then we calculate the WOE by dividing the bin percentages of positive cases by the bin percentage of negative cases, and take the logarithm. For the described example log(20/10).
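The worked example from the paragraph above can be reproduced directly:

```r
# Bin A: 20 of 100 total positive cases, 5 of 50 total negative cases
pct_pos <- 20 / 100   # 20% of all positive cases fall in bin A
pct_neg <- 5 / 50     # 10% of all negative cases fall in bin A
woe_A <- log(pct_pos / pct_neg)
woe_A  # log(2), about 0.693: positives supersede negatives in bin A
```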

Rule of thumb: if the WOE value is negative, negative cases supersede positive cases; if the WOE value is positive, positive cases supersede negative cases.

This serves the following purposes:

  • We eliminate non-linear relationships
  • We automatically scale all variables to some extent
  • We convert categorical variables to continuous variables
  • Missing data can be handled as just another factor value
  • We can build a standalone scorecard that could be applied manually by a person with a pen and a printout of all relevant variables.

It has the following disadvantages:

  • We always lose information through binning
  • The score along a single variable does not develop continuously but in steps
  • Binning requires manual revision
  • Calculating variable importance is not as straightforward as with classical logistic regression on regularly scaled variables

Information Value (IV)

By doing another sequence of calculations similar to the WOE calculation we can compute the IV. Classically this serves as a variable ranking method and allows us to perform feature selection, which is less computationally demanding than other methods.

|Information Value |Predictive Power                |
|:-----------------|:-------------------------------|
|< 0.02            |useless for prediction          |
|0.02 - 0.1        |weak predictor                  |
|0.1 - 0.3         |medium predictor                |
|0.3 - 0.5         |strong predictor                |
|> 0.5             |suspicious, too good to be true |

After the WOE values are calculated, they replace the original values in the dataset.
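The WOE and IV calculations can be reproduced by hand. The sketch below uses toy bin counts (not the churn data), chosen to match the 20%/10% example for bin A above:

```r
# Toy counts: positive (event) and negative (non-event) cases per bin
pos <- c(A = 20, B = 50, C = 30)  # 100 positive cases in total
neg <- c(A = 5,  B = 25, C = 20)  # 50 negative cases in total

dist_pos <- pos / sum(pos)  # each bin's share of all positives
dist_neg <- neg / sum(neg)  # each bin's share of all negatives

woe <- log(dist_pos / dist_neg)          # WOE per bin; bin A: log(0.20/0.10)
iv  <- sum((dist_pos - dist_neg) * woe)  # Information Value of the variable

round(woe, 3)  # A: 0.693, B: 0, C: -0.288
round(iv, 3)   # 0.098 -- a weak predictor by the table above
```

Note that this is the ln(Pos/Neg) convention, which is also the default in the scorecard package used below.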

25.4.1.2 scorecard package and woebin()-function

woebin() generates optimal binning for numerical, factor and categorical variables using methods such as tree-like segmentation or chi-square merge. woebin() can also apply custom breakpoints if the breaks_list argument is provided. The default WOE is defined as ln(Pos_i/Neg_i). If you prefer ln(Neg_i/Pos_i), set the positive argument to the negative class value, such as '0' or 'good'. If there is a zero-frequency class when calculating WOE, the zero is replaced by 0.99 to keep the WOE computable.
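The breaks_list customization can be sketched on synthetic data (the data frame, column names and cutpoints here are made up for illustration, not taken from the churn dataset):

```r
library(scorecard)
set.seed(1)

# toy data: one numeric predictor and a 0/1 target
df <- data.frame(x = runif(500, 0, 100))
df$y <- rbinom(500, 1, plogis((df$x - 50) / 20))

# force our own cutpoints instead of the automatic tree / chi-merge search
bins <- woebin(df, y = "y", breaks_list = list(x = c(25, 50, 75)))
bins$x$bin  # four bins: [-Inf,25), [25,50), [50,75), [75, Inf)
```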

# let's try to bin InternetService and TotalCharges from churn_data_tmp
churn_data_tmp <- churn_data %>%
        mutate(Churn = ifelse(Churn == "Yes", 1, 0))

bin_data <- churn_data_tmp %>% select(customerID, InternetService, TotalCharges, Churn)
head(bin_data)

#churn_data_tmp%>%select(Churn) %>% distinct()
A data.frame: 6 × 4
customerID InternetService TotalCharges Churn
<chr> <chr> <dbl> <dbl>
1 7010-BRBUU No 1734.65 0
2 9688-YGXVR Fiber optic 3973.20 0
3 9286-DOJGF Fiber optic 2869.85 1
4 6994-KERXL DSL 238.50 0
5 2181-UAESM DSL 119.50 0
6 4312-GVYNH DSL 3370.20 0
# install.packages("scorecard")
library(scorecard)

bins = woebin(bin_data, # dataset
              y = 'Churn', # target variable
              x = c("InternetService", "TotalCharges")) # variables for binning
[INFO] creating woe binning ... 
# let's view our bins
bins
$InternetService
A data.table: 3 × 12
variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
<chr> <chr> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
InternetService No 1291 0.2156699 1192 99 0.07668474 -1.4687362 0.30636195 0.5897126 No FALSE
InternetService DSL 2068 0.3454728 1671 397 0.19197292 -0.4177094 0.05417755 0.5897126 DSL FALSE
InternetService Fiber optic 2627 0.4388573 1536 1091 0.41530263 0.6774449 0.22917306 0.5897126 Fiber optic FALSE
$TotalCharges
A data.table: 4 × 12
variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
<chr> <chr> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
TotalCharges [-Inf,200) 1004 0.16772469 517 487 0.4850598 0.9597530 0.181721173 0.3097399 200 FALSE
TotalCharges [200,400) 488 0.08152355 327 161 0.3299180 0.3109760 0.008431865 0.3097399 400 FALSE
TotalCharges [400,3800) 2981 0.49799532 2266 715 0.2398524 -0.1339571 0.008651146 0.3097399 3800 FALSE
TotalCharges [3800, Inf) 1513 0.25275643 1289 224 0.1480502 -0.7304442 0.110935711 0.3097399 Inf FALSE
churn_data_tmp %>% filter(TotalCharges > 3800) %>% select(Churn) %>% group_by(Churn) %>% summarize(n())
A tibble: 2 × 2
Churn n()
<dbl> <int>
0 1289
1 224
bins$TotalCharges %>% knitr::kable() # better view for RStudio, need knitr to be installed


|variable     |bin         | count| count_distr|  neg| pos|   posprob|        woe|    bin_iv|  total_iv|breaks |is_special_values |
|:------------|:-----------|-----:|-----------:|----:|---:|---------:|----------:|---------:|---------:|:------|:-----------------|
|TotalCharges |[-Inf,200)  |  1004|   0.1677247|  517| 487| 0.4850598|  0.9597530| 0.1817212| 0.3097399|200    |FALSE             |
|TotalCharges |[200,400)   |   488|   0.0815236|  327| 161| 0.3299180|  0.3109760| 0.0084319| 0.3097399|400    |FALSE             |
|TotalCharges |[400,3800)  |  2981|   0.4979953| 2266| 715| 0.2398524| -0.1339571| 0.0086511| 0.3097399|3800   |FALSE             |
|TotalCharges |[3800, Inf) |  1513|   0.2527564| 1289| 224| 0.1480502| -0.7304442| 0.1109357| 0.3097399|Inf    |FALSE             |
# preview the plot
woebin_plot(bins$TotalCharges)
$TotalCharges

If TotalCharges is less than 200, it is the most risky group of customers: 48.5% of them are potential churners.

bins$InternetService %>% knitr::kable() 
woebin_plot(bins$InternetService)


|variable        |bin         | count| count_distr|  neg|  pos|   posprob|        woe|    bin_iv|  total_iv|breaks      |is_special_values |
|:---------------|:-----------|-----:|-----------:|----:|----:|---------:|----------:|---------:|---------:|:-----------|:-----------------|
|InternetService |No          |  1291|   0.2156699| 1192|   99| 0.0766847| -1.4687362| 0.3063620| 0.5897126|No          |FALSE             |
|InternetService |DSL         |  2068|   0.3454728| 1671|  397| 0.1919729| -0.4177094| 0.0541776| 0.5897126|DSL         |FALSE             |
|InternetService |Fiber optic |  2627|   0.4388573| 1536| 1091| 0.4153026|  0.6774449| 0.2291731| 0.5897126|Fiber optic |FALSE             |
$InternetService

Fiber optic InternetService customers are the most risky: 41.5% of them are potential churners.

The next stage is applying the bins to the variables:

bin_data_woe = woebin_ply(bin_data, bins) 
head(bin_data)
head(bin_data_woe) # compare how it changed
[INFO] converting into woe values ... 
A data.frame: 6 × 4
customerID InternetService TotalCharges Churn
<chr> <chr> <dbl> <dbl>
1 7010-BRBUU No 1734.65 0
2 9688-YGXVR Fiber optic 3973.20 0
3 9286-DOJGF Fiber optic 2869.85 1
4 6994-KERXL DSL 238.50 0
5 2181-UAESM DSL 119.50 0
6 4312-GVYNH DSL 3370.20 0
A data.table: 6 × 4
customerID Churn InternetService_woe TotalCharges_woe
<chr> <dbl> <dbl> <dbl>
7010-BRBUU 0 -1.4687362 -0.1339571
9688-YGXVR 0 0.6774449 -0.7304442
9286-DOJGF 1 0.6774449 -0.1339571
6994-KERXL 0 -0.4177094 0.3109760
2181-UAESM 0 -0.4177094 0.9597530
4312-GVYNH 0 -0.4177094 -0.1339571
bins$InternetService # "No" is replaced by WOE -1.4687362 (row 1)
A data.table: 3 × 12
variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
<chr> <chr> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
InternetService No 1291 0.2156699 1192 99 0.07668474 -1.4687362 0.30636195 0.5897126 No FALSE
InternetService DSL 2068 0.3454728 1671 397 0.19197292 -0.4177094 0.05417755 0.5897126 DSL FALSE
InternetService Fiber optic 2627 0.4388573 1536 1091 0.41530263 0.6774449 0.22917306 0.5897126 Fiber optic FALSE

So, both numerical and categorical variables can be binned by the woebin() function.

In a real-life project you should do binning in the following steps:

  1. Clean the data
  2. Split the data into train/test (+ prediction)
  3. Create bins on the train set (and save them if your model is not single-use)
  4. Apply the bins to all the sets you have

Let's see how to do it with our churn_data dataset.

library(caret)
library(gmodels)
set.seed(2021) # to fix split options

churn_data <- read.csv("../../data/telecom_users.csv") # read data
churn_data <- churn_data %>% 
    select(customerID, gender, PaymentMethod, TotalCharges, Churn) %>% # select some columns for test + target
    mutate(Churn = ifelse(Churn == "Yes", 1, 0)) # replace Churn Yes/No with 1/0 - Event/NonEvent
head(churn_data)
A data.frame: 6 × 5
customerID gender PaymentMethod TotalCharges Churn
<chr> <chr> <chr> <dbl> <dbl>
1 7010-BRBUU Male Credit card (automatic) 1734.65 0
2 9688-YGXVR Female Credit card (automatic) 3973.20 0
3 9286-DOJGF Female Bank transfer (automatic) 2869.85 1
4 6994-KERXL Male Electronic check 238.50 0
5 2181-UAESM Male Electronic check 119.50 0
6 4312-GVYNH Female Bank transfer (automatic) 3370.20 0
index = createDataPartition(churn_data$Churn, p = 0.60, list = FALSE) # select randomly indexes of the rows for train
train = churn_data[index, ]
test = churn_data[-index, ]
data_bins = woebin(train, # dataset
                   y = "Churn", # target variable
                   var_skip = "customerID") # skip the ID column
[INFO] creating woe binning ... 
data_bins
$gender
A data.table: 2 × 12
variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
<chr> <chr> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
gender Female 1770 0.4927617 1314 456 0.2576271 -0.02546718 0.0003176558 0.0006226087 Female FALSE
gender Male 1822 0.5072383 1335 487 0.2672887 0.02444876 0.0003049529 0.0006226087 Male FALSE
$PaymentMethod
A data.table: 3 × 12
variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
<chr> <chr> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
PaymentMethod Credit card (automatic)%,%Bank transfer (automatic) 1562 0.4348552 1312 250 0.1600512 -0.6249758 0.14385062 0.4324699 Credit card (automatic)%,%Bank transfer (automatic) FALSE
PaymentMethod Mailed check 822 0.2288419 665 157 0.1909976 -0.4106700 0.03472141 0.4324699 Mailed check FALSE
PaymentMethod Electronic check 1208 0.3363029 672 536 0.4437086 0.8067470 0.25389789 0.4324699 Electronic check FALSE
$TotalCharges
A data.table: 5 × 12
variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
<chr> <chr> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
TotalCharges missing 6 0.001670379 6 0 0.00000000 -0.7699879 0.0009365099 0.3222676 missing TRUE
TotalCharges [-Inf,400) 893 0.248608018 505 388 0.43449048 0.7682688 0.1693136398 0.3222676 400 FALSE
TotalCharges [400,3200) 1601 0.445712695 1207 394 0.24609619 -0.0877204 0.0033567382 0.3222676 3200 FALSE
TotalCharges [3200,6600) 850 0.236636971 711 139 0.16352941 -0.6003766 0.0727391340 0.3222676 6600 FALSE
TotalCharges [6600, Inf) 242 0.067371938 220 22 0.09090909 -1.2707632 0.0759215884 0.3222676 Inf FALSE
train_woe = woebin_ply(train, data_bins) 
head(train_woe)
test_woe = woebin_ply(test, data_bins) 
head(test_woe)
[INFO] converting into woe values ... 
A data.table: 6 × 5
customerID Churn gender_woe PaymentMethod_woe TotalCharges_woe
<chr> <dbl> <dbl> <dbl> <dbl>
7010-BRBUU 0 0.02444876 -0.6249758 -0.0877204
4367-NHWMM 0 -0.02546718 -0.4106700 0.7682688
8016-NCFVO 0 0.02444876 0.8067470 -0.6003766
4578-PHJYZ 0 0.02444876 0.8067470 -0.6003766
2091-MJTFX 1 -0.02546718 -0.6249758 -0.0877204
2277-DJJDL 0 0.02444876 0.8067470 -0.6003766
[INFO] converting into woe values ... 
A data.table: 6 × 5
customerID Churn gender_woe PaymentMethod_woe TotalCharges_woe
<chr> <dbl> <dbl> <dbl> <dbl>
9688-YGXVR 0 -0.02546718 -0.6249758 -0.6003766
9286-DOJGF 1 -0.02546718 -0.6249758 -0.0877204
6994-KERXL 0 0.02444876 0.8067470 0.7682688
2181-UAESM 0 0.02444876 0.8067470 0.7682688
4312-GVYNH 0 -0.02546718 -0.6249758 -0.6003766
2495-KZNFB 0 -0.02546718 0.8067470 -0.0877204

Now our data is ready for modeling.


25.4.1.3 Encoding

Encoding is the process in which numerical variables or features are created from categorical variables. It is a widely used method in the industry and in every model building process. It is of two types: Label Encoding and One-hot Encoding.

Label Encoding involves assigning each label a unique integer or value based on alphabetical ordering. It is the most popular and widely used encoding.

One-hot Encoding involves creating additional features or variables on the basis of unique values in categorical variables i.e. every unique value in the category will be added as a new feature.
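The difference between the two is easy to see in base R on a toy vector (not the churn data):

```r
x <- c("red", "green", "blue", "green")

# label encoding: one integer per category, in alphabetical level order
as.integer(factor(x))
# [1] 3 2 1 2

# one-hot encoding: one 0/1 column per category ("- 1" drops the intercept)
model.matrix(~ x - 1)
# columns xblue, xgreen, xred; exactly one 1 per row
```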

25.4.1.3.1 Label encoding
# Lets use client churn dataset from telco: https://www.kaggle.com/blastchar/telco-customer-churn
churn_data <- read.csv("data/telecom_users.csv")
head(churn_data)
A data.frame: 6 × 22
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
3 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
4 6739 6994-KERXL Male 0 No No 4 Yes No DSL No No No Yes Month-to-month Yes Electronic check 55.9 238.50 No
5 432 2181-UAESM Male 0 No No 2 Yes No DSL Yes No No No Month-to-month No Electronic check 53.45 119.50 No
6 2215 4312-GVYNH Female 0 Yes No 70 No No phone service DSL Yes Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No

25.4.1.3.2 Encoding with factors

A good way to encode labels in R is as factors. This approach is used for categorical data encoding. The easiest way is to convert the character values to a factor and then convert the factor to numeric:

churn_data <- churn_data %>% mutate(Partner = as.factor(Partner),
                                   Dependents = as.factor(Dependents))
head(churn_data) # data types changed for selected fields
A data.frame: 6 × 22
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<int> <chr> <chr> <int> <fct> <fct> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
3 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
4 6739 6994-KERXL Male 0 No No 4 Yes No DSL No No No Yes Month-to-month Yes Electronic check 55.9 238.50 No
5 432 2181-UAESM Male 0 No No 2 Yes No DSL Yes No No No Month-to-month No Electronic check 53.45 119.50 No
6 2215 4312-GVYNH Female 0 Yes No 70 No No phone service DSL Yes Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No
str(churn_data$Partner) # check levels
 Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 1 1 1 ...
# convert to integers
churn_data <- churn_data %>% mutate(Partner = as.integer(Partner),
                                   Dependents = as.integer(Dependents))
head(churn_data) # data types changed for selected fields
A data.frame: 6 × 22
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<int> <chr> <chr> <int> <int> <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1869 7010-BRBUU Male 0 2 2 72 Yes Yes No No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No
2 4528 9688-YGXVR Female 0 1 1 44 Yes No Fiber optic Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
3 6344 9286-DOJGF Female 1 2 1 38 Yes Yes Fiber optic No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
4 6739 6994-KERXL Male 0 1 1 4 Yes No DSL No No No Yes Month-to-month Yes Electronic check 55.9 238.50 No
5 432 2181-UAESM Male 0 1 1 2 Yes No DSL Yes No No No Month-to-month No Electronic check 53.45 119.50 No
6 2215 4312-GVYNH Female 0 2 1 70 No No phone service DSL Yes Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No

But you should check the factor-to-number conversion to keep the numbering synchronized across the train/test/prediction sets. In this case you should save the factor levels somewhere. A better way is to use encoders that are fitted once and then reused.
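A minimal base-R sketch of saving the train levels and reusing them on the test set (toy values; the unseen "Satellite" category is hypothetical):

```r
train_vals <- c("DSL", "Fiber optic", "No", "DSL")
test_vals  <- c("No", "DSL", "Satellite")  # "Satellite" never appeared in train

train_levels <- sort(unique(train_vals))   # save these levels with the model
train_enc <- as.integer(factor(train_vals, levels = train_levels))
test_enc  <- as.integer(factor(test_vals,  levels = train_levels))

train_enc  # 1 2 3 1
test_enc   # 3 1 NA -- the unseen category becomes NA instead of a wrong code
```

An NA here is a signal to handle the new category explicitly, rather than letting a refitted factor silently renumber all the other categories.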


25.4.1.3.3 Encoding with LabelEncoder from superml package
# install.packages("superml") # you also need the R6 package
also installing the dependency 'BH'

package 'BH' successfully unpacked and MD5 sums checked
package 'superml' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    D:\Temp\Rtmpum9OKD\downloaded_packages
library(superml)

label <- LabelEncoder$new() # create new encoder
print(label$fit(churn_data$InternetService)) # fit the encoder
Loading required package: R6
[1] TRUE
# encode data
churn_data$InternetService <- label$fit_transform(churn_data$InternetService) 
# do not re-run this, because InternetService is now an integer
head(churn_data)
A data.frame: 6 × 22
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<int> <chr> <chr> <int> <int> <int> <int> <chr> <chr> <dbl> ... <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1869 7010-BRBUU Male 0 2 2 72 Yes Yes 0 ... No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No
2 4528 9688-YGXVR Female 0 1 1 44 Yes No 1 ... Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
3 6344 9286-DOJGF Female 1 2 1 38 Yes Yes 1 ... No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
4 6739 6994-KERXL Male 0 1 1 4 Yes No 2 ... No No No Yes Month-to-month Yes Electronic check 55.9 238.50 No
5 432 2181-UAESM Male 0 1 1 2 Yes No 2 ... Yes No No No Month-to-month No Electronic check 53.45 119.50 No
6 2215 4312-GVYNH Female 0 2 1 70 No No phone service 2 ... Yes Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No

Now you can save your encoding configuration and use it for test/prediction sets later.


25.4.1.4 One-hot encoding

# Lets use client churn dataset from telco: https://www.kaggle.com/blastchar/telco-customer-churn
churn_data <- read.csv("data/telecom_users.csv")
head(churn_data)
A data.frame: 6 × 22
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.1 1734.65 No
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
3 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
4 6739 6994-KERXL Male 0 No No 4 Yes No DSL No No No Yes Month-to-month Yes Electronic check 55.9 238.50 No
5 432 2181-UAESM Male 0 No No 2 Yes No DSL Yes No No No Month-to-month No Electronic check 53.45 119.50 No
6 2215 4312-GVYNH Female 0 Yes No 70 No No phone service DSL Yes Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No
# check possible Gender values
gmodels::CrossTable(churn_data$gender)

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  5986 

 
          |    Female |      Male | 
          |-----------|-----------|
          |      2936 |      3050 | 
          |     0.490 |     0.510 | 
          |-----------|-----------|



 

Let's create two additional variables, Male and Female, encoded as 1/0:

# if gender is factor
#churn_data <- churn_data %>% 
   #     mutate(gender = as.character(gender)) # convert Gender to character from Factor

churn_data <- churn_data %>% 
        mutate(
            # create Male column
            Male = ifelse(gender == "Male", 1, 0),
            # create Female column
            Female = ifelse(gender == "Female", 1, 0))

churn_data |> select(gender, Male, Female) |> head()
A data.frame: 6 × 3
gender Male Female
<chr> <dbl> <dbl>
1 Male 1 0
2 Female 0 1
3 Female 0 1
4 Male 1 0
5 Male 1 0
6 Female 0 1

Let's create dummy variables for the InternetService column. So, what values does it contain?

gmodels::CrossTable(churn_data$InternetService)

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  5986 

 
            |         DSL | Fiber optic |          No | 
            |-------------|-------------|-------------|
            |        2068 |        2627 |        1291 | 
            |       0.345 |       0.439 |       0.216 | 
            |-------------|-------------|-------------|



 

The dummyVars() function from the caret package can be used to generate a complete (less than full rank parameterized) set of dummy variables from one or more factors. The function takes a formula and a data set and outputs an object that can be used to create the dummy variables via the predict method.

library(caret)

churn_data <- read.csv("data/telecom_users.csv")

dummy <- dummyVars(" ~ InternetService", data = churn_data)
new_df <- data.frame(predict(dummy, newdata = churn_data)) # predict dummy variables into a data.frame
new_df %>% head()
A data.frame: 6 × 3
InternetServiceDSL InternetServiceFiber.optic InternetServiceNo
<dbl> <dbl> <dbl>
1 0 0 1
2 0 1 0
3 0 1 0
4 1 0 0
5 1 0 0
6 1 0 0

If you want to create dummy variables for all categorical columns, just use the " ~ ." formula. Note that every character column is expanded, including customerID, which produces one column per customer.

dummy <- dummyVars(" ~ .", data = churn_data)
new_df <- data.frame(predict(dummy, newdata = churn_data)) # predict dummy variables into a data.frame
new_df %>% head()
A data.frame: 6 × 7560
X customerID0002.ORFBO customerID0003.MKNFE customerID0004.TLHLJ customerID0011.IGKFF customerID0013.EXCHZ customerID0013.MHZWF customerID0013.SMEOE customerID0014.BMAQU customerID0015.UOCOJ MonthlyCharges99.7 MonthlyCharges99.75 MonthlyCharges99.8 MonthlyCharges99.85 MonthlyCharges99.9 MonthlyCharges99.95 MonthlyChargesNULL TotalCharges ChurnNo ChurnYes
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1869 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1734.65 1 0
2 4528 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3973.20 1 0
3 6344 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2869.85 0 1
4 6739 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 238.50 1 0
5 432 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 119.50 1 0
6 2215 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3370.20 1 0

You can exclude some variables, for example Churn, because it is the target variable:

dummy <- dummyVars(" ~. -Churn", data = churn_data)
new_df <- data.frame(predict(dummy, newdata = churn_data)) # predict dummy variables into a data.frame
new_df %>% head()
A data.frame: 6 × 7558
X customerID0002.ORFBO customerID0003.MKNFE customerID0004.TLHLJ customerID0011.IGKFF customerID0013.EXCHZ customerID0013.MHZWF customerID0013.SMEOE customerID0014.BMAQU customerID0015.UOCOJ ... MonthlyCharges99.6 MonthlyCharges99.65 MonthlyCharges99.7 MonthlyCharges99.75 MonthlyCharges99.8 MonthlyCharges99.85 MonthlyCharges99.9 MonthlyCharges99.95 MonthlyChargesNULL TotalCharges
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ... <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1869 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1734.65
2 4528 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 3973.20
3 6344 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2869.85
4 6739 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 238.50
5 432 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 119.50
6 2215 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 3370.20

If you need only a few variables, combine them with the + operator:

dummy <- dummyVars(" ~ InternetService + PhoneService", data = churn_data)
new_df <- data.frame(predict(dummy, newdata = churn_data)) # predict dummy variables into a data.frame
new_df %>% head()
A data.frame: 6 × 5
InternetServiceDSL InternetServiceFiber.optic InternetServiceNo PhoneServiceNo PhoneServiceYes
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 1 0 1
2 0 1 0 0 1
3 0 1 0 0 1
4 1 0 0 0 1
5 1 0 0 0 1
6 1 0 0 1 0

25.4.2 Evaluate data on raw dataset

The previous data transformation approaches used prepared datasets. But sometimes you need to create new variables from raw data that cannot be attached to your existing dataset and used for modeling as-is.

Let's look at an example with customer transaction data.

Preview the data file in Excel before reading it into R!
# read transaction data from dataset
# explore file in Excel before reading to check sheet numbers and tables structure
library(xlsx)
demographics <- read.xlsx("data/transactions.xlsx", sheetIndex = 1)
head(demographics)
java.home option: 

JAVA_HOME environment variable: C:\Program Files\Java\jdk-19\bin;C:\Program Files\Common Files\Oracle\Java\javapath\java.exe;C:\Program Files\Common Files\Oracle\Java\javapath\;

Warning message in fun(libname, pkgname):
"Java home setting is INVALID, it will be ignored.
Please do NOT set it unless you want to override system settings."
A data.frame: 4 × 4
CustomerID Gender Email VisitsLastYear
<dbl> <chr> <chr> <dbl>
1 787987456 Male Yes 12
2 456415151 Male NULL 0
3 215454555 Female No 16
4 985121122 Female No 4
# read transactions
transactions <- read.xlsx("data/transactions.xlsx", sheetIndex = 2)
head(transactions)
A data.frame: 6 × 7
CustomerID ContractID Date Time ContractorID UsdEquiv_sum Type
<dbl> <dbl> <date> <dbl> <chr> <dbl> <chr>
1 215454555 19065798 2019-03-15 114537 NULL 6293.7111 credet
2 215454555 19065798 2019-04-05 102525 NULL 914.9459 credet
3 215454555 19065798 2019-05-11 80833 NULL -4655.9111 debet
4 215454555 19065798 2019-05-16 74606 NULL -13900.6889 debet
5 215454555 19065798 2019-05-30 104506 NULL 2102.0570 credet
6 215454555 19065798 2019-08-23 122656 NULL -16244.2333 debet

Our next task is to construct new features based on the transaction history:

# let's find the maximum transaction date for each client
library(tidyverse) # includes magrittr, dplyr
max_dates <- transactions %>%
        group_by(CustomerID) %>%
        summarise(MaxDate = max(Date))
max_dates
-- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.2 --
v tibble  3.1.8     v purrr   0.3.5
v tidyr   1.2.1     v stringr 1.4.1
v readr   2.1.3     v forcats 0.5.2
-- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
x gridExtra::combine() masks dplyr::combine()
x dplyr::filter()      masks stats::filter()
x dplyr::lag()         masks stats::lag()
x purrr::lift()        masks caret::lift()
A tibble: 4 × 2
CustomerID MaxDate
<dbl> <date>
215454555 2021-04-30
456415151 2021-01-22
787987456 2019-12-06
985121122 2020-09-09
library(lubridate) #for datetime manipulation
current_date <- ymd("2021-04-30")
max_dates <- max_dates %>%
        mutate(DaysDiff = as.period(current_date - MaxDate) %>% day()) # find period and convert it to days
max_dates

Attaching package: 'lubridate'


The following objects are masked from 'package:base':

    date, intersect, setdiff, union

A tibble: 4 × 3
CustomerID MaxDate DaysDiff
<dbl> <date> <dbl>
215454555 2021-04-30 0
456415151 2021-01-22 98
787987456 2019-12-06 511
985121122 2020-09-09 233
# let's calculate the Churn feature
max_dates <- max_dates %>%
        mutate(Churn = ifelse(DaysDiff > 60, 1, 0))
max_dates
A tibble: 4 × 4
CustomerID MaxDate DaysDiff Churn
<dbl> <date> <dbl> <dbl>
215454555 2021-04-30 0 0
456415151 2021-01-22 98 1
787987456 2019-12-06 511 1
985121122 2020-09-09 233 1

Finally, you can merge the code into one query:

churn_eval <- transactions %>%
        group_by(CustomerID) %>%
        summarise(MaxDate = max(Date)) %>%
        mutate(DaysDiff = as.period(current_date - MaxDate) %>% day(),
               Churn = ifelse(DaysDiff > 60, 1, 0)) %>%
        select(CustomerID, Churn) # Select only CustomerID and new feature / target
churn_eval # we will merge it with demographics later
A tibble: 4 × 2
CustomerID Churn
<dbl> <dbl>
215454555 0
456415151 1
787987456 1
985121122 1

25.4.2.1 Task. AverageContractSum3M_Credet

TASK: calculate the average contract sum for each customer by credet type for the last 3 months before the Churn/NotChurn event.

To calculate this feature we should follow these steps:

  1. Find the max transaction date
  2. Find the transaction date 3 months ago (max transaction date - 3 months)
  3. Filter only the records in the range (-3 months, max transaction date)
  4. Group the data by contract and find each contract's sum
  5. Group the data by customer and find each customer's average
avgContractSum3m <- transactions %>%
        filter(Type == "credet") %>% # only credet
        group_by(CustomerID) %>% 
        mutate(MaxDate = max(Date), # get the max transaction date
               Month3Date = MaxDate %m+% months(-3)) %>% # get the date 3 months before the max transaction date
       filter(Date >= Month3Date) %>% # keep only the transactions from the last 3 months
       group_by(CustomerID, ContractID) %>% # group by customer and contract
       summarize(ContractSum = sum(UsdEquiv_sum), .groups = 'drop') %>% # find sum by each contract
       group_by(CustomerID) %>% # group by customer
       summarise(AverageContractSum3M_Credet = mean(ContractSum)) # find mean by each customer

avgContractSum3m
A tibble: 4 × 2
CustomerID AverageContractSum3M_Credet
<dbl> <dbl>
215454555 45541.3335
456415151 1271.5930
787987456 116.6667
985121122 4444.4444

25.4.2.2 Task. Find the average credet sum increase over the last 3 months

  1. Get the date ranges for months -1, -2, -3 by credet:
monthByMonth <- transactions %>%
        filter(Type == "credet") %>%
        select(CustomerID, Date, UsdEquiv_sum) %>%
        group_by(CustomerID) %>%
        mutate(MaxDate = max(Date),
               Month1Before = MaxDate %m+% months(-1),
               Month2Before = MaxDate %m+% months(-2),
               Month3Before = MaxDate %m+% months(-3)) 
head(monthByMonth)
A grouped_df: 6 × 7
CustomerID Date UsdEquiv_sum MaxDate Month1Before Month2Before Month3Before
<dbl> <date> <dbl> <date> <date> <date> <date>
215454555 2019-03-15 6293.7111 2021-04-30 2021-03-30 2021-02-28 2021-01-30
215454555 2019-04-05 914.9459 2021-04-30 2021-03-30 2021-02-28 2021-01-30
215454555 2019-05-30 2102.0570 2021-04-30 2021-03-30 2021-02-28 2021-01-30
215454555 2019-10-09 10854.7630 2021-04-30 2021-03-30 2021-02-28 2021-01-30
215454555 2019-11-27 639.5389 2021-04-30 2021-03-30 2021-02-28 2021-01-30
215454555 2019-12-20 2654.9459 2021-04-30 2021-03-30 2021-02-28 2021-01-30
  2. A big query calculates the sums for each month and finds the increases:
credetInc <- monthByMonth %>%
    filter(Date >= Month1Before & Date <= MaxDate) %>% # data in range from Month1Before to MaxDate
    group_by(CustomerID) %>%
    summarise(M1Sum = sum(UsdEquiv_sum)) %>% # sum for month -1
    left_join(monthByMonth %>% # join with data about month -2
              filter(Date >= Month2Before & Date <= Month1Before) %>% # data in range from Month2Before to Month1Before
              group_by(CustomerID) %>%
              summarise(M2Sum = sum(UsdEquiv_sum)), by = "CustomerID") %>%
            left_join(monthByMonth %>% # join with data about month -3
              filter(Date >= Month3Before & Date <= Month2Before) %>% # data in range from Month3Before to Month2Before
              group_by(CustomerID) %>%
              summarise(M3Sum = sum(UsdEquiv_sum)), by = "CustomerID") %>%
    rowwise %>% # calculate separately for each row, not for columns
    mutate(AverageIncrease3M_Credet = mean(c(M1Sum/M2Sum, M2Sum/M3Sum), na.rm = T)) %>% # the increase is the average of the month-to-month sum ratios
    select(CustomerID, AverageIncrease3M_Credet)
        
head(credetInc)
A rowwise_df: 4 × 2
CustomerID AverageIncrease3M_Credet
<dbl> <dbl>
215454555 8.3372521811
456415151 NaN
787987456 0.0003175611
985121122 NaN

Join all the tables:

final_set <- demographics %>% 
        left_join(avgContractSum3m , by = "CustomerID") %>%
        left_join(credetInc , by = "CustomerID") %>%
        left_join(churn_eval, by = "CustomerID")
final_set
A data.frame: 4 × 7
CustomerID Gender Email VisitsLastYear AverageContractSum3M_Credet AverageIncrease3M_Credet Churn
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
787987456 Male Yes 12 116.6667 0.0003175611 1
456415151 Male NULL 0 1271.5930 NaN 1
215454555 Female No 16 45541.3335 8.3372521811 0
985121122 Female No 4 4444.4444 NaN 1

In the next stages this data can be transformed with scaling, encoding, and binning.


25.5 References

  1. Feature Engineering in R Programming by (dhruv5819?)
  2. What is Feature Engineering? by Tim Bok
  3. Feature Scaling-Why it is required? by Rahul Saini
  4. Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization by ANIRUDDHA BHANDARI
  5. R Package ‘smbinning’: Optimal Binning for Scoring Modeling by Herman Jopia
  6. Garcia, S. et al (2013) A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 4, April 2013.
  7. An Overview on the Landscape of R Packages for Credit Scoring by Gero Szepannek