You need these packages to run the code:
3.1 What is CSV (Comma Separated Values)?
- CSV: comma-separated values, a plain-text format in which each line is one record and the fields are separated by commas.
You can use / or \\ to write a correct file path in R. For example:
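As a minimal sketch of the two spellings, the snippet below writes the same relative path with a forward slash and with an escaped backslash. The directory and file name are hypothetical placeholders; note that a backslash works as a separator on Windows only, while the forward slash is accepted on every platform.

```r
# Two ways to spell the same relative path (hypothetical directory and file name).
path_forward <- "data/telecom_users.csv"
path_escaped <- "data\\telecom_users.csv"  # "\\" in source code is ONE backslash in the string

# cat() prints the string as R stores it, with a single backslash.
cat(path_forward, "\n")
cat(path_escaped, "\n")

# Both strings have the same length, because the escape is a single character.
same_length <- nchar(path_forward) == nchar(path_escaped)
print(same_length)
```

A single unescaped backslash such as "data\telecom_users.csv" would be read as the escape sequence \t (a tab), which is why the backslash must be doubled.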
To combine parts of a path, use paste() or paste0()
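A short sketch of combining path parts follows. The directory and file names are hypothetical; file.path() is a base R alternative that is not mentioned above and is shown only for comparison.

```r
data_dir  <- "data"               # hypothetical directory name
file_name <- "telecom_users.csv"  # hypothetical file name

# paste() inserts the separator given in `sep` between its arguments.
path1 <- paste(data_dir, file_name, sep = "/")

# paste0() concatenates without any separator, so "/" is passed explicitly.
path2 <- paste0(data_dir, "/", file_name)

# file.path() joins its arguments with "/" by default.
path3 <- file.path(data_dir, file_name)

print(path1)
print(identical(path1, path2) && identical(path2, path3))
```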
3.2 Sample dataset description
The dataset comes from kaggle.com. The original file is located at: https://www.kaggle.com/radmirzosimov/telecom-users-dataset.
Any business wants to maximize its number of customers. To achieve this goal, it is important not only to attract new customers but also to retain existing ones. Retaining a client costs the company less than attracting a new one. In addition, a new client may be only weakly interested in the company's services and difficult to work with, while for existing clients the company already has data on their interaction with the service.
Accordingly, by predicting churn, we can react in time and try to keep a client who wants to leave. Based on data about the services the client uses, we can make them a special offer to try to change their decision to leave the operator. This makes retention easier to implement than attracting new users, about whom we do not yet know anything.
You are provided with a dataset from a telecommunications company. The data contains information about almost six thousand users, their demographic characteristics, the services they use, the duration of using the operator’s services, the method of payment, and the amount of payment.
The task is to analyze the data and predict the churn of users (to identify people who will and will not renew their contract). The work should include the following mandatory items:
- Description of the data (with the calculation of basic statistics);
- Research of dependencies and formulation of hypotheses;
- Building models for predicting the outflow (with justification for the choice of a particular model) based on tested hypotheses and identified relationships;
- Comparison of the quality of the obtained models.
Fields description:
3.3 Reading
There are a few functions for reading and writing CSV files in base R.
Before using any new function, check its usage information with help(function_name) or ?function_name, for example: ?read.csv
You can read the file as follows (the current data set has NA values added as an example; there are no NA values in the original dataset):
'data.frame': 5986 obs. of 22 variables:
$ X : int 1869 4528 6344 6739 432 2215 5260 6001 1480 5137 ...
$ customerID : chr "7010-BRBUU" "9688-YGXVR" "9286-DOJGF" "6994-KERXL" ...
$ gender : chr "Male" "Female" "Female" "Male" ...
$ SeniorCitizen : int 0 0 1 0 0 0 0 0 0 1 ...
$ Partner : chr "Yes" "No" "Yes" "No" ...
$ Dependents : chr "Yes" "No" "No" "No" ...
$ tenure : int 72 44 38 4 2 70 33 1 39 55 ...
$ PhoneService : chr "Yes" "Yes" "Yes" "Yes" ...
$ MultipleLines : chr "Yes" "No" "Yes" "No" ...
$ InternetService : chr "No" "Fiber optic" "Fiber optic" "DSL" ...
$ OnlineSecurity : chr "No internet service" "No" "No" "No" ...
$ OnlineBackup : chr "No internet service" "Yes" "No" "No" ...
$ DeviceProtection: chr "No internet service" "Yes" "No" "No" ...
$ TechSupport : chr "No internet service" "No" "No" "No" ...
$ StreamingTV : chr "No internet service" "Yes" "No" "No" ...
$ StreamingMovies : chr "No internet service" "No" "No" "Yes" ...
$ Contract : chr "Two year" "Month-to-month" "Month-to-month" "Month-to-month" ...
$ PaperlessBilling: chr "No" "Yes" "Yes" "Yes" ...
$ PaymentMethod : chr "Credit card (automatic)" "Credit card (automatic)" "Bank transfer (automatic)" "Electronic check" ...
$ MonthlyCharges : chr "24.1" "88.15" "74.95" "55.9" ...
$ TotalCharges : num 1735 3973 2870 238 120 ...
$ Churn : chr "No" "No" "Yes" "No" ...
'data.frame': 5986 obs. of 22 variables:
$ X : int 1869 4528 6344 6739 432 2215 5260 6001 1480 5137 ...
$ customerID : chr "7010-BRBUU" "9688-YGXVR" "9286-DOJGF" "6994-KERXL" ...
$ gender : chr "Male" "Female" "Female" "Male" ...
$ SeniorCitizen : int 0 0 1 0 0 0 0 0 0 1 ...
$ Partner : chr "Yes" "No" "Yes" "No" ...
$ Dependents : chr "Yes" "No" "No" "No" ...
$ tenure : int 72 44 38 4 2 70 33 1 39 55 ...
$ PhoneService : chr "Yes" "Yes" "Yes" "Yes" ...
$ MultipleLines : chr "Yes" "No" "Yes" "No" ...
$ InternetService : chr "No" "Fiber optic" "Fiber optic" "DSL" ...
$ OnlineSecurity : chr "No internet service" "No" "No" "No" ...
$ OnlineBackup : chr "No internet service" "Yes" "No" "No" ...
$ DeviceProtection: chr "No internet service" "Yes" "No" "No" ...
$ TechSupport : chr "No internet service" "No" "No" "No" ...
$ StreamingTV : chr "No internet service" "Yes" "No" "No" ...
$ StreamingMovies : chr "No internet service" "No" "No" "Yes" ...
$ Contract : chr "Two year" "Month-to-month" "Month-to-month" "Month-to-month" ...
$ PaperlessBilling: chr "No" "Yes" "Yes" "Yes" ...
$ PaymentMethod : chr "Credit card (automatic)" "Credit card (automatic)" "Bank transfer (automatic)" "Electronic check" ...
$ MonthlyCharges : num 24.1 88.2 75 55.9 53.5 ...
$ TotalCharges : num 1735 3973 2870 238 120 ...
$ Churn : chr "No" "No" "Yes" "No" ...
 | X | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn
 | <int> | <chr> | <chr> | <int> | <chr> | <chr> | <int> | <chr> | <chr> | <chr> | ... | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <chr>
1 | 1869 | 7010-BRBUU | Male | 0 | Yes | Yes | 72 | Yes | Yes | No | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Credit card (automatic) | 24.10 | 1734.65 | No
2 | 4528 | 9688-YGXVR | Female | 0 | No | No | 44 | Yes | No | Fiber optic | ... | Yes | No | Yes | No | Month-to-month | Yes | Credit card (automatic) | 88.15 | 3973.20 | No
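The reading step shown above can be sketched with base read.csv(). So that the example runs anywhere, it first writes a tiny made-up CSV to a temporary file; the rows, the variable name users, and the choice of na.strings are illustrative assumptions, not the original call used for the output above.

```r
# Create a small, made-up CSV file so that the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "customerID,tenure,MonthlyCharges,Churn",
  "A-1,72,24.10,No",
  "A-2,44,,No",
  "A-3,38,74.95,Yes"
), csv_path)

# na.strings lists the text values that should be read as NA.
users <- read.csv(csv_path,
                  na.strings = c("", "NA"),
                  stringsAsFactors = FALSE)

str(users)      # structure: column names, types, first values
head(users, 2)  # first two rows
```

If a numeric column contains text that is not listed in na.strings, read.csv() reads the whole column as character, so it is worth checking the column types with str() after reading.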
- $X
- $customerID
- $gender
- $SeniorCitizen
- $Partner
- $Dependents
- $tenure
- $PhoneService
- $MultipleLines
- $InternetService
- $OnlineSecurity
- $OnlineBackup
- $DeviceProtection
- $TechSupport
- $StreamingTV
- $StreamingMovies
- $Contract
- $PaperlessBilling
- $PaymentMethod
- $MonthlyCharges
- $TotalCharges
- $Churn
Check MonthlyCharges: TRUE and TotalCharges: TRUE. These columns have NA values.
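One possible way to run such a check is sketched below with base R functions. The small data frame and the name users are made up for illustration.

```r
# Made-up data with missing values in two columns.
users <- data.frame(
  tenure         = c(72L, 44L, 38L),
  MonthlyCharges = c(24.10, NA, 74.95),
  TotalCharges   = c(1734.65, 3973.20, NA)
)

# anyNA() returns TRUE if a vector contains at least one NA.
anyNA(users$MonthlyCharges)
anyNA(users$TotalCharges)

# Apply the same check to every column at once.
has_na <- sapply(users, anyNA)
print(has_na)

# Count the NA values per column.
na_counts <- colSums(is.na(users))
print(na_counts)
```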
Let’s replace them with the column mean:
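A minimal sketch of mean imputation in base R is given below. The data frame is made up, and the loop is only one of several ways to do this; the mean is computed with na.rm = TRUE so that the missing values themselves are ignored.

```r
# Made-up data with one missing value in each column.
users <- data.frame(
  MonthlyCharges = c(20, NA, 40),
  TotalCharges   = c(100, 300, NA)
)

for (col in c("MonthlyCharges", "TotalCharges")) {
  # Mean of the observed (non-missing) values in this column.
  col_mean <- mean(users[[col]], na.rm = TRUE)
  # Overwrite only the positions that are NA.
  users[[col]][is.na(users[[col]])] <- col_mean
}

print(users)
```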
You can write data with write.csv() or write.csv2() from base R:
write.csv(data, file = "../../data/cleaned_data.csv", row.names = FALSE)
# by default row.names = TRUE and the file will contain a first column with the row numbers 1, 2, ..., N
This call fails if no data frame named data has been created in the session, because R then finds the built-in function data() instead:
Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class '"function"' to a data.frame
Assign the data frame that was read earlier to data (or pass its actual name) before calling write.csv().
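A working version of the writing step might look like the sketch below. It uses a small made-up data frame named users and a temporary file instead of the path shown above, so both names are assumptions for illustration only.

```r
# Made-up data frame to write.
users <- data.frame(
  customerID     = c("A-1", "A-2"),
  MonthlyCharges = c(24.10, 88.15)
)

out_path <- tempfile(fileext = ".csv")

# row.names = FALSE keeps the row numbers out of the file.
write.csv(users, file = out_path, row.names = FALSE)

# Inspect the raw text that was written.
readLines(out_path)

# Read the file back to confirm that the data survived the round trip.
round_trip <- read.csv(out_path, stringsAsFactors = FALSE)
print(round_trip)
```

write.csv2() takes the same arguments but writes a semicolon as the field separator and a comma as the decimal mark, which matches the CSV convention of many European locales.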
3.4 readr
One more useful package is readr. Examples of use:
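The sketch below shows one possible use of readr, assuming the package is installed. The data frame, the temporary file, and the explicit column types are illustrative choices, not the original examples from this section.

```r
library(readr)

# Made-up data frame with one missing value.
users <- data.frame(
  customerID     = c("A-1", "A-2", "A-3"),
  MonthlyCharges = c(24.10, NA, 74.95)
)

out_path <- tempfile(fileext = ".csv")

# write_csv() never writes row names; NA is written as "NA" by default.
write_csv(users, out_path)

# read_csv() returns a tibble; col_types states the expected column types
# explicitly instead of letting readr guess them.
users_back <- read_csv(
  out_path,
  col_types = cols(
    customerID     = col_character(),
    MonthlyCharges = col_double()
  )
)

print(users_back)
```

Compared with base read.csv(), read_csv() returns a tibble rather than a plain data frame and never converts strings to factors.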
3.5 Datasets
- https://github.com/kleban/r-book-published/tree/main/datasets/telecom_users.csv
- https://github.com/kleban/r-book-published/tree/main/datasets/telecom_sers.xlsx
- https://github.com/kleban/r-book-published/tree/main/datasets/Default_Fin.csv
- https://github.com/kleban/r-book-published/tree/main/datasets/employes.xml