3  CSV

Автор

Юрій Клебан


You need this packages for code execution:

# install.packages("readr")
invisible(Sys.setlocale("LC_ALL", "Ukrainian"))
invisible(options(warn=-1))

3.1 What is CSV (Comma Separated Values)?

CSV - comma separated values.

# lets check current working directory to write correct files path
getwd()
'E:/Repos/Season 2022/r-book/_book/docs/data-analysis-en'

You can use / or \\ for writing correct path in R. For example:

path = "d:/projects/file.csv"
path = "d:\\projects\\file.csv"

To combine path use paste() or paste0() functions

work_dir = getwd()
work_dir 
'E:/Repos/Season 2022/r-book/_book/docs/data-analysis-en'
file_name = "temp_file.csv"
file_path = paste0(work_dir, "/", file_name)
file_path
'E:/Repos/Season 2022/r-book/_book/docs/data-analysis-en/temp_file.csv'
file_path = paste(work_dir, file_name, sep = "/")
file_path
'E:/Repos/Season 2022/r-book/_book/docs/data-analysis-en/temp_file.csv'

3.2 Sample dataset description

Information about dataset from kaggle.com. Original file located at url: https://www.kaggle.com/radmirzosimov/telecom-users-dataset.

Any business wants to maximize the number of customers. To achieve this goal, it is important not only to try to attract new ones, but also to retain existing ones. Retaining a client will cost the company less than attracting a new one. In addition, a new client may be weakly interested in business services and it will be difficult to work with him, while old clients already have the necessary data on interaction with the service.

Accordingly, predicting the churn, we can react in time and try to keep the client who wants to leave. Based on the data about the services that the client uses, we can make him a special offer, trying to change his decision to leave the operator. This will make the task of retention easier to implement than the task of attracting new users, about which we do not know anything yet.

You are provided with a dataset from a telecommunications company. The data contains information about almost six thousand users, their demographic characteristics, the services they use, the duration of using the operator’s services, the method of payment, and the amount of payment.

The task is to analyze the data and predict the churn of users (to identify people who will and will not renew their contract). The work should include the following mandatory items:

  1. Description of the data (with the calculation of basic statistics);
  2. Research of dependencies and formulation of hypotheses;
  3. Building models for predicting the outflow (with justification for the choice of a particular model) 4. based on tested hypotheses and identified relationships;
  4. Comparison of the quality of the obtained models.

Fields description:

3.3 Reading

Thare are few methods for reading/writing csv in base package:

Before using any new function check it usage information with help(function_name) or ?function_name, example: ?read.csv.

You can read (current data set has NA values as example, there are no NA in original datase):

data <- read.csv("../../data/telecom_users.csv") # default reading
str(data)
'data.frame':   5986 obs. of  22 variables:
 $ X               : int  1869 4528 6344 6739 432 2215 5260 6001 1480 5137 ...
 $ customerID      : chr  "7010-BRBUU" "9688-YGXVR" "9286-DOJGF" "6994-KERXL" ...
 $ gender          : chr  "Male" "Female" "Female" "Male" ...
 $ SeniorCitizen   : int  0 0 1 0 0 0 0 0 0 1 ...
 $ Partner         : chr  "Yes" "No" "Yes" "No" ...
 $ Dependents      : chr  "Yes" "No" "No" "No" ...
 $ tenure          : int  72 44 38 4 2 70 33 1 39 55 ...
 $ PhoneService    : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ MultipleLines   : chr  "Yes" "No" "Yes" "No" ...
 $ InternetService : chr  "No" "Fiber optic" "Fiber optic" "DSL" ...
 $ OnlineSecurity  : chr  "No internet service" "No" "No" "No" ...
 $ OnlineBackup    : chr  "No internet service" "Yes" "No" "No" ...
 $ DeviceProtection: chr  "No internet service" "Yes" "No" "No" ...
 $ TechSupport     : chr  "No internet service" "No" "No" "No" ...
 $ StreamingTV     : chr  "No internet service" "Yes" "No" "No" ...
 $ StreamingMovies : chr  "No internet service" "No" "No" "Yes" ...
 $ Contract        : chr  "Two year" "Month-to-month" "Month-to-month" "Month-to-month" ...
 $ PaperlessBilling: chr  "No" "Yes" "Yes" "Yes" ...
 $ PaymentMethod   : chr  "Credit card (automatic)" "Credit card (automatic)" "Bank transfer (automatic)" "Electronic check" ...
 $ MonthlyCharges  : chr  "24.1" "88.15" "74.95" "55.9" ...
 $ TotalCharges    : num  1735 3973 2870 238 120 ...
 $ Churn           : chr  "No" "No" "Yes" "No" ...
data <- read.csv("../../data/telecom_users.csv",
                  sep = ",", # comma not only possibel separator
                  dec = ".", # decimal separator can be different
                  na.strings = c("", "NA", "NULL")) # you can define NA values
str(data) # chack data structure / types/ values
'data.frame':   5986 obs. of  22 variables:
 $ X               : int  1869 4528 6344 6739 432 2215 5260 6001 1480 5137 ...
 $ customerID      : chr  "7010-BRBUU" "9688-YGXVR" "9286-DOJGF" "6994-KERXL" ...
 $ gender          : chr  "Male" "Female" "Female" "Male" ...
 $ SeniorCitizen   : int  0 0 1 0 0 0 0 0 0 1 ...
 $ Partner         : chr  "Yes" "No" "Yes" "No" ...
 $ Dependents      : chr  "Yes" "No" "No" "No" ...
 $ tenure          : int  72 44 38 4 2 70 33 1 39 55 ...
 $ PhoneService    : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ MultipleLines   : chr  "Yes" "No" "Yes" "No" ...
 $ InternetService : chr  "No" "Fiber optic" "Fiber optic" "DSL" ...
 $ OnlineSecurity  : chr  "No internet service" "No" "No" "No" ...
 $ OnlineBackup    : chr  "No internet service" "Yes" "No" "No" ...
 $ DeviceProtection: chr  "No internet service" "Yes" "No" "No" ...
 $ TechSupport     : chr  "No internet service" "No" "No" "No" ...
 $ StreamingTV     : chr  "No internet service" "Yes" "No" "No" ...
 $ StreamingMovies : chr  "No internet service" "No" "No" "Yes" ...
 $ Contract        : chr  "Two year" "Month-to-month" "Month-to-month" "Month-to-month" ...
 $ PaperlessBilling: chr  "No" "Yes" "Yes" "Yes" ...
 $ PaymentMethod   : chr  "Credit card (automatic)" "Credit card (automatic)" "Bank transfer (automatic)" "Electronic check" ...
 $ MonthlyCharges  : num  24.1 88.2 75 55.9 53.5 ...
 $ TotalCharges    : num  1735 3973 2870 238 120 ...
 $ Churn           : chr  "No" "No" "Yes" "No" ...
head(data, 2) # top 6 rows, use n = X, for viewing top X lines
A data.frame: 2 × 22
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> ... <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
1 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No ... No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.10 1734.65 No
2 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic ... Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
is.data.frame(data) # if data is data.frame
TRUE
anyNA(data) # if dataframe contains any NA values
TRUE
lapply(data, anyNA)
#lapply(, any) #check NA by 2nd dimension - columns
$X
FALSE
$customerID
FALSE
$gender
FALSE
$SeniorCitizen
FALSE
$Partner
FALSE
$Dependents
FALSE
$tenure
FALSE
$PhoneService
FALSE
$MultipleLines
FALSE
$InternetService
FALSE
$OnlineSecurity
FALSE
$OnlineBackup
FALSE
$DeviceProtection
FALSE
$TechSupport
FALSE
$StreamingTV
FALSE
$StreamingMovies
FALSE
$Contract
FALSE
$PaperlessBilling
FALSE
$PaymentMethod
FALSE
$MonthlyCharges
TRUE
$TotalCharges
TRUE
$Churn
FALSE

Check MonthlyCharges: TRUE and TotalCharges: TRUE. These columns has NA-values.

Let’s replace them with mean:

data[is.na(data$TotalCharges), "TotalCharges"] <- mean(data$TotalCharges, na.rm = T)
data[is.na(data$MonthlyCharges), "MonthlyCharges"] <- mean(data$MonthlyCharges, na.rm = T)
any(is.na(data)) # check for NA
FALSE

You can write data with write.csv(), write.csv2() from base package.

write.csv(data, file = "../../data/cleaned_data.csv", row.names = F)
# by default row.names = TRUE and file will contain first column with row numbers 1,2, ..., N
ERROR: Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class '"function"' to a data.frame

Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class '"function"' to a data.frame
Traceback:

1. write.csv(data, file = "../../data/cleaned_data.csv", row.names = F)
2. eval.parent(Call)
3. eval(expr, p)
4. eval(expr, p)
5. utils::write.table(data, file = "../../data/cleaned_data.csv", 
 .     row.names = F, col.names = TRUE, sep = ",", dec = ".", qmethod = "double")
6. data.frame(x)
7. as.data.frame(x[[i]], optional = TRUE)
8. as.data.frame.default(x[[i]], optional = TRUE)
9. stop(gettextf("cannot coerce class %s to a data.frame", sQuote(deparse(class(x))[1L])), 
 .     domain = NA)

3.4 readr package

One more useful package is readr. Examples of using:

# library(readr)
# data <- read_csv(file = "../../data/telecom_users.csv")
# data <- read_csv2(file = "../../data/telecom_users.csv")`

3.5 Набори даних

  1. https://github.com/kleban/r-book-published/tree/main/datasets/telecom_users.csv
  2. https://github.com/kleban/r-book-published/tree/main/datasets/telecom_sers.xlsx
  3. https://github.com/kleban/r-book-published/tree/main/datasets/Default_Fin.csv
  4. https://github.com/kleban/r-book-published/tree/main/datasets/employes.xml

3.6 References

  1. SQLite in R. Datacamp
  2. Tidyverse googlesheets4 0.2.0
  3. Binanace spot Api Docs
  4. Web Scraping in R: rvest Tutorial by Arvid Kingl