3 CSV

Автор

Юрій Клебан

You need this packages for code execution:

# install.packages("readr")

invisible(Sys.setlocale("LC_ALL", "Ukrainian"))
invisible(options(warn=-1))

3.1 What is CSV (Comma Separated Values)?

CSV - comma separated values.

# lets check current working directory to write correct files path
getwd()

'E:/Repos/Season 2022/r-book/_book/docs/data-analysis-en'

You can use / or \\ for writing correct path in R. For example:

path = "d:/projects/file.csv"
path = "d:\\projects\\file.csv"

To combine path use paste() or paste0() functions

work_dir = getwd()
work_dir

'E:/Repos/Season 2022/r-book/_book/docs/data-analysis-en'

file_name = "temp_file.csv"
file_path = paste0(work_dir, "/", file_name)
file_path

'E:/Repos/Season 2022/r-book/_book/docs/data-analysis-en/temp_file.csv'

file_path = paste(work_dir, file_name, sep = "/")
file_path

'E:/Repos/Season 2022/r-book/_book/docs/data-analysis-en/temp_file.csv'

3.2 Sample dataset description

Information about dataset from kaggle.com. Original file located at url: https://www.kaggle.com/radmirzosimov/telecom-users-dataset.

Any business wants to maximize the number of customers. To achieve this goal, it is important not only to try to attract new ones, but also to retain existing ones. Retaining a client will cost the company less than attracting a new one. In addition, a new client may be weakly interested in business services and it will be difficult to work with him, while old clients already have the necessary data on interaction with the service.

Accordingly, predicting the churn, we can react in time and try to keep the client who wants to leave. Based on the data about the services that the client uses, we can make him a special offer, trying to change his decision to leave the operator. This will make the task of retention easier to implement than the task of attracting new users, about which we do not know anything yet.

You are provided with a dataset from a telecommunications company. The data contains information about almost six thousand users, their demographic characteristics, the services they use, the duration of using the operator’s services, the method of payment, and the amount of payment.

The task is to analyze the data and predict the churn of users (to identify people who will and will not renew their contract). The work should include the following mandatory items:

Description of the data (with the calculation of basic statistics);
Research of dependencies and formulation of hypotheses;
Building models for predicting the outflow (with justification for the choice of a particular model) 4. based on tested hypotheses and identified relationships;
Comparison of the quality of the obtained models.

Fields description:

3.3 Reading

Thare are few methods for reading/writing csv in base package:

read.csv(), write.csv - default data separator is ,, decimal is separator ..
read.csv2(), write.csv2 - default data separator is ;, decimal is separator ,.

Before using any new function check it usage information with help(function_name) or ?function_name, example: ?read.csv.

You can read (current data set has NA values as example, there are no NA in original datase):

data <- read.csv("../../data/telecom_users.csv") # default reading
str(data)

'data.frame':   5986 obs. of  22 variables:
 $ X               : int  1869 4528 6344 6739 432 2215 5260 6001 1480 5137 ...
 $ customerID      : chr  "7010-BRBUU" "9688-YGXVR" "9286-DOJGF" "6994-KERXL" ...
 $ gender          : chr  "Male" "Female" "Female" "Male" ...
 $ SeniorCitizen   : int  0 0 1 0 0 0 0 0 0 1 ...
 $ Partner         : chr  "Yes" "No" "Yes" "No" ...
 $ Dependents      : chr  "Yes" "No" "No" "No" ...
 $ tenure          : int  72 44 38 4 2 70 33 1 39 55 ...
 $ PhoneService    : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ MultipleLines   : chr  "Yes" "No" "Yes" "No" ...
 $ InternetService : chr  "No" "Fiber optic" "Fiber optic" "DSL" ...
 $ OnlineSecurity  : chr  "No internet service" "No" "No" "No" ...
 $ OnlineBackup    : chr  "No internet service" "Yes" "No" "No" ...
 $ DeviceProtection: chr  "No internet service" "Yes" "No" "No" ...
 $ TechSupport     : chr  "No internet service" "No" "No" "No" ...
 $ StreamingTV     : chr  "No internet service" "Yes" "No" "No" ...
 $ StreamingMovies : chr  "No internet service" "No" "No" "Yes" ...
 $ Contract        : chr  "Two year" "Month-to-month" "Month-to-month" "Month-to-month" ...
 $ PaperlessBilling: chr  "No" "Yes" "Yes" "Yes" ...
 $ PaymentMethod   : chr  "Credit card (automatic)" "Credit card (automatic)" "Bank transfer (automatic)" "Electronic check" ...
 $ MonthlyCharges  : chr  "24.1" "88.15" "74.95" "55.9" ...
 $ TotalCharges    : num  1735 3973 2870 238 120 ...
 $ Churn           : chr  "No" "No" "Yes" "No" ...

data <- read.csv("../../data/telecom_users.csv",
                  sep = ",", # comma not only possibel separator
                  dec = ".", # decimal separator can be different
                  na.strings = c("", "NA", "NULL")) # you can define NA values

str(data) # chack data structure / types/ values

'data.frame':   5986 obs. of  22 variables:
 $ X               : int  1869 4528 6344 6739 432 2215 5260 6001 1480 5137 ...
 $ customerID      : chr  "7010-BRBUU" "9688-YGXVR" "9286-DOJGF" "6994-KERXL" ...
 $ gender          : chr  "Male" "Female" "Female" "Male" ...
 $ SeniorCitizen   : int  0 0 1 0 0 0 0 0 0 1 ...
 $ Partner         : chr  "Yes" "No" "Yes" "No" ...
 $ Dependents      : chr  "Yes" "No" "No" "No" ...
 $ tenure          : int  72 44 38 4 2 70 33 1 39 55 ...
 $ PhoneService    : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ MultipleLines   : chr  "Yes" "No" "Yes" "No" ...
 $ InternetService : chr  "No" "Fiber optic" "Fiber optic" "DSL" ...
 $ OnlineSecurity  : chr  "No internet service" "No" "No" "No" ...
 $ OnlineBackup    : chr  "No internet service" "Yes" "No" "No" ...
 $ DeviceProtection: chr  "No internet service" "Yes" "No" "No" ...
 $ TechSupport     : chr  "No internet service" "No" "No" "No" ...
 $ StreamingTV     : chr  "No internet service" "Yes" "No" "No" ...
 $ StreamingMovies : chr  "No internet service" "No" "No" "Yes" ...
 $ Contract        : chr  "Two year" "Month-to-month" "Month-to-month" "Month-to-month" ...
 $ PaperlessBilling: chr  "No" "Yes" "Yes" "Yes" ...
 $ PaymentMethod   : chr  "Credit card (automatic)" "Credit card (automatic)" "Bank transfer (automatic)" "Electronic check" ...
 $ MonthlyCharges  : num  24.1 88.2 75 55.9 53.5 ...
 $ TotalCharges    : num  1735 3973 2870 238 120 ...
 $ Churn           : chr  "No" "No" "Yes" "No" ...

head(data, 2) # top 6 rows, use n = X, for viewing top X lines

A data.frame: 2 × 22
	X	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	...	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
	<int>	<chr>	<chr>	<int>	<chr>	<chr>	<int>	<chr>	<chr>	<chr>	...	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<chr>
1	1869	7010-BRBUU	Male	0	Yes	Yes	72	Yes	Yes	No	...	No internet service	No internet service	No internet service	No internet service	Two year	No	Credit card (automatic)	24.10	1734.65	No
2	4528	9688-YGXVR	Female	0	No	No	44	Yes	No	Fiber optic	...	Yes	No	Yes	No	Month-to-month	Yes	Credit card (automatic)	88.15	3973.20	No

is.data.frame(data) # if data is data.frame

TRUE

anyNA(data) # if dataframe contains any NA values

TRUE

lapply(data, anyNA)
#lapply(, any) #check NA by 2nd dimension - columns

$X: FALSE
$customerID: FALSE
$gender: FALSE
$SeniorCitizen: FALSE
$Partner: FALSE
$Dependents: FALSE
$tenure: FALSE
$PhoneService: FALSE
$MultipleLines: FALSE
$InternetService: FALSE
$OnlineSecurity: FALSE
$OnlineBackup: FALSE
$DeviceProtection: FALSE
$TechSupport: FALSE
$StreamingTV: FALSE
$StreamingMovies: FALSE
$Contract: FALSE
$PaperlessBilling: FALSE
$PaymentMethod: FALSE
$MonthlyCharges: TRUE
$TotalCharges: TRUE
$Churn: FALSE

Check MonthlyCharges: TRUE and TotalCharges: TRUE. These columns has NA-values.

Let’s replace them with mean:

data[is.na(data$TotalCharges), "TotalCharges"] <- mean(data$TotalCharges, na.rm = T)
data[is.na(data$MonthlyCharges), "MonthlyCharges"] <- mean(data$MonthlyCharges, na.rm = T)

any(is.na(data)) # check for NA

FALSE

You can write data with write.csv(), write.csv2() from base package.

write.csv(data, file = "../../data/cleaned_data.csv", row.names = F)
# by default row.names = TRUE and file will contain first column with row numbers 1,2, ..., N

ERROR: Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class '"function"' to a data.frame

Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class '"function"' to a data.frame
Traceback:

1. write.csv(data, file = "../../data/cleaned_data.csv", row.names = F)
2. eval.parent(Call)
3. eval(expr, p)
4. eval(expr, p)
5. utils::write.table(data, file = "../../data/cleaned_data.csv", 
 .     row.names = F, col.names = TRUE, sep = ",", dec = ".", qmethod = "double")
6. data.frame(x)
7. as.data.frame(x[[i]], optional = TRUE)
8. as.data.frame.default(x[[i]], optional = TRUE)
9. stop(gettextf("cannot coerce class %s to a data.frame", sQuote(deparse(class(x))[1L])), 
 .     domain = NA)

3.4 `readr` package

One more useful package is readr. Examples of using:

# library(readr)
# data <- read_csv(file = "../../data/telecom_users.csv")
# data <- read_csv2(file = "../../data/telecom_users.csv")`

3.5 Набори даних

https://github.com/kleban/r-book-published/tree/main/datasets/telecom_users.csv
https://github.com/kleban/r-book-published/tree/main/datasets/telecom_sers.xlsx
https://github.com/kleban/r-book-published/tree/main/datasets/Default_Fin.csv
https://github.com/kleban/r-book-published/tree/main/datasets/employes.xml

3.6 References

SQLite in R. Datacamp
Tidyverse googlesheets4 0.2.0
Binanace spot Api Docs
Web Scraping in R: rvest Tutorial by Arvid Kingl