4  MS Excel (xlsx)

Автор

Юрій Клебан

You need this packages for code execution:

# install.packages("xlsx")
invisible(Sys.setlocale("LC_ALL", "Ukrainian"))
invisible(options(warn=-1))

4.1 XLSX-format

There are many packages to read/write MS Excel files. xlsx one of the most useful.

# install.packages("xlsx") #install before use it
library(xlsx)
any(grepl("xlsx", installed.packages())) # check if package installed
TRUE

?read.xlsx - review package functions and params

Let’s read the data telecom_users.xlsx:

data <- read.xlsx("../../data/telecom_users.xlsx", sheetIndex = 1)
# sheetIndex = 1 - select sheet to read, or use sheetName = "sheet1" to read by Name
head(data)
A data.frame: 6 × 21
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> ... <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
1 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service ... No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.10 1734.65 No
2 9688-YGXVR Female 0 No No 44 Yes No Fiber optic No ... Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
3 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic No ... No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
4 6994-KERXL Male 0 No No 4 Yes No DSL No ... No No No Yes Month-to-month Yes Electronic check 55.90 238.50 No
5 2181-UAESM Male 0 No No 2 Yes No DSL Yes ... Yes No No No Month-to-month No Electronic check 53.45 119.50 No
6 4312-GVYNH Female 0 Yes No 70 No No phone service DSL Yes ... Yes Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No
# You can also use startRow, endRow and other params to define how much data read
data <- read.xlsx("../../data/telecom_users.xlsx", sheetIndex = 1, endRow = 100)
head(data)
A data.frame: 6 × 21
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
<chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> ... <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
1 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No No internet service ... No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.10 1734.65 No
2 9688-YGXVR Female 0 No No 44 Yes No Fiber optic No ... Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
3 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic No ... No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
4 6994-KERXL Male 0 No No 4 Yes No DSL No ... No No No Yes Month-to-month Yes Electronic check 55.90 238.50 No
5 2181-UAESM Male 0 No No 2 Yes No DSL Yes ... Yes No No No Month-to-month No Electronic check 53.45 119.50 No
6 4312-GVYNH Female 0 Yes No 70 No No phone service DSL Yes ... Yes Yes No Yes Two year Yes Bank transfer (automatic) 49.85 3370.20 No

Let’s replace Churn values Yes/No by 1/0:

head(data$Churn)
  1. 'No'
  2. 'No'
  3. 'Yes'
  4. 'No'
  5. 'No'
  6. 'No'
data$Churn <- ifelse(data$Churn == "Yes", 1, 0)
head(data$Churn)
  1. 0
  2. 0
  3. 1
  4. 0
  5. 0
  6. 0

Write final data to excel:

write.xlsx(data, file = "../../data/final_telecom_data.xlsx")

4.2 Task 1

Download from kaggle.com and read dataset Default_Fin.csv: https://www.kaggle.com/kmldas/loan-default-prediction

Description:

This is a synthetic dataset created using actual data from a financial institution. The data has been modified to remove identifiable features and the numbers transformed to ensure they do not link to original source (financial institution).

This is intended to be used for academic purposes for beginners who want to practice financial analytics from a simple financial dataset

  1. Check what columns has missing values
  2. Count default and non-default clients / and parts of total clients in %
  3. Count Employed clients
  4. Count Employed Default clients
  5. Average salary by Employed clients
  6. Rename columns to “id”, “empl”, “balance”, “salary”, “default”

Solution for Task 1

data <- read.csv("../../data/Default_Fin.csv")
head(data)
A data.frame: 6 × 5
Index Employed Bank.Balance Annual.Salary Defaulted.
<int> <int> <dbl> <dbl> <int>
1 1 1 8754.36 532339.56 0
2 2 0 9806.16 145273.56 0
3 3 1 12882.60 381205.68 0
4 4 1 6351.00 428453.88 0
5 5 1 9427.92 461562.00 0
6 6 0 11035.08 89898.72 0
  1. Check what columns has missing values
anyNA(data)
FALSE
  1. Count default and non-default clients / and parts of total clients in %
def_count <- nrow(data[data$Defaulted. == 1, ])
no_def_count <- nrow(data[data$Defaulted. == 0, ])
def_count
no_def_count 
333
9667
def_count / nrow(data) * 100 # part defaults
no_def_count / nrow(data) * 100 # part non-defaults
3.33
96.67
  1. Count Employed clients
empl <- data[data$Employed == 1, ]
nrow(empl)
7056
  1. Count Employed Default clients
empl <- data[data$Employed == 1 & data$Defaulted. == 1, ]
nrow(empl)
206
  1. Average salary by Employed clients
empl <- data[data$Employed == 1, ]
mean(empl$Annual.Salary)
480143.43414966
  1. Rename columns to “id”, “empl”, “balance”, “salary”, “default”:
colnames(data) <- c("id", "empl", "balance", "salary", "default")
head(data)
A data.frame: 6 × 5
id empl balance salary default
<int> <int> <dbl> <dbl> <int>
1 1 1 8754.36 532339.56 0
2 2 0 9806.16 145273.56 0
3 3 1 12882.60 381205.68 0
4 4 1 6351.00 428453.88 0
5 5 1 9427.92 461562.00 0
6 6 0 11035.08 89898.72 0

4.3 Набори даних

  1. https://github.com/kleban/r-book-published/tree/main/datasets/telecom_users.csv
  2. https://github.com/kleban/r-book-published/tree/main/datasets/telecom_sers.xlsx
  3. https://github.com/kleban/r-book-published/tree/main/datasets/Default_Fin.csv
  4. https://github.com/kleban/r-book-published/tree/main/datasets/employes.xml

4.4 References

  1. SQLite in R. Datacamp
  2. Tidyverse googlesheets4 0.2.0
  3. Binanace spot Api Docs
  4. Web Scraping in R: rvest Tutorial by Arvid Kingl