17  Data Frames [EN]

Author

Юрій Клебан

17.1 What are data frames?

Data frames are the most popular data structure in R because they let you collect columns of different types in one object and manipulate them quickly.

A data frame is a matrix-like structure whose columns may be of differing types (numeric, logical, factor, character, and so on).

The function data.frame() creates data frames, tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R’s modeling software.

Syntax

data.frame(..., row.names = NULL, check.rows = FALSE,
           check.names = TRUE, fix.empty.names = TRUE,
           stringsAsFactors = FALSE)

Arguments (top useful)

  • ... - these arguments are of either the form value or tag = value. Component names are created based on the tag (if present) or the deparsed argument itself.
  • row.names - NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.
  • stringsAsFactors - logical: should character vectors be converted to factors? The ‘factory-fresh’ default was TRUE in earlier versions of R but changed to FALSE in R 4.0.0.
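
As a minimal sketch of these arguments (the object names demo_args and demo_fct are illustrative and not part of the chapter): an untagged argument takes its column name from the expression, a tagged one from the tag, and row.names and stringsAsFactors behave as described in the list.

```r
# Illustrative names only: demo_args and demo_fct are not part of the chapter.
x <- 1:3

# `x` is passed without a tag, so the column name comes from the expression;
# `label` is passed as tag = value, so the tag becomes the column name.
demo_args <- data.frame(x,
                        label = c("a", "b", "c"),
                        row.names = c("r1", "r2", "r3"),
                        stringsAsFactors = FALSE)

names(demo_args)        # column names
rownames(demo_args)     # row names given by the character vector
class(demo_args$label)  # stays character

# With stringsAsFactors = TRUE the character column becomes a factor
demo_fct <- data.frame(label = c("a", "b", "c"), stringsAsFactors = TRUE)
class(demo_fct$label)
```

Setting stringsAsFactors explicitly, as here, keeps the result the same across R versions.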

Details

A data frame is a list of variables of the same number of rows with unique row names, given class data.frame. If no variables are included, the row names determine the number of rows.

data.frame converts each of its arguments to a data frame by calling as.data.frame(optional = TRUE). As that is a generic function, methods can be written to change the behaviour of arguments according to their classes: R comes with many such methods. Character variables passed to data.frame are converted to factor columns when stringsAsFactors is TRUE, unless they are protected by I(). If a list or data frame or matrix is passed to data.frame it is as if each component or column had been passed as a separate argument.
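
The details above can be checked with a short sketch. The data and the names people, parts, from_list and protected are hypothetical and used only for illustration.

```r
# Hypothetical example data, used only for illustration
people <- data.frame(name = c("Ann", "Bob"), age = c(31, 27))

is.list(people)    # a data frame is a list of columns
class(people)      # with class "data.frame"
length(people)     # list length equals the number of columns

# A list passed to data.frame() is expanded into one column per component
parts <- list(a = 1:2, b = c("x", "y"))
from_list <- data.frame(parts, flag = c(TRUE, FALSE))
names(from_list)

# I() protects a character vector from conversion to factor
protected <- data.frame(txt = I(c("u", "v")), stringsAsFactors = TRUE)
is.factor(protected$txt)
```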


17.2 Creating Data Frames

Data frames are usually created by reading a dataset from a file or by scraping data from websites. However, data frames can also be created explicitly with the data.frame() function, or they can be coerced from other types of objects like lists. In this case I’ll create a simple data frame df and assess its basic structure:

df <- data.frame(id = 1:5,
                char_col = c("a", "b", "c", "d", "e"),
                log_col = c(T,T,T,F,T),
                double_col = c(2.1, 1, 0.5, pi, 12.7))
df
A data.frame: 5 × 4
id     char_col  log_col  double_col
<int>  <chr>     <lgl>    <dbl>
1      a         TRUE      2.100000
2      b         TRUE      1.000000
3      c         TRUE      0.500000
4      d         FALSE     3.141593
5      e         TRUE     12.700000
# assess the structure of a data frame
str(df)
'data.frame':   5 obs. of  4 variables:
 $ id        : int  1 2 3 4 5
 $ char_col  : chr  "a" "b" "c" "d" ...
 $ log_col   : logi  TRUE TRUE TRUE FALSE TRUE
 $ double_col: num  2.1 1 0.5 3.14 12.7
# number of rows
nrow(df)
5
# number of columns
ncol(df)
4

To convert character columns to factors on the fly, use stringsAsFactors = TRUE:

df <- data.frame(i = 1:5,
                char_col = c("a", "b", "c", "d", "e"),
                log_col = c(T,T,T,F,T),
                double_col = c(2.1, 1, 0.5, pi, 12.7),
                stringsAsFactors = TRUE) # explicit value; the default is FALSE since R 4.0.0
df
A data.frame: 5 × 4
i      char_col  log_col  double_col
<int>  <fct>     <lgl>    <dbl>
1      a         TRUE      2.100000
2      b         TRUE      1.000000
3      c         TRUE      0.500000
4      d         FALSE     3.141593
5      e         TRUE     12.700000
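
A short sketch of what the factor conversion means in practice (fct_demo is an illustrative name, not an object from the chapter): a factor stores its distinct values as levels and the data as integer codes, and it can be converted back to character when needed.

```r
# fct_demo is an illustrative object, not part of the chapter
fct_demo <- data.frame(char_col = c("a", "b", "c", "d", "e"),
                       stringsAsFactors = TRUE)

is.factor(fct_demo$char_col)   # the column was converted to a factor
levels(fct_demo$char_col)      # the distinct values
as.integer(fct_demo$char_col)  # internal integer codes of the levels

# Convert back to plain character when factor behaviour is not wanted
fct_demo$char_col <- as.character(fct_demo$char_col)
class(fct_demo$char_col)
```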

Data frames can also be created from lists (lists are explained in the next chapter). Here we collect several vectors of equal length into a list and convert it with as.data.frame():

v_int <- 1:5
v_char <- c("a", "b", "c", "d", "e")
v_log <- c(T,T,T,F,T)
v_double <- c(2.1, 1, 0.5, pi, 12.7)
demo_list <- list(int_col = v_int,
                  char_col = v_char,
                  log_col = v_log,
                  double_col = v_double)
as.data.frame(demo_list)
A data.frame: 5 × 4
int_col  char_col  log_col  double_col
<int>    <chr>     <lgl>    <dbl>
1        a         TRUE      2.100000
2        b         TRUE      1.000000
3        c         TRUE      0.500000
4        d         FALSE     3.141593
5        e         TRUE     12.700000

A matrix can serve as the basis for a data frame too:

demo_matrix <- matrix(100:119, nrow = 5, ncol = 4)
demo_matrix

as.data.frame(demo_matrix)
A matrix: 5 × 4 of type int
100  105  110  115
101  106  111  116
102  107  112  117
103  108  113  118
104  109  114  119
A data.frame: 5 × 4
V1     V2     V3     V4
<int>  <int>  <int>  <int>
100    105    110    115
101    106    111    116
102    107    112    117
103    108    113    118
104    109    114    119
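
The default column names V1, V2, ... appear because the matrix has no column names. As a small sketch (named_matrix, named_df and plain_df are illustrative names), column names set on the matrix carry over to the data frame, and default names can be replaced afterwards:

```r
# A matrix with column names keeps them after conversion
named_matrix <- matrix(1:6, nrow = 3,
                       dimnames = list(NULL, c("low", "high")))
named_df <- as.data.frame(named_matrix)
names(named_df)

# Default names (V1, V2) can be replaced after conversion
plain_df <- as.data.frame(matrix(1:6, nrow = 3))
names(plain_df) <- c("first", "second")
names(plain_df)
```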

17.3 Extending data frames

You can add rows and columns to a data frame. Merging two data frames by the values of a selected column is available too.

cbind() adds a new column:

df <-  data.frame(A1 = c("A", "B", "C"),
                  A2 = c("D", "E", "F"))
df

A3 <- c(1, 2, 3)
cbind(df, A3) # returns a new data frame; df itself is not changed


colnames(df)
colnames(df) <- c("B1", "B2")
colnames(df)
A data.frame: 3 × 2
A1     A2
<chr>  <chr>
A      D
B      E
C      F
A data.frame: 3 × 3
A1     A2     A3
<chr>  <chr>  <dbl>
A      D      1
B      E      2
C      F      3
  1. 'A1'
  2. 'A2'
  1. 'B1'
  2. 'B2'
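
Note that cbind() returns a new data frame and leaves its input untouched, which is why colnames(df) above still lists only two columns. A minimal sketch of keeping the added column (scores and A4 are illustrative names, not from the chapter):

```r
# scores is an illustrative object mirroring the example above
scores <- data.frame(A1 = c("A", "B", "C"),
                     A2 = c("D", "E", "F"))
A3 <- c(1, 2, 3)

cbind(scores, A3)   # result is printed but not stored
ncol(scores)        # scores still has 2 columns

# Store the result to keep the new column
scores <- cbind(scores, A3)

# Assigning to a new name with $ adds a column in place
scores$A4 <- scores$A3 * 10
names(scores)
```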

rbind() adds a new row:

letters_frame <-  data.frame(A1 = c("A", "B", "C"),
                            A2 = 1:3)
letters_frame

# c("D", 4) is a character vector, so rbind() turns column A2 into <chr>;
# to keep column types, the new row must have the same types as the data frame
next_row <- c("D", 4)
rbind(letters_frame, next_row)
A data.frame: 3 × 2
A1     A2
<chr>  <int>
A      1
B      2
C      3
A data.frame: 4 × 2
A1     A2
<chr>  <chr>
A      1
B      2
C      3
D      4
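
In the output above column A2 changed from <int> to <chr>, because c("D", 4) is a character vector. One way to keep the column types, sketched here with illustrative names (letters_typed, typed_row, extended), is to pass the new row as a one-row data frame:

```r
letters_typed <- data.frame(A1 = c("A", "B", "C"),
                            A2 = 1:3)

# A one-row data frame carries a separate type for each column
typed_row <- data.frame(A1 = "D", A2 = 4L)

extended <- rbind(letters_typed, typed_row)
str(extended)        # A1 stays character, A2 stays integer
```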

17.4 Merging data frames

Data frames can be merged by key with merge():

df1 <- data.frame(Id = c(1:4),
                  Name = c("Nick", "Jake", "Jane", "Mary"))
df1

df2 <- data.frame(Id = c(2, 1, 3, 5), # different order from Id in df1
                  Age = c(34, 21, 45, 20))
df2

df_final <- merge(df1, df2, by = "Id", all.x = F, all.y = F)
df_final
A data.frame: 4 × 2
Id     Name
<int>  <chr>
1      Nick
2      Jake
3      Jane
4      Mary
A data.frame: 4 × 2
Id     Age
<dbl>  <dbl>
2      34
1      21
3      45
5      20
A data.frame: 3 × 3
Id     Name   Age
<int>  <chr>  <dbl>
1      Nick   21
2      Jake   34
3      Jane   45
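
With all.x = F and all.y = F only the Ids present in both data frames survive. As a sketch of the other settings (people, ages, left_merged and full_merged are illustrative names), all.x = TRUE keeps every row of the first data frame and all = TRUE keeps rows from both, filling the gaps with NA:

```r
people <- data.frame(Id = 1:4,
                     Name = c("Nick", "Jake", "Jane", "Mary"))
ages <- data.frame(Id = c(2, 1, 3, 5),
                   Age = c(34, 21, 45, 20))

# Keep all rows of `people`; Mary (Id 4) has no match, so her Age is NA
left_merged <- merge(people, ages, by = "Id", all.x = TRUE)
left_merged

# Keep rows from both data frames; Id 5 has no Name, so it is NA
full_merged <- merge(people, ages, by = "Id", all = TRUE)
full_merged
```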

17.4.1 Subsetting Data Frames

Data frames possess the characteristics of both lists and matrices: if you subset with a single vector, they behave like lists and will return the selected columns with all rows; if you subset with two vectors, they behave like matrices and can be subset by row and column:

df <- data.frame(int_col = 1:5,
                char_col = c("a", "b", "c", "d", "e"),
                log_col = c(T,T,T,F,T),
                double_col = c(2.1, 1, 0.5, pi, 12.7),
                row.names = paste0("row_", 1:5), # setting row names
                stringsAsFactors = TRUE) # convert character columns to factors
df
A data.frame: 5 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_1  1        a         TRUE      2.100000
row_2  2        b         TRUE      1.000000
row_3  3        c         TRUE      0.500000
row_4  4        d         FALSE     3.141593
row_5  5        e         TRUE     12.700000
# select columns using $ sign
df$log_col
  1. TRUE
  2. TRUE
  3. TRUE
  4. FALSE
  5. TRUE
# subsetting by row numbers
df[1, ] # first row
df[nrow(df), ] # last row
df[-1, ] # all except first row
A data.frame: 1 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_1  1        a         TRUE     2.1
A data.frame: 1 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_5  5        e         TRUE     12.7
A data.frame: 4 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_2  2        b         TRUE      1.000000
row_3  3        c         TRUE      0.500000
row_4  4        d         FALSE     3.141593
row_5  5        e         TRUE     12.700000
# subsetting by row names
df[c("row_4", "row_5"), ]
A data.frame: 2 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_4  4        d         FALSE     3.141593
row_5  5        e         TRUE     12.700000
# subsetting columns by name (matrix-style: all rows, selected columns)
df[, c("log_col", "double_col")]
A data.frame: 5 × 2
       log_col  double_col
       <lgl>    <dbl>
row_1  TRUE      2.100000
row_2  TRUE      1.000000
row_3  TRUE      0.500000
row_4  FALSE     3.141593
row_5  TRUE     12.700000
# subset for both rows and columns
df[2:5, c(1, 3:4)]
A data.frame: 4 × 3
       int_col  log_col  double_col
       <int>    <lgl>    <dbl>
row_2  2        TRUE      1.000000
row_3  3        TRUE      0.500000
row_4  4        FALSE     3.141593
row_5  5        TRUE     12.700000
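
The list-like and matrix-like behaviour differs in what comes back when a single column is selected. A small sketch (sub_demo is an illustrative object, not from the chapter):

```r
sub_demo <- data.frame(int_col = 1:3,
                       char_col = c("a", "b", "c"))

sub_demo["int_col"]                  # list-style: a one-column data frame
sub_demo[["int_col"]]                # list-style extraction: a plain vector
sub_demo[, "int_col"]                # matrix-style: simplified to a vector
sub_demo[, "int_col", drop = FALSE]  # matrix-style, kept as a data frame
```

Using drop = FALSE is a common way to make sure the result stays a data frame even when only one column is selected.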

You can also subset data frames based on conditional statements:

# select rows with double_col > 1
df[df$double_col > 1, ]

# select rows where log_col is FALSE
df[!df$log_col, ]
A data.frame: 3 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_1  1        a         TRUE      2.100000
row_4  4        d         FALSE     3.141593
row_5  5        e         TRUE     12.700000
A data.frame: 1 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_4  4        d         FALSE    3.141593
"s" %in% c("s", "t")
TRUE
# select only with char_col == 'a', 'e'
chars <- df$char_col %in% c("a", "e")
chars
sum(chars)
df[chars, ] # the %in% operator checks membership in a set of values
  1. TRUE
  2. FALSE
  3. FALSE
  4. FALSE
  5. TRUE
2
A data.frame: 2 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_1  1        a         TRUE      2.1
row_5  5        e         TRUE     12.7
# select only with double_col > 1 and log_col == TRUE
df[df$log_col == TRUE & df$double_col > 1, ]
A data.frame: 2 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_1  1        a         TRUE      2.1
row_5  5        e         TRUE     12.7
# select only specific columns with double_col > 1 and log_col == TRUE
df[df$log_col == TRUE & df$double_col > 1, c("log_col", "int_col", "double_col")]
A data.frame: 2 × 3
       log_col  int_col  double_col
       <lgl>    <int>    <dbl>
row_1  TRUE     1         2.1
row_5  TRUE     5        12.7
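
Base R also offers subset() for the same kind of filtering. The sketch below uses illustrative objects (cond_demo, picked, with_na) and also shows one practical difference: bracket indexing with a condition that evaluates to NA returns a row of NAs, while subset() drops such rows.

```r
cond_demo <- data.frame(int_col = 1:5,
                        log_col = c(TRUE, TRUE, TRUE, FALSE, TRUE),
                        double_col = c(2.1, 1, 0.5, pi, 12.7))

# Same filter as above, written with subset(); columns are named without df$
picked <- subset(cond_demo,
                 log_col & double_col > 1,
                 select = c(log_col, int_col, double_col))
picked

# Missing values: `[` keeps an all-NA row where the condition is NA,
# subset() treats NA in the condition as FALSE
with_na <- data.frame(v = c(1, NA, 3))
nrow(with_na[with_na$v > 1, , drop = FALSE])
nrow(subset(with_na, v > 1))
```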

17.5 Ordering data frames

Let’s use our previous sample data.frame but with unordered values:

df <- data.frame(int_col = c(1, 5, 3, 4, 2),
                char_col = c("b", "a", "a", "d", "e"),
                log_col = c(T,T,T,F,T),
                double_col = c(2.1, 1, 0.5, pi, 12.7),
                row.names = paste0("row_", 1:5), # setting row names
                stringsAsFactors = TRUE) # convert character columns to factors
df
A data.frame: 5 × 4
       int_col  char_col  log_col  double_col
       <dbl>    <fct>     <lgl>    <dbl>
row_1  1        b         TRUE      2.100000
row_2  5        a         TRUE      1.000000
row_3  3        a         TRUE      0.500000
row_4  4        d         FALSE     3.141593
row_5  2        e         TRUE     12.700000

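You can use the order() function to sort data frames. It returns the row indices that put a column in ascending order:

# row indices that sort char_col and int_col, then the data frame sorted by char_col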
order(df$char_col)
order(df$int_col)
df[order(df$char_col),]
  1. 2
  2. 3
  3. 1
  4. 4
  5. 5
  1. 1
  2. 5
  3. 3
  4. 4
  5. 2
A data.frame: 5 × 4
       int_col  char_col  log_col  double_col
       <dbl>    <fct>     <lgl>    <dbl>
row_2  5        a         TRUE      1.000000
row_3  3        a         TRUE      0.500000
row_1  1        b         TRUE      2.100000
row_4  4        d         FALSE     3.141593
row_5  2        e         TRUE     12.700000

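Use rev() to reverse the ordering and sort in descending order (for numeric columns you can also negate the values with a minus sign):

# sort by int_col in descending order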
df[rev(order(df$int_col)), ]
A data.frame: 5 × 4
       int_col  char_col  log_col  double_col
       <dbl>    <fct>     <lgl>    <dbl>
row_2  5        a         TRUE      1.000000
row_4  4        d         FALSE     3.141593
row_3  3        a         TRUE      0.500000
row_5  2        e         TRUE     12.700000
row_1  1        b         TRUE      2.100000

You can also sort by multiple columns with order(column1, column2), or with order(column1, -column2) to sort a numeric column2 in descending order.
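
A minimal sketch of sorting by more than one column, using illustrative objects (ord_demo, sorted, by_char_desc). The minus sign works only for numeric columns; for other types the decreasing argument of order() can be used instead.

```r
ord_demo <- data.frame(int_col = c(1, 5, 3, 4, 2),
                       char_col = c("b", "a", "a", "d", "e"))

# char_col ascending; ties in char_col are broken by int_col descending
sorted <- ord_demo[order(ord_demo$char_col, -ord_demo$int_col), ]
sorted

# Descending order for a character column
by_char_desc <- ord_demo[order(ord_demo$char_col, decreasing = TRUE), ]
by_char_desc
```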


17.5.1 Manipulating data.frames

  • type conversion
  • ifelse()
  • creating new columns (for example, calculating age; see ?lubridate)
  • removing missing values
  • replacing missing values
  • editing with DataEditR
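
Some of the topics listed above (type conversion, ifelse(), removing and replacing missing values) can be sketched in base R as follows. The data frame raw and its columns are hypothetical, and replacing missing values with the column mean is just one possible choice.

```r
# Hypothetical data: ids stored as text, two missing scores
raw <- data.frame(id = c("1", "2", "3", "4"),
                  score = c(10, NA, 7, NA),
                  stringsAsFactors = FALSE)

# Type conversion: character -> integer
raw$id <- as.integer(raw$id)

# New column with ifelse(); a missing score gives a missing result
raw$passed <- ifelse(raw$score >= 8, "yes", "no")

# Remove rows that contain any missing value
complete <- na.omit(raw)

# Replace missing scores with the mean of the observed ones
filled <- raw
filled$score[is.na(filled$score)] <- mean(filled$score, na.rm = TRUE)
filled
```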

17.5.2 Tasks

17.5.3 Task 1

Write code that evaluates $y = x^2 + e$, where x is a random number in range [0; 1].

Print the calculation result as a data.frame with columns X, E, Y.

Use the plot() function to visualize X vs Y as a line chart (type = "l" or "b").

Solution

# initiate data.frame
df <- data.frame(X = 1:10,
                 E = sample(5, 10, replace = T),
                 Y = NA)
head(df)
A data.frame: 6 × 3
   X      E      Y
   <int>  <int>  <lgl>
1  1      4      NA
2  2      1      NA
3  3      2      NA
4  4      3      NA
5  5      2      NA
6  6      2      NA
df$Y <- with(df, X^2 + E)
head(df)
A data.frame: 6 × 3
   X      E      Y
   <int>  <int>  <dbl>
1  1      4       5
2  2      1       5
3  3      2      11
4  4      3      19
5  5      2      27
6  6      2      38
plot(df$X, df$Y, type="l", col = "blue")
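
The solution above uses X = 1:10 and integer values for E. Under a more literal reading of the task, with x drawn at random from [0; 1], a sketch might look like the following. The sample size, the seed, and the choice of normally distributed e are assumptions made for illustration, not part of the task.

```r
set.seed(42)   # assumed seed, only to make the sketch reproducible
n <- 10        # assumed sample size

# X: random numbers from [0; 1], sorted so the line chart reads left to right
# E: small random noise (normal distribution is an assumption)
task_df <- data.frame(X = sort(runif(n, min = 0, max = 1)),
                      E = rnorm(n, mean = 0, sd = 0.05))
task_df$Y <- task_df$X^2 + task_df$E
task_df

# Points connected by lines
plot(task_df$X, task_df$Y, type = "b", col = "blue")
```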


17.5.4 Task 2

  1. Install and load the ISLR package.

  2. Save the dataset Credit into the variable credit_data.

  3. Check the dataset structure with the str() function.

  4. Convert Student status “Yes/No” to 1/0.

  5. Order the dataset by Rating in descending order.

  6. Filter only Age > 50 with Rating > 400. How many records do you get?

  7. Evaluate the average Income for Married = Yes and Married = No with Age in range [20; 30].

    7.1 Do the same for Age in range [30; 40]. Any conclusions?

Solution

# 1. Install and load the ISLR package (run install.packages() once if needed)
#install.packages("ISLR")
library(ISLR)
# 2. Save dataset `Credit` into variable `credit_data`.
credit_data <- ISLR::Credit
head(credit_data, 3)
A data.frame: 3 × 12
   ID     Income   Limit  Rating  Cards  Age    Education  Gender  Student  Married  Ethnicity  Balance
   <int>  <dbl>    <int>  <int>   <int>  <int>  <int>      <fct>   <fct>    <fct>    <fct>      <int>
1  1       14.891  3606   283     2      34     11         Male    No       Yes      Caucasian  333
2  2      106.025  6645   483     3      82     15         Female  Yes      Yes      Asian      903
3  3      104.593  7075   514     4      71     11         Male    No       No       Asian      580
# 3. Check dataset structure with `str()` function.
str(credit_data)
'data.frame':   400 obs. of  12 variables:
 $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Income   : num  14.9 106 104.6 148.9 55.9 ...
 $ Limit    : int  3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
 $ Rating   : int  283 483 514 681 357 569 259 512 266 491 ...
 $ Cards    : int  2 3 4 3 2 4 2 2 5 3 ...
 $ Age      : int  34 82 71 36 68 77 37 87 66 41 ...
 $ Education: int  11 15 11 11 16 10 12 9 13 19 ...
 $ Gender   : Factor w/ 2 levels " Male","Female": 1 2 1 2 1 1 2 1 2 2 ...
 $ Student  : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
 $ Married  : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 2 ...
 $ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1 2 3 1 ...
 $ Balance  : int  333 903 580 964 331 1151 203 872 279 1350 ...
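# 4. Convert Student status "Yes/No" to 1/0

# option 1: the factor codes are 1 ("No") and 2 ("Yes"), so subtracting 1 gives 0/1
head(as.numeric(credit_data$Student) - 1)

# option 2: convert the factor to character first, then recode with ifelse()
credit_data$Student <- as.character(credit_data$Student)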
credit_data$Student <- ifelse(credit_data$Student == "Yes", 1, 0)

head(credit_data)
  1. 0
  2. 1
  3. 0
  4. 0
  5. 0
  6. 0
A data.frame: 6 × 12
   ID     Income   Limit  Rating  Cards  Age    Education  Gender  Student  Married  Ethnicity  Balance
   <int>  <dbl>    <int>  <int>   <int>  <int>  <int>      <fct>   <dbl>    <fct>    <fct>      <int>
1  1       14.891  3606   283     2      34     11         Male    0        Yes      Caucasian  333
2  2      106.025  6645   483     3      82     15         Female  1        Yes      Asian      903
3  3      104.593  7075   514     4      71     11         Male    0        No       Asian      580
4  4      148.924  9504   681     3      36     11         Female  0        No       Asian      964
5  5       55.882  4897   357     2      68     16         Male    0        Yes      Caucasian  331
6  6       80.180  8047   569     4      77     10         Male    0        No       Caucasian  1151
# 5. Order dataset by `Rating` descending
credit_data <- credit_data[order(-credit_data$Rating), ]
head(credit_data)
A data.frame: 6 × 12
     ID     Income   Limit  Rating  Cards  Age    Education  Gender  Student  Married  Ethnicity         Balance
     <int>  <dbl>    <int>  <int>   <int>  <int>  <int>      <fct>   <dbl>    <fct>    <fct>             <int>
324  324    182.728  13913  982     4      98     17         Male    0        Yes      Caucasian         1999
29   29     186.634  13414  949     2      41     14         Female  0        Yes      African American  1809
356  356    180.682  11966  832     2      58     8          Female  0        Yes      African American  1405
86   86     152.298  12066  828     4      41     12         Female  0        Yes      Asian             1779
294  294    140.672  11200  817     7      46     9          Male    0        Yes      African American  1677
185  185    158.889  11589  805     1      62     17         Female  0        Yes      Caucasian         1448
# 6. Filter only `Age > 50` with `Rating > 400`
credit_data_filtered <- credit_data[credit_data$Age > 50 & credit_data$Rating > 400, ]
head(credit_data_filtered)
nrow(credit_data_filtered)
A data.frame: 6 × 12
     ID     Income   Limit  Rating  Cards  Age    Education  Gender  Student  Married  Ethnicity         Balance
     <int>  <dbl>    <int>  <int>   <int>  <int>  <int>      <fct>   <dbl>    <fct>    <fct>             <int>
324  324    182.728  13913  982     4      98     17         Male    0        Yes      Caucasian         1999
356  356    180.682  11966  832     2      58     8          Female  0        Yes      African American  1405
185  185    158.889  11589  805     1      62     17         Female  0        Yes      Caucasian         1448
348  348    160.231  10748  754     2      69     17         Male    0        No       Caucasian         1192
175  175    121.834  10673  750     3      54     16         Male    0        No       African American  1573
391  391    135.118  10578  747     3      81     15         Female  0        Yes      Asian             1393
73
# 7. Evaluate average `Income` for `Married = Yes` and `Married = No` with Age in range [20; 30]
# 7.1 Do the same for Age [30; 40]
married <- with(credit_data, credit_data[(Age>=20 & Age <=30) & Married == "Yes", ])
head(married)
A data.frame: 6 × 12
     ID     Income  Limit  Rating  Cards  Age    Education  Gender  Student  Married  Ethnicity         Balance
     <int>  <dbl>   <int>  <int>   <int>  <int>  <int>      <fct>   <dbl>    <fct>    <fct>             <int>
11   11     63.095  8117   589     4      30     14         Male    0        Yes      Caucasian         1407
114  114    69.251  6386   474     4      30     12         Female  0        Yes      Asian             768
45   45     31.861  6375   469     3      25     16         Female  0        Yes      Caucasian         1120
19   19     49.570  6384   448     1      28     9          Female  0        Yes      Asian             891
44   44     36.929  6257   445     1      24     14         Female  0        Yes      Asian             976
151  151    63.931  5728   435     3      28     14         Female  0        Yes      African American  581
not_married <- credit_data[credit_data$Age %in% c(20:30) & credit_data$Married == "No", ]
head(not_married)
A data.frame: 6 × 12
     ID     Income  Limit  Rating  Cards  Age    Education  Gender  Student  Married  Ethnicity         Balance
     <int>  <dbl>   <int>  <int>   <int>  <int>  <int>      <fct>   <dbl>    <fct>    <fct>             <int>
212  212    29.567  5309   397     3      25     15         Male    0        No       Caucasian         799
94   94     16.479  5435   388     2      26     16         Male    0        No       African American  937
272  272    44.978  4866   347     1      30     10         Female  0        No       Caucasian         436
206  206    10.793  3878   321     8      29     13         Male    0        No       Caucasian         638
179  179    28.316  4391   316     2      29     10         Female  0        No       Caucasian         453
186  186    30.420  4442   316     1      30     14         Female  0        No       African American  450
mean(married$Income) # is it better to be married? :)
mean(not_married$Income)
35.25635
25.1618333333333
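
The two group means can also be computed in one step with aggregate(). The sketch below uses a small hypothetical data frame (toy_credit, with made-up values) instead of the Credit dataset, so it runs without the ISLR package.

```r
# Hypothetical data with the same column names as in the task
toy_credit <- data.frame(Age = c(22, 28, 25, 30, 35, 38),
                         Married = c("Yes", "No", "Yes", "No", "Yes", "No"),
                         Income = c(30, 20, 40, 26, 50, 44))

# Range comparison also works for non-integer ages,
# unlike Age %in% 20:30, which matches whole numbers only
young <- toy_credit[toy_credit$Age >= 20 & toy_credit$Age <= 30, ]

# Mean Income for each value of Married
avg_income <- aggregate(Income ~ Married, data = young, FUN = mean)
avg_income
```

The same call with a different age filter covers step 7.1.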
