Data frames are the most popular data structure in R because they let you collect columns of different types in one object and manipulate them quickly.
A data frame is a matrix-like structure whose columns may be of differing types (numeric, logical, factor, character and so on).
The function data.frame() creates data frames, tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R’s modeling software.
... - these arguments are of either the form value or tag = value. Component names are created based on the tag (if present) or the deparsed argument itself.
row.names - NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.
stringsAsFactors - logical: should character vectors be converted to factors? The ‘factory-fresh’ default has been TRUE previously but has been changed to FALSE for R 4.0.0.
Details
A data frame is a list of variables of the same number of rows with unique row names, given class data.frame. If no variables are included, the row names determine the number of rows.
data.frame converts each of its arguments to a data frame by calling as.data.frame(optional = TRUE). As that is a generic function, methods can be written to change the behaviour of arguments according to their classes: R comes with many such methods. Character variables passed to data.frame are converted to factor columns unless protected by I() or argument stringsAsFactors is false. If a list or data frame or matrix is passed to data.frame it is as if each component or column had been passed as a separate argument.
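To make the effect of stringsAsFactors and I() concrete, here is a minimal sketch. The object and column names (chr_df, fct_df, asis_df, x, y) are made up for illustration and do not come from the R documentation.

```r
# Hypothetical example data; all names are illustrative only
chr_df  <- data.frame(x = c("a", "b"), y = 1:2, stringsAsFactors = FALSE)
fct_df  <- data.frame(x = c("a", "b"), y = 1:2, stringsAsFactors = TRUE)
# I() protects a character vector from conversion even when stringsAsFactors = TRUE
asis_df <- data.frame(x = I(c("a", "b")), y = 1:2, stringsAsFactors = TRUE)

class(chr_df$x)   # character column
class(fct_df$x)   # factor column
class(asis_df$x)  # "AsIs": still character data, not a factor
```

Setting stringsAsFactors explicitly, as done here, keeps the result the same across R versions.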
17.2 Creating Data Frames
Data frames are usually created by reading in a dataset from a file or by scraping data from websites. However, data frames can also be created explicitly with the data.frame() function, or they can be coerced from other types of objects such as lists. Here I’ll create a simple data frame letters_frame, print it, and append a row with rbind():
letters_frame <- data.frame(A1 = c("A", "B", "C"), A2 = 1:3)
letters_frame
# data types in the new row should match the initial data frame;
# c() turns both values into character here, so A2 becomes <chr> after rbind()
next_row <- c("D", 4)
rbind(letters_frame, next_row)

A data.frame: 3 × 2
A1     A2
<chr>  <int>
A      1
B      2
C      3

A data.frame: 4 × 2
A1     A2
<chr>  <chr>
A      1
B      2
C      3
D      4

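In the second output above, A2 changes from <int> to <chr> because the new row was built with c(), which holds values of a single type. One way to keep the column types, sketched here under the assumption that the new row can be written as a one-row data frame (the name extended is illustrative):

```r
letters_frame <- data.frame(A1 = c("A", "B", "C"), A2 = 1:3)

# Build the new row as a one-row data frame so each column keeps its own type
next_row <- data.frame(A1 = "D", A2 = 4L)
extended <- rbind(letters_frame, next_row)

str(extended)  # A2 stays an integer column
```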
17.4 Merge DF
Data frames can be merged by key with merge():
df1 <- data.frame(Id = c(1:4),
                  Name = c("Nick", "Jake", "Jane", "Mary"))
df1
df2 <- data.frame(Id = c(2, 1, 3, 5), # different order from Id in df1
                  Age = c(34, 21, 45, 20))
df2
# all.x = FALSE and all.y = FALSE keep only the Ids present in both data frames
df_final <- merge(df1, df2, by = "Id", all.x = FALSE, all.y = FALSE)
df_final

A data.frame: 4 × 2
Id     Name
<int>  <chr>
1      Nick
2      Jake
3      Jane
4      Mary

A data.frame: 4 × 2
Id     Age
<dbl>  <dbl>
2      34
1      21
3      45
5      20

A data.frame: 3 × 3
Id     Name   Age
<int>  <chr>  <dbl>
1      Nick   21
2      Jake   34
3      Jane   45

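The example above keeps only the Ids found in both data frames. As a short sketch of the other options of merge(), all.x = TRUE keeps every row of the first data frame and all = TRUE keeps every row of both; unmatched cells are filled with NA. The names df_left and df_full are illustrative.

```r
df1 <- data.frame(Id = 1:4, Name = c("Nick", "Jake", "Jane", "Mary"))
df2 <- data.frame(Id = c(2, 1, 3, 5), Age = c(34, 21, 45, 20))

df_left <- merge(df1, df2, by = "Id", all.x = TRUE)  # every row of df1; Mary gets Age NA
df_full <- merge(df1, df2, by = "Id", all = TRUE)    # every row of both; Id 5 gets Name NA

df_left
df_full
```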
17.4.1 Subsetting Data Frames
Data frames possess the characteristics of both lists and matrices: if you subset with a single vector, they behave like lists and will return the selected columns with all rows; if you subset with two vectors, they behave like matrices and can be subset by row and column:
df <- data.frame(int_col = 1:5,
                 char_col = c("a", "b", "c", "d", "e"),
                 log_col = c(TRUE, TRUE, TRUE, FALSE, TRUE),
                 double_col = c(2.1, 1, 0.5, pi, 12.7),
                 row.names = paste0("row_", 1:5), # setting row names
                 stringsAsFactors = TRUE) # set explicitly: the default differs between R versions
df

A data.frame: 5 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_1  1        a         TRUE     2.100000
row_2  2        b         TRUE     1.000000
row_3  3        c         TRUE     0.500000
row_4  4        d         FALSE    3.141593
row_5  5        e         TRUE     12.700000

# select columns using $ sign
df$log_col

TRUE TRUE TRUE FALSE TRUE

# subsetting by row numbers
df[1, ] # first row
df[nrow(df), ] # last row
df[-1, ] # all except first row

A data.frame: 1 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_1  1        a         TRUE     2.1

A data.frame: 1 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_5  5        e         TRUE     12.7

A data.frame: 4 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_2  2        b         TRUE     1.000000
row_3  3        c         TRUE     0.500000
row_4  4        d         FALSE    3.141593
row_5  5        e         TRUE     12.700000

# subsetting by row names
df[c("row_4", "row_5"), ]

A data.frame: 2 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_4  4        d         FALSE    3.141593
row_5  5        e         TRUE     12.700000

# subsetting columns by name (the empty row index keeps all rows)
df[, c("log_col", "double_col")]

A data.frame: 5 × 2
       log_col  double_col
       <lgl>    <dbl>
row_1  TRUE     2.100000
row_2  TRUE     1.000000
row_3  TRUE     0.500000
row_4  FALSE    3.141593
row_5  TRUE     12.700000

# subset for both rows and columns
df[2:5, c(1, 3:4)]

A data.frame: 4 × 3
       int_col  log_col  double_col
       <int>    <lgl>    <dbl>
row_2  2        TRUE     1.000000
row_3  3        TRUE     0.500000
row_4  4        FALSE    3.141593
row_5  5        TRUE     12.700000

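The difference between list-style and matrix-style subsetting is easiest to see with a single column. This is a minimal sketch on a reduced copy of the data frame; the result names (list_style, matrix_style, kept, extracted) are illustrative.

```r
df <- data.frame(int_col = 1:5,
                 log_col = c(TRUE, TRUE, TRUE, FALSE, TRUE))

list_style   <- df["int_col"]                  # one index: list-style, result stays a data frame
matrix_style <- df[, "int_col"]                # two indices, one column: simplified to a vector
kept         <- df[, "int_col", drop = FALSE]  # drop = FALSE keeps the data frame
extracted    <- df[["int_col"]]                # [[ ]] extracts the column itself, like $

class(list_style)
class(matrix_style)
```

Using drop = FALSE is a common safeguard in code where the number of selected columns is not known in advance.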
You can also subset data frames based on conditional statements:
df[df$double_col > 1, ] # rows with double_col > 1
df[!df$log_col, ] # rows with log_col == FALSE

A data.frame: 3 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_1  1        a         TRUE     2.100000
row_4  4        d         FALSE    3.141593
row_5  5        e         TRUE     12.700000

A data.frame: 1 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_4  4        d         FALSE    3.141593

"s" %in% c("s", "t")

TRUE

# select only with char_col == 'a', 'e'
chars <- df$char_col %in% c("a", "e") # the %in% operator checks against multiple values
chars
sum(chars)
df[chars, ]

TRUE FALSE FALSE FALSE TRUE

2

A data.frame: 2 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_1  1        a         TRUE     2.1
row_5  5        e         TRUE     12.7

# select only with double_col > 1 and log_col == TRUE
df[df$log_col == TRUE & df$double_col > 1, ]

A data.frame: 2 × 4
       int_col  char_col  log_col  double_col
       <int>    <fct>     <lgl>    <dbl>
row_1  1        a         TRUE     2.1
row_5  5        e         TRUE     12.7

# select only specific columns with double_col > 1 and log_col == TRUE
df[df$log_col == TRUE & df$double_col > 1, c("log_col", "int_col", "double_col")]

A data.frame: 2 × 3
       log_col  int_col  double_col
       <lgl>    <int>    <dbl>
row_1  TRUE     1        2.1
row_5  TRUE     5        12.7

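Base R also offers subset(), which can express the same kind of filter without repeating the data frame name. This is a sketch of an alternative, not a replacement for the bracket syntax used above; the name filtered is illustrative.

```r
df <- data.frame(int_col = 1:5,
                 log_col = c(TRUE, TRUE, TRUE, FALSE, TRUE),
                 double_col = c(2.1, 1, 0.5, pi, 12.7),
                 row.names = paste0("row_", 1:5))

# Rows with log_col == TRUE and double_col > 1, keeping three columns in the given order
filtered <- subset(df, log_col & double_col > 1,
                   select = c(log_col, int_col, double_col))
filtered
```

One difference worth knowing: subset() treats a missing value in the condition as FALSE and drops the row, whereas indexing with [ returns a row of NAs for it.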
17.5 Order data.frame
Let’s use our previous sample data.frame but with unordered values:
df <- data.frame(int_col = c(1, 5, 3, 4, 2),
                 char_col = c("b", "a", "a", "d", "e"),
                 log_col = c(TRUE, TRUE, TRUE, FALSE, TRUE),
                 double_col = c(2.1, 1, 0.5, pi, 12.7),
                 row.names = paste0("row_", 1:5), # setting row names
                 stringsAsFactors = TRUE) # set explicitly: the default differs between R versions
df

A data.frame: 5 × 4
       int_col  char_col  log_col  double_col
       <dbl>    <fct>     <lgl>    <dbl>
row_1  1        b         TRUE     2.100000
row_2  5        a         TRUE     1.000000
row_3  3        a         TRUE     0.500000
row_4  4        d         FALSE    3.141593
row_5  2        e         TRUE     12.700000

You can use the order() function to sort data frames. It returns the row positions that put a column in ascending order:
order(df$char_col)
order(df$int_col)
# sort by char_col
df[order(df$char_col), ]

2 3 1 4 5

1 5 3 4 2

A data.frame: 5 × 4
       int_col  char_col  log_col  double_col
       <dbl>    <fct>     <lgl>    <dbl>
row_2  5        a         TRUE     1.000000
row_3  3        a         TRUE     0.500000
row_1  1        b         TRUE     2.100000
row_4  4        d         FALSE    3.141593
row_5  2        e         TRUE     12.700000

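To sort in descending order, reverse the result of order() with rev() or put a minus sign in front of a numeric column:
# sort by int_col, descending
df[rev(order(df$int_col)), ] # same rows as df[order(-df$int_col), ] when int_col has no ties

A data.frame: 5 × 4
       int_col  char_col  log_col  double_col
       <dbl>    <fct>     <lgl>    <dbl>
row_2  5        a         TRUE     1.000000
row_4  4        d         FALSE    3.141593
row_3  3        a         TRUE     0.500000
row_5  2        e         TRUE     12.700000
row_1  1        b         TRUE     2.100000
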
You can also sort by multiple columns with order(column1, column2) or order(column1, -column2).
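A minimal sketch of sorting by two columns, using a reduced copy of the data frame from this section. The result names sorted_asc and sorted_mixed are illustrative.

```r
df <- data.frame(int_col = c(1, 5, 3, 4, 2),
                 char_col = c("b", "a", "a", "d", "e"),
                 row.names = paste0("row_", 1:5))

# char_col ascending, ties broken by int_col ascending
sorted_asc <- df[order(df$char_col, df$int_col), ]

# char_col ascending, ties broken by int_col descending
sorted_mixed <- df[order(df$char_col, -df$int_col), ]

rownames(sorted_asc)    # row_3 comes before row_2: both have "a", and 3 < 5
rownames(sorted_mixed)  # row_2 comes before row_3: both have "a", and 5 > 3
```

The minus sign only works for numeric columns. For a character column, order() accepts a decreasing argument instead.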
# Convert Student status "Yes"/"No" to 1/0
head(as.numeric(credit_data$Student) - 1) # works on the factor codes directly
credit_data$Student <- as.character(credit_data$Student) # convert the factor to character first
credit_data$Student <- ifelse(credit_data$Student == "Yes", 1, 0)
head(credit_data)

0 1 0 0 0 0

A data.frame: 6 × 12
     ID     Income   Limit  Rating  Cards  Age    Education  Gender  Student  Married  Ethnicity  Balance
     <int>  <dbl>    <int>  <int>   <int>  <int>  <int>      <fct>   <dbl>    <fct>    <fct>      <int>
1    1      14.891   3606   283     2      34     11         Male    0        Yes      Caucasian  333
2    2      106.025  6645   483     3      82     15         Female  1        Yes      Asian      903
3    3      104.593  7075   514     4      71     11         Male    0        No       Asian      580
4    4      148.924  9504   681     3      36     11         Female  0        No       Asian      964
5    5      55.882   4897   357     2      68     16         Male    0        Yes      Caucasian  331
6    6      80.180   8047   569     4      77     10         Male    0        No       Caucasian  1151

# 5. Order dataset by `Rating` descending
credit_data <- credit_data[order(-credit_data$Rating), ]
head(credit_data)

A data.frame: 6 × 12
     ID     Income   Limit  Rating  Cards  Age    Education  Gender  Student  Married  Ethnicity         Balance
     <int>  <dbl>    <int>  <int>   <int>  <int>  <int>      <fct>   <dbl>    <fct>    <fct>             <int>
324  324    182.728  13913  982     4      98     17         Male    0        Yes      Caucasian         1999
29   29     186.634  13414  949     2      41     14         Female  0        Yes      African American  1809
356  356    180.682  11966  832     2      58     8          Female  0        Yes      African American  1405
86   86     152.298  12066  828     4      41     12         Female  0        Yes      Asian             1779
294  294    140.672  11200  817     7      46     9          Male    0        Yes      African American  1677
185  185    158.889  11589  805     1      62     17         Female  0        Yes      Caucasian         1448

# 6. Filter only `Age > 50` with `Rating > 400`
credict_data_filtered <- credit_data[credit_data$Age > 50 & credit_data$Rating > 400, ]
head(credict_data_filtered)
nrow(credict_data_filtered)

A data.frame: 6 × 12
     ID     Income   Limit  Rating  Cards  Age    Education  Gender  Student  Married  Ethnicity         Balance
     <int>  <dbl>    <int>  <int>   <int>  <int>  <int>      <fct>   <dbl>    <fct>    <fct>             <int>
324  324    182.728  13913  982     4      98     17         Male    0        Yes      Caucasian         1999
356  356    180.682  11966  832     2      58     8          Female  0        Yes      African American  1405
185  185    158.889  11589  805     1      62     17         Female  0        Yes      Caucasian         1448
348  348    160.231  10748  754     2      69     17         Male    0        No       Caucasian         1192
175  175    121.834  10673  750     3      54     16         Male    0        No       African American  1573
391  391    135.118  10578  747     3      81     15         Female  0        Yes      Asian             1393

73

# 7. Evaluate average `Income` for `Married = Yes` and `Married = No` with age in range [20, 30]
# 7.1 Do the same for Age [30, 40]
married <- with(credit_data, credit_data[(Age >= 20 & Age <= 30) & Married == "Yes", ])
head(married)
# assumed definition, mirroring married
not_married <- with(credit_data, credit_data[(Age >= 20 & Age <= 30) & Married == "No", ])
mean(married$Income) # is it better to be married? :)
mean(not_married$Income)

35.25635

25.1618333333333
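The two group means can also be computed in one call with tapply(). The sketch below uses a small made-up data frame, toy, as a stand-in for credit_data; its values are hypothetical and chosen only so the result is easy to check by hand.

```r
# Hypothetical toy data standing in for credit_data; the values are made up
toy <- data.frame(Age = c(22, 28, 25, 35, 30, 41),
                  Married = c("Yes", "No", "Yes", "No", "No", "Yes"),
                  Income = c(30, 20, 40, 50, 26, 60))

# Keep only rows with Age in [20, 30]
in_range <- toy[toy$Age >= 20 & toy$Age <= 30, ]

# Mean Income for each value of Married
group_means <- tapply(in_range$Income, in_range$Married, mean)
group_means
```

With this toy data the "Yes" group averages (30 + 40) / 2 = 35 and the "No" group (20 + 26) / 2 = 23.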