17 Data frames [EN]

Author

Юрій Клебан

17.1 What is a data frame?

Data frames are the most popular data structure in R because they let you collect columns of different types in one object and manipulate them quickly.

A data frame is a matrix-like structure whose columns may be of differing types (numeric, logical, factor, character, and so on).

The function data.frame() creates data frames, tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R’s modeling software.

Syntax

data.frame(..., row.names = NULL, check.rows = FALSE,
           check.names = TRUE, fix.empty.names = TRUE,
           stringsAsFactors = FALSE)

Arguments (most useful)

... - these arguments are of either the form value or tag = value. Component names are created based on the tag (if present) or the deparsed argument itself.
row.names - NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.
stringsAsFactors - logical: should character vectors be converted to factors? The ‘factory-fresh’ default was previously TRUE but was changed to FALSE in R 4.0.0.

Details

A data frame is a list of variables of the same number of rows with unique row names, given class data.frame. If no variables are included, the row names determine the number of rows.

data.frame converts each of its arguments to a data frame by calling as.data.frame(optional = TRUE). As that is a generic function, methods can be written to change the behaviour of arguments according to their classes: R comes with many such methods. Character variables passed to data.frame are converted to factor columns when the argument stringsAsFactors is TRUE, unless they are protected by I(). If a list, data frame, or matrix is passed to data.frame, it is as if each component or column had been passed as a separate argument.
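The conversion rule above can be checked with a small sketch. The data and the column names plain and protected are made up for illustration; only data.frame(), I(), is.factor(), and is.character() from base R are used.

```r
# Hypothetical example data; the column names are illustrative only
letters_vec <- c("x", "y", "z")

# With stringsAsFactors = TRUE, plain character vectors become factors,
# while vectors wrapped in I() are kept as character ("AsIs") columns
df_demo <- data.frame(plain = letters_vec,
                      protected = I(letters_vec),
                      stringsAsFactors = TRUE)

is.factor(df_demo$plain)        # TRUE
is.character(df_demo$protected) # TRUE

# A data frame is also a list of equally long columns
is.list(df_demo)                # TRUE
```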

17.2 Creating Data Frames

Data frames are usually created by reading a dataset from a file or by scraping it from a website. However, data frames can also be created explicitly with the data.frame() function, or they can be coerced from other types of objects such as lists. In this case I’ll create a simple data frame df and assess its basic structure:

df <- data.frame(id = 1:5,
                char_col = c("a", "b", "c", "d", "e"),
                log_col = c(T,T,T,F,T),
                double_col = c(2.1, 1, 0.5, pi, 12.7))
df

A data.frame: 5 × 4
id	char_col	log_col	double_col
<int>	<chr>	<lgl>	<dbl>
1	a	TRUE	2.100000
2	b	TRUE	1.000000
3	c	TRUE	0.500000
4	d	FALSE	3.141593
5	e	TRUE	12.700000

# assess the structure of a data frame
str(df)

'data.frame':   5 obs. of  4 variables:
 $ id        : int  1 2 3 4 5
 $ char_col  : chr  "a" "b" "c" "d" ...
 $ log_col   : logi  TRUE TRUE TRUE FALSE TRUE
 $ double_col: num  2.1 1 0.5 3.14 12.7

# number of rows
nrow(df)

# number of columns
ncol(df)

If you want to convert character columns to factors “on the fly”, use stringsAsFactors = TRUE:

df <- data.frame(i = 1:5,
                char_col = c("a", "b", "c", "d", "e"),
                log_col = c(T,T,T,F,T),
                double_col = c(2.1, 1, 0.5, pi, 12.7),
                stringsAsFactors = TRUE) # set explicitly: the default was TRUE before R 4.0.0 and is FALSE since
df

A data.frame: 5 × 4
i	char_col	log_col	double_col
<int>	<fct>	<lgl>	<dbl>
1	a	TRUE	2.100000
2	b	TRUE	1.000000
3	c	TRUE	0.500000
4	d	FALSE	3.141593
5	e	TRUE	12.700000

Data frames can also be created from lists (lists are explained in the next chapter). Here we first collect several vectors into a list and then convert it:

v_int <- 1:5
v_char <- c("a", "b", "c", "d", "e")
v_log <- c(T,T,T,F,T)
v_double <- c(2.1, 1, 0.5, pi, 12.7)

demo_list <- list(int_col = v_int,
                  char_col = v_char,
                  log_col = v_log,
                  double_col = v_double)
as.data.frame(demo_list)

A data.frame: 5 × 4
int_col	char_col	log_col	double_col
<int>	<chr>	<lgl>	<dbl>
1	a	TRUE	2.100000
2	b	TRUE	1.000000
3	c	TRUE	0.500000
4	d	FALSE	3.141593
5	e	TRUE	12.700000

A matrix can serve as the basis for a data frame too:

demo_matrix <- matrix(100:119, nrow = 5, ncol = 4)
demo_matrix

as.data.frame(demo_matrix)

A matrix: 5 × 4 of type int
100	105	110	115
101	106	111	116
102	107	112	117
103	108	113	118
104	109	114	119

A data.frame: 5 × 4
V1	V2	V3	V4
<int>	<int>	<int>	<int>
100	105	110	115
101	106	111	116
102	107	112	117
103	108	113	118
104	109	114	119
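The default names V1 to V4 appear because the matrix has no column names. As a small sketch (the names q1 to q4 are made up for illustration), assigning column names to the matrix first carries them over to the data frame:

```r
# Same matrix as above, but with column names assigned before conversion
# (the names q1..q4 are hypothetical)
named_matrix <- matrix(100:119, nrow = 5, ncol = 4)
colnames(named_matrix) <- c("q1", "q2", "q3", "q4")

named_df <- as.data.frame(named_matrix)
names(named_df) # "q1" "q2" "q3" "q4"
dim(named_df)   # 5 4
```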

17.3 Extending data frames

You can add rows and columns to a data frame. Merging two data frames by the values of selected columns is available too.

cbind() adds a new column:

df <-  data.frame(A1 = c("A", "B", "C"),
                  A2 = c("D", "E", "F"))
df

A3 <- c(1, 2, 3)
cbind(df, A3) # returns a new data frame; df itself is not modified

# rename the columns of df
colnames(df)
colnames(df) <- c("B1", "B2")
colnames(df)

A data.frame: 3 × 2
A1	A2
<chr>	<chr>
A	D
B	E
C	F

A data.frame: 3 × 3
A1	A2	A3
<chr>	<chr>	<dbl>
A	D	1
B	E	2
C	F	3

'A1'
'A2'

'B1'
'B2'
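Note that cbind() in the example above returns a new data frame and leaves df unchanged, which is why colnames(df) still lists only two columns. A minimal sketch of keeping the added column follows; the object names scores and scores_wide and the column A4 are made up for illustration.

```r
scores <- data.frame(A1 = c("A", "B", "C"),
                     A2 = c("D", "E", "F"))

# cbind() returns a new data frame; assign it to keep the extra column
scores_wide <- cbind(scores, A3 = c(1, 2, 3))

# Assigning to a new name with $ adds the column to the existing object
scores$A4 <- scores_wide$A3 * 10

ncol(scores_wide) # 3
names(scores)     # "A1" "A2" "A4"
```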

rbind() adds a new row:

letters_frame <-  data.frame(A1 = c("A", "B", "C"),
                            A2 = 1:3)
letters_frame

next_row <- c("D", 4) # c() coerces both values to character, so A2 becomes <chr> after rbind()
rbind(letters_frame, next_row)

A data.frame: 3 × 2
A1	A2
<chr>	<int>
A	1
B	2
C	3

A data.frame: 4 × 2
A1	A2
<chr>	<chr>
A	1
B	2
C	3
D	4
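In the rbind() output above, A2 changes from <int> to <chr> because c("D", 4) is a character vector. One way to keep the column types, sketched here under the assumption that the new row is supplied as a one-row data frame with matching column names, is:

```r
letters_frame <- data.frame(A1 = c("A", "B", "C"),
                            A2 = 1:3)

# A one-row data frame keeps each value's own type
next_row <- data.frame(A1 = "D", A2 = 4L)
extended <- rbind(letters_frame, next_row)

str(extended)
class(extended$A2) # "integer"
```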

17.4 Merging data frames

Data frames can be merged by key with merge():

df1 <- data.frame(Id = c(1:4),
                  Name = c("Nick", "Jake", "Jane", "Mary"))
df1

df2 <- data.frame(Id = c(2, 1, 3, 5), # different order from Id in df1
                  Age = c(34, 21, 45, 20))
df2

# inner join: keep only the Ids present in both data frames
df_final <- merge(df1, df2, by = "Id", all.x = FALSE, all.y = FALSE)
df_final

A data.frame: 4 × 2
Id	Name
<int>	<chr>
1	Nick
2	Jake
3	Jane
4	Mary

A data.frame: 4 × 2
Id	Age
<dbl>	<dbl>
2	34
1	21
3	45
5	20

A data.frame: 3 × 3
Id	Name	Age
<int>	<chr>	<dbl>
1	Nick	21
2	Jake	34
3	Jane	45
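The merge above keeps only the Ids found in both data frames. The all.x, all.y, and all arguments of merge() control which unmatched rows are kept; the sketch below reuses the same data, and the result names left_joined and full_joined are made up for illustration.

```r
df1 <- data.frame(Id = 1:4,
                  Name = c("Nick", "Jake", "Jane", "Mary"))
df2 <- data.frame(Id = c(2, 1, 3, 5),
                  Age = c(34, 21, 45, 20))

# Left join: keep every row of df1; Age is NA where df2 has no match
left_joined <- merge(df1, df2, by = "Id", all.x = TRUE)

# Full join: keep the rows from both data frames
full_joined <- merge(df1, df2, by = "Id", all = TRUE)

nrow(left_joined) # 4
nrow(full_joined) # 5
```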

17.4.1 Subsetting Data Frames

Data frames possess the characteristics of both lists and matrices: if you subset with a single vector, they behave like lists and will return the selected columns with all rows; if you subset with two vectors, they behave like matrices and can be subset by row and column:

df <- data.frame(int_col = 1:5,
                char_col = c("a", "b", "c", "d", "e"),
                log_col = c(T,T,T,F,T),
                double_col = c(2.1, 1, 0.5, pi, 12.7),
                row.names = paste0("row_", 1:5), # setting row names 
                stringsAsFactors = TRUE) # set explicitly: the default was TRUE before R 4.0.0 and is FALSE since
df

A data.frame: 5 × 4
	int_col	char_col	log_col	double_col
	<int>	<fct>	<lgl>	<dbl>
row_1	1	a	TRUE	2.100000
row_2	2	b	TRUE	1.000000
row_3	3	c	TRUE	0.500000
row_4	4	d	FALSE	3.141593
row_5	5	e	TRUE	12.700000

# select columns using $ sign
df$log_col

TRUE
TRUE
TRUE
FALSE
TRUE

# subsetting by row numbers
df[1, ] # first row
df[nrow(df), ] # last row
df[-1, ] # all except first row

A data.frame: 1 × 4
	int_col	char_col	log_col	double_col
	<int>	<fct>	<lgl>	<dbl>
row_1	1	a	TRUE	2.1

A data.frame: 1 × 4
	int_col	char_col	log_col	double_col
	<int>	<fct>	<lgl>	<dbl>
row_5	5	e	TRUE	12.7

A data.frame: 4 × 4
	int_col	char_col	log_col	double_col
	<int>	<fct>	<lgl>	<dbl>
row_2	2	b	TRUE	1.000000
row_3	3	c	TRUE	0.500000
row_4	4	d	FALSE	3.141593
row_5	5	e	TRUE	12.700000

# subsetting by row names
df[c("row_4", "row_5"), ]

A data.frame: 2 × 4
	int_col	char_col	log_col	double_col
	<int>	<fct>	<lgl>	<dbl>
row_4	4	d	FALSE	3.141593
row_5	5	e	TRUE	12.700000

# subsetting columns by name (matrix-style; the list-style equivalent is df[c("log_col", "double_col")])
df[, c("log_col", "double_col")]

A data.frame: 5 × 2
	log_col	double_col
	<lgl>	<dbl>
row_1	TRUE	2.100000
row_2	TRUE	1.000000
row_3	TRUE	0.500000
row_4	FALSE	3.141593
row_5	TRUE	12.700000

# subset for both rows and columns
df[2:5, c(1, 3:4)]

A data.frame: 4 × 3
	int_col	log_col	double_col
	<int>	<lgl>	<dbl>
row_2	2	TRUE	1.000000
row_3	3	TRUE	0.500000
row_4	4	FALSE	3.141593
row_5	5	TRUE	12.700000
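The difference between list-style and matrix-style subsetting also shows in the class of the result. The following sketch uses a small made-up data frame df_sub to compare the forms:

```r
# Hypothetical data frame for illustration
df_sub <- data.frame(int_col = 1:3,
                     char_col = c("a", "b", "c"))

# List-style: single brackets with one index return a data frame
class(df_sub["int_col"])                 # "data.frame"

# Double brackets (like $) extract the column itself as a vector
class(df_sub[["int_col"]])               # "integer"

# Matrix-style with one column drops to a vector unless drop = FALSE
class(df_sub[, "int_col"])               # "integer"
class(df_sub[, "int_col", drop = FALSE]) # "data.frame"
```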

You can also subset data frames based on conditional statements

# select only rows with double_col > 1
df[df$double_col > 1, ]

# select only rows where log_col is FALSE
df[!df$log_col, ]

A data.frame: 3 × 4
	int_col	char_col	log_col	double_col
	<int>	<fct>	<lgl>	<dbl>
row_1	1	a	TRUE	2.100000
row_4	4	d	FALSE	3.141593
row_5	5	e	TRUE	12.700000

A data.frame: 1 × 4
	int_col	char_col	log_col	double_col
	<int>	<fct>	<lgl>	<dbl>
row_4	4	d	FALSE	3.141593

"s" %in% c("s", "t")

TRUE

# select only rows with char_col equal to 'a' or 'e'
chars <- df$char_col %in% c("a", "e") # %in% checks against multiple values
chars
sum(chars) # number of matching rows
df[chars, ]

TRUE
FALSE
FALSE
FALSE
TRUE

A data.frame: 2 × 4
	int_col	char_col	log_col	double_col
	<int>	<fct>	<lgl>	<dbl>
row_1	1	a	TRUE	2.1
row_5	5	e	TRUE	12.7

# select only with double_col > 1 and log_col == TRUE
df[df$log_col == TRUE & df$double_col > 1, ]

A data.frame: 2 × 4
	int_col	char_col	log_col	double_col
	<int>	<fct>	<lgl>	<dbl>
row_1	1	a	TRUE	2.1
row_5	5	e	TRUE	12.7

# select only specific columns with double_col > 1 and log_col == TRUE
df[df$log_col == TRUE & df$double_col > 1, c("log_col", "int_col", "double_col")]

A data.frame: 2 × 3
	log_col	int_col	double_col
	<lgl>	<int>	<dbl>
row_1	TRUE	1	2.1
row_5	TRUE	5	12.7
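Base R also provides subset(), which expresses the same kind of filter without repeating the data frame name. This is a sketch of an alternative, not the approach used in the examples above; the names df_cond and picked are made up for illustration.

```r
df_cond <- data.frame(int_col = 1:5,
                      log_col = c(TRUE, TRUE, TRUE, FALSE, TRUE),
                      double_col = c(2.1, 1, 0.5, pi, 12.7))

# subset() evaluates the condition inside the data frame,
# so columns can be referenced without the df$ prefix
picked <- subset(df_cond,
                 log_col & double_col > 1,
                 select = c(log_col, int_col, double_col))
picked
```

Bracket subsetting is usually preferred inside functions, because subset() relies on non-standard evaluation.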

17.5 Ordering data frames

Let’s use our previous sample data.frame but with unordered values:

df <- data.frame(int_col = c(1, 5, 3, 4, 2),
                char_col = c("b", "a", "a", "d", "e"),
                log_col = c(T,T,T,F,T),
                double_col = c(2.1, 1, 0.5, pi, 12.7),
                row.names = paste0("row_", 1:5), # setting row names 
                stringsAsFactors = TRUE) # set explicitly: the default was TRUE before R 4.0.0 and is FALSE since
df

A data.frame: 5 × 4
	int_col	char_col	log_col	double_col
	<dbl>	<fct>	<lgl>	<dbl>
row_1	1	b	TRUE	2.100000
row_2	5	a	TRUE	1.000000
row_3	3	a	TRUE	0.500000
row_4	4	d	FALSE	3.141593
row_5	2	e	TRUE	12.700000

You can use the order() function to sort data frames. It returns the row indices in sorted order, which are then used to subset the rows.

# row indices that sort by char_col and by int_col
order(df$char_col)
order(df$int_col)
# sort by char_col
df[order(df$char_col),]

A data.frame: 5 × 4
	int_col	char_col	log_col	double_col
	<dbl>	<fct>	<lgl>	<dbl>
row_2	5	a	TRUE	1.000000
row_3	3	a	TRUE	0.500000
row_1	1	b	TRUE	2.100000
row_4	4	d	FALSE	3.141593
row_5	2	e	TRUE	12.700000

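Use rev() on the result of order() to sort in descending order (for numeric columns you can also negate the column, as in order(-df$int_col)):

# sort by int_col, descending
df[rev(order(df$int_col)), ]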

A data.frame: 5 × 4
	int_col	char_col	log_col	double_col
	<dbl>	<fct>	<lgl>	<dbl>
row_2	5	a	TRUE	1.000000
row_4	4	d	FALSE	3.141593
row_3	3	a	TRUE	0.500000
row_5	2	e	TRUE	12.700000
row_1	1	b	TRUE	2.100000

You can also sort by multiple columns with order(column1, column2), or with order(column1, -column2) to sort the second (numeric) column in descending order.
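As a sketch of sorting by several columns, the example below reuses the values of int_col and char_col from this section in a smaller data frame; the names df_ord and sorted are made up for illustration.

```r
df_ord <- data.frame(int_col = c(1, 5, 3, 4, 2),
                     char_col = c("b", "a", "a", "d", "e"))

# Sort by char_col ascending, then by int_col descending within ties
sorted <- df_ord[order(df_ord$char_col, -df_ord$int_col), ]
sorted$int_col # 5 3 1 4 2

# The minus sign only works for numeric columns; for character columns
# the decreasing argument of order() can be used instead
df_ord[order(df_ord$char_col, decreasing = TRUE), ]
```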

17.5.1 Manipulating `data.frames`

Type conversion
ifelse()
Creating new columns (for example, calculating age; see ?lubridate)
Removing missing values
Replacing missing values
Editing data interactively with DataEditR
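A few of these operations can be sketched in base R. The data frame people and its values are made up for illustration, and replacing a missing value with the mean is just one possible choice, not a recommendation from this chapter.

```r
# Hypothetical data for illustration only
people <- data.frame(name = c("Ann", "Bob", "Cid"),
                     age = c("25", "41", NA),
                     stringsAsFactors = FALSE)

# Type conversion: character to numeric
people$age <- as.numeric(people$age)

# ifelse(): build a new column from a vectorised condition
people$group <- ifelse(people$age >= 30, "30+", "under 30")

# Removing rows with missing values
complete_rows <- na.omit(people)

# Replacing missing values (here with the mean of the known ages)
people$age[is.na(people$age)] <- mean(people$age, na.rm = TRUE)
```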

17.5.2 Tasks

17.5.3 Task 1

Write code that evaluates $y = x^2 + e$, where x is a random number in range [0; 1].

Print calculation result as data.frame with columns X, E, Y.

Use the plot() function to visualize X vs Y as a line chart (type = "l" or "b").

Solution

# initiate data.frame
df <- data.frame(X = 1:10,
                 E = sample(5, 10, replace = T),
                 Y = NA)
head(df)

A data.frame: 6 × 3
	X	E	Y
	<int>	<int>	<lgl>
1	1	4	NA
2	2	1	NA
3	3	2	NA
4	4	3	NA
5	5	2	NA
6	6	2	NA

df$Y <- with(df, X^2 + E)
head(df)

A data.frame: 6 × 3
	X	E	Y
	<int>	<int>	<dbl>
1	1	4	5
2	2	1	5
3	3	2	11
4	4	3	19
5	5	2	27
6	6	2	38

plot(df$X, df$Y, type="l", col = "blue")

17.5.4 Task 2

1. Install and load the ISLR package.
2. Save the Credit dataset into the variable credit_data.
3. Check the dataset structure with the str() function.
4. Convert Student status “Yes/No” to 1/0.
5. Order the dataset by Rating, descending.
6. Keep only records with Age > 50 and Rating > 400. How many records do you get?
7. Evaluate the average Income for Married = Yes and Married = No with Age in range [20; 30].

7.1 Do the same for Age in range [30; 40]. Any conclusions?