19 `apply()` functions [EN]

Author

Юрій Клебан

19.1 Apply functions family

You can use a set of functions for manipulating and accessing different data structures, such as data.frame and list.

The apply() family belongs to the R base package and consists of functions that manipulate slices of data from matrices, arrays, lists and data frames in a repetitive way. These functions let you traverse the data in a number of ways and avoid explicit loop constructs. They act on an input list, matrix or array and apply a named function with one or several optional arguments.

The called function could be:

An aggregating function, like for example the mean, or the sum (that return a number or scalar);
Other transforming or subsetting functions; and
Other vectorized functions, which yield more complex structures like lists, vectors, matrices, and arrays.

The apply() functions form the basis of more complex combinations and help to perform operations with very few lines of code. More specifically, the family is made up of the apply(), lapply(), sapply(), vapply(), mapply(), rapply(), and tapply() functions.

Which function to use depends on the structure of the data you want to operate on and the format of the output you need.
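As a rough orientation, the sketch below runs several members of the family on small made-up inputs to show how the input and the returned structure differ. The vectors `scores` and `group` are illustrative assumptions, not data from this chapter.

```r
# Illustrative inputs (hypothetical, not from the chapter's data)
scores <- c(10, 20, 30, 40)
group  <- c("a", "b", "a", "b")

# lapply(): always returns a list, one element per input element
res_l <- lapply(scores, function(x) x / 10)

# sapply(): same computation, simplified to a vector when possible
res_s <- sapply(scores, function(x) x / 10)

# vapply(): like sapply(), but the type and length of each result are declared
res_v <- vapply(scores, function(x) x / 10, FUN.VALUE = numeric(1))

# mapply(): walks over several inputs in parallel
res_m <- mapply(function(x, y) x + y, 1:3, 4:6)

# tapply(): applies a function to groups defined by a grouping vector
res_t <- tapply(scores, group, mean)

class(res_l)  # "list"
res_s
res_m
res_t
```

Only the wrapper changes between the first three calls; the function being applied stays the same, which is why the choice is mostly about the shape of the result.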

19.1.1 `apply()`

apply() operates on arrays (2D arrays are matrices).

The syntax is: apply(X, MARGIN, FUN, ...), where

X is an array or a matrix if the dimension of the array is 2;
MARGIN is a variable defining how the function is applied: when MARGIN=1, it applies over rows, whereas with MARGIN=2, it works over columns. Note that with MARGIN=c(1,2) the function is applied over both dimensions, that is, to every individual element of the matrix; and
FUN, which is the function that you want to apply to the data. It can be any R function, including a User Defined Function (UDF).
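The three arguments can be tried on a tiny matrix. This is a minimal sketch: the matrix `m` and the helper `scale_by()` are hypothetical names introduced only for illustration.

```r
# A small illustrative matrix (hypothetical values)
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns, filled by column

# MARGIN = 1: one result per row
row_sums <- apply(m, 1, sum)   # 9 12

# MARGIN = 2: one result per column
col_max <- apply(m, 2, max)    # 2 4 6

# A user-defined function; extra arguments are passed on through `...`
scale_by <- function(x, factor) x * factor

# MARGIN = c(1, 2): the function is called for every single element
scaled <- apply(m, c(1, 2), scale_by, factor = 10)
dim(scaled)                    # 2 3, same shape as m
```

For simple element-wise arithmetic, plain vectorised code such as `m * 10` gives the same values; `MARGIN = c(1, 2)` is mainly useful when the function is not vectorised.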

# create a matrix

matrix  <- matrix(10:29, ncol = 5, nrow = 4)
matrix

A matrix: 4 × 5 of type int
10	14	18	22	26
11	15	19	23	27
12	16	20	24	28
13	17	21	25	29

# find sums by col
apply(matrix, 2, sum)

It's your turn. TASK. Calculate the average value of each row:

apply(matrix, 1, mean)

19.1.2 `lapply()`

lapply() differs from apply() in two ways:

It can be used for other objects like dataframes, lists or vectors; and
The output returned is a list (which explains the “l” in the function name), which has the same number of elements as the object passed to it.
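Because a data frame is internally a list of columns, lapply() visits it column by column. The sketch below illustrates this on a hypothetical data frame `df` that is not part of the chapter's data.

```r
# Hypothetical data frame used only for illustration
df <- data.frame(x = 1:3, y = c(10, 20, 30))

# A data frame is a list of columns, so lapply() visits each column
col_means <- lapply(df, mean)

length(col_means) == ncol(df)  # TRUE: one list element per column
names(col_means)               # "x" "y"
col_means$y                    # 20
```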

Use ?lapply to check the parameters of the function:

# ?lapply

Let's create a list of data frames:

df_a <- data.frame(Value1 = 1:5, Value2 = 101:105)
df_a
df_b <- data.frame(Value1 = 11:15, Value2 = 201:205)
df_c <- data.frame(Value1 = 16:20, Value2 = 301:305)
df_c

lapply(df_a$Value1, sum)

A data.frame: 5 × 2
Value1	Value2
<int>	<int>
1	101
2	102
3	103
4	104
5	105

A data.frame: 5 × 2
Value1	Value2
<int>	<int>
16	301
17	302
18	303
19	304
20	305

list_demo <- list(df_a, df_b, df_c)
list_demo

A data.frame: 5 × 2
Value1	Value2
1	101
2	102
3	103
4	104
5	105

A data.frame: 5 × 2
Value1	Value2
11	201
12	202
13	203
14	204
15	205

A data.frame: 5 × 2
Value1	Value2
16	301
17	302
18	303
19	304
20	305

# let's select the 2nd column of each data frame

lapply(list_demo, "[", , 2)
# list_demo - the data
# "["       - the selection (subsetting) operator
# (empty)   - row index: omitted, so all rows are kept
# 2         - column index

1. 101 102 103 104 105
2. 201 202 203 204 205
3. 301 302 303 304 305
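Passing "[" works because the subsetting operator is itself an ordinary function in R. As a small sketch on a hypothetical list `dfs` (not the chapter's list_demo), the operator form and an explicit anonymous function give the same result:

```r
# Hypothetical list of two small data frames
dfs <- list(
  data.frame(a = 1:2, b = 3:4),
  data.frame(a = 5:6, b = 7:8)
)

# "[" is an ordinary function: these two calls are equivalent
by_operator <- lapply(dfs, "[", , 2)
by_function <- lapply(dfs, function(d) d[, 2])

identical(by_operator, by_function)  # TRUE
```

The anonymous-function form is longer but often easier to read, especially when an index is left empty.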

TASK. It's your turn. Select the 1st row of each data frame:

lapply(list_demo, "[", 1,)

A data.frame: 1 × 2
	Value1	Value2
1	1	101

A data.frame: 1 × 2
	Value1	Value2
1	11	201

A data.frame: 1 × 2
	Value1	Value2
1	16	301

TASK. It's your turn. Select the 1st element (1st row, 1st column) of each data frame:

lapply(list_demo, "[", 1, 1)

You can apply a function to all elements. Let's convert some names to lowercase:

names_list <- list("John", "Jane", "Jake", "Jacob")
lower_names <- lapply(names_list, tolower) 
class(lower_names)

'list'

19.1.3 `sapply()`

sapply() takes a list, vector or data frame as input and returns the output as a vector or matrix. Let's use sapply() in the previous example and check the result.

sapply(names_list, tolower)

'john'
'jane'
'jake'
'jacob'

It tries to simplify the output to the most elementary data structure that is possible. And indeed, sapply() is a ‘wrapper’ function for lapply().
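The simplification rule can be seen on a few small cases. This is only a sketch with made-up inputs (`words`, `rng`, `uneven` are illustrative names):

```r
words <- list("John", "Jane")   # illustrative input

# sapply() behaves like lapply() followed by simplification
same <- identical(sapply(words, tolower), unlist(lapply(words, tolower)))
same        # TRUE

# Results of equal length > 1 are simplified to a matrix (one column per input)
rng <- sapply(list(1:3, 4:9), range)
dim(rng)    # 2 2

# Results of differing lengths cannot be simplified, so a list is returned
uneven <- sapply(1:3, seq_len)
class(uneven)  # "list"
```

Because the returned type depends on the data, vapply() with an explicit FUN.VALUE is usually the safer choice inside functions and scripts.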

Let’s try to get the 1st element of the 2nd row of every data frame in our list_demo:

list_demo

A data.frame: 5 × 2
Value1	Value2
1	101
2	102
3	103
4	104
5	105

A data.frame: 5 × 2
Value1	Value2
11	201
12	202
13	203
14	204
15	205

A data.frame: 5 × 2
Value1	Value2
16	301
17	302
18	303
19	304
20	305

data <- sapply(list_demo, "[", 2,1)
data
class(data)

'integer'

# let's set simplify = FALSE
data <- sapply(list_demo, "[", 2, 1, simplify = FALSE)
data
class(data)

'list'

19.1.4 `aggregate()`

This function comes from the stats package. It is often used for grouping data by some key. It is not formally part of the apply family, but it works in a similar way, so it is a good idea to discuss it now.

Syntax for data.frame:

aggregate(x,               # R object
          by,              # List of variables (grouping elements)
          FUN,             # Function to be applied for summary statistics
          ...,             # Additional arguments to be passed to FUN
          simplify = TRUE, # Whether to simplify results as much as possible or not
          drop = TRUE)     # Whether to drop unused combinations of grouping values or not

Syntax for formula:

aggregate(formula,             # Input formula
          data,                # List or data frame where the variables are stored
          FUN,                 # Function to be applied for summary statistics
          ...,                 # Additional arguments to be passed to FUN
          subset,              # Observations to be used (optional)
          na.action = na.omit) # How to deal with NA values
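Both interfaces can be compared on a tiny example. The data frame `sales` below is hypothetical and exists only to illustrate the two call styles; it does not require any extra package.

```r
# Hypothetical sales data, used only for illustration
sales <- data.frame(
  region = c("north", "south", "north", "south"),
  amount = c(100, 200, 300, 400)
)

# data.frame interface: x holds the values, by is a named list of groupings
by_df <- aggregate(x = sales["amount"],
                   by = list(region = sales$region),
                   FUN = mean)

# formula interface: read as "amount grouped by region"
by_formula <- aggregate(amount ~ region, data = sales, FUN = mean)

by_formula
#   region amount
# 1  north    200
# 2  south    300
```

The formula interface is usually shorter to write, and it drops rows with missing values by default (na.action = na.omit), which the data.frame interface does not do in the same way.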

Let's use our credit_data from one of the previous tasks:

# install.packages("ISLR")  # run once if the ISLR package is not installed
credit_data <- ISLR::Credit
head(credit_data)

19.2 Tasks

19.2.1 TASK 1. Calculate average Age by Gender:

# let's use the formula syntax
mean_age <- aggregate(Age ~ Gender, data = credit_data, FUN = mean)
mean_age

# rename the aggregated column from "Age" to "Mean Age"
n <- names(mean_age)
n[n == "Age"] <- "Mean Age"
names(mean_age) <- n
mean_age

A data.frame: 2 × 2
Gender	Age
<fct>	<dbl>
Male	55.59585
Female	55.73430

A data.frame: 2 × 2
Gender	Mean Age
<fct>	<dbl>
Male	55.59585
Female	55.73430

19.2.2 TASK 2. Average Age by Gender and Married status at the same time

group_bal <- aggregate(Age ~ Gender + Married, data = credit_data, mean)
group_bal

A data.frame: 4 × 3
Gender	Married	Age
<fct>	<fct>	<dbl>
Male	No	57.13158
Female	No	57.36709
Male	Yes	54.59829
Female	Yes	54.72656

19.2.3 Task 3

Try to get the average Income aggregated by Age and Gender. Order the final data.frame by Age and make a plot().

group_inc <- aggregate(Income ~ Age + Gender, data = credit_data, mean)
head(group_inc, 10)

A data.frame: 10 × 3
	Age	Gender	Income
	<int>	<fct>	<dbl>
1	24	Male	25.97400
2	25	Male	29.56700
3	26	Male	16.47900
4	27	Male	39.70500
5	28	Male	33.01700
6	29	Male	17.95850
7	30	Male	35.10467
8	31	Male	43.52567
9	32	Male	33.71150
10	33	Male	39.39733

# the first level is " Male" with a leading space, so rename the levels
levels(group_inc$Gender)
levels(group_inc$Gender) <- c("Male", "Female")

# split the aggregated data by gender
m_data <- group_inc[group_inc$Gender == "Male", ]
nrow(m_data)

f_data <- group_inc[group_inc$Gender == "Female", ]
nrow(f_data)

# mean Income by Age: red line for males, blue line for females
with(m_data, plot(Age, Income, type = "l", col = "red"))
with(f_data, lines(Age, Income, type = "l", col = "blue"))
# plot(group_inc$Age, group_inc$Income, type = "b")

' Male'
'Female'
