19  apply() functions [EN]

Author

Юрій Клебан

19.1 Apply functions family

R provides a set of functions for manipulating and accessing different data structures, such as data.frame and list.

The apply() family pertains to the R base package and is populated with functions to manipulate slices of data from matrices, arrays, lists and dataframes in a repetitive way. These functions allow crossing the data in a number of ways and avoid explicit use of loop constructs. They act on an input list, matrix or array and apply a named function with one or several optional arguments.

The called function could be:

  • An aggregating function, such as the mean or the sum (which returns a single number, a scalar);
  • Other transforming or subsetting functions; and
  • Other vectorized functions, which yield more complex structures like lists, vectors, matrices, and arrays.

The apply() functions form the basis of more complex combinations and help to perform operations with very few lines of code. More specifically, the family is made up of the apply(), lapply(), sapply(), vapply(), mapply(), rapply(), and tapply() functions.

Which function to use depends on the structure of the data you want to operate on and the format of the output you need.
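As a rough orientation, the sketch below calls the family members that this chapter does not cover in detail. The inputs `x`, `groups` and `nested` are made up for illustration and are not data from the chapter.

```r
# Illustrative inputs (assumptions made for this sketch)
x <- c(1, 2, 3, 4, 5, 6)
groups <- c("a", "a", "b", "b", "c", "c")

# vapply(): like sapply(), but the type and length of each result are declared
squares <- vapply(x, function(v) v^2, FUN.VALUE = numeric(1))

# mapply(): applies a function to several vectors in parallel
sums <- mapply(function(a, b) a + b, 1:3, 4:6)

# tapply(): applies a function to subsets of a vector defined by a grouping
group_means <- tapply(x, groups, mean)

# rapply(): applies a function recursively to the leaves of a nested list
nested <- list(1, list(2, 3))
doubled <- rapply(nested, function(v) v * 2, how = "unlist")

squares      # 1 4 9 16 25 36
sums         # 5 7 9
group_means  # a = 1.5, b = 3.5, c = 5.5
doubled      # 2 4 6
```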

19.1.1 apply()

apply() operates on arrays (2D arrays are matrices).

The syntax is apply(X, MARGIN, FUN, ...), where

  • X is an array, or a matrix if the array has 2 dimensions;
  • MARGIN is a value defining how the function is applied: with MARGIN=1 it is applied over rows, whereas with MARGIN=2 it is applied over columns. With MARGIN=c(1,2) it is applied to every individual element (each row and column combination); and
  • FUN, which is the function that you want to apply to the data. It can be any R function, including a User Defined Function (UDF).
# create a matrix

matrix  <- matrix(10:29, ncol = 5, nrow = 4)
matrix
A matrix: 4 × 5 of type int
10 14 18 22 26
11 15 19 23 27
12 16 20 24 28
13 17 21 25 29
# find sums by col
apply(matrix, 2, sum)
  1. 46
  2. 62
  3. 78
  4. 94
  5. 110

It's your turn. TASK. Calculate the average value of each row:

apply(matrix, 1, mean)
  1. 18
  2. 19
  3. 20
  4. 21
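The remaining options of apply() can be sketched as follows: a user-defined function, extra arguments passed through `...`, and MARGIN = c(1, 2). The names `m`, `m_na`, `row_range`, `col_means` and `halved` are chosen here for illustration only.

```r
m <- matrix(10:29, nrow = 4, ncol = 5)

# User-defined function: range (max - min) of each row
row_range <- apply(m, 1, function(r) max(r) - min(r))

# Extra arguments after FUN are passed on to it (here: na.rm for mean)
m_na <- m
m_na[1, 1] <- NA
col_means <- apply(m_na, 2, mean, na.rm = TRUE)

# MARGIN = c(1, 2): FUN is called on every single cell
halved <- apply(m, c(1, 2), function(v) v / 2)

row_range    # 16 16 16 16
col_means    # 12.0 15.5 19.5 23.5 27.5
dim(halved)  # 4 5
```

Using a name such as `m` instead of `matrix` also avoids shadowing the base function matrix().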

19.1.2 lapply()

lapply() differs from apply() in the following ways:

  • It can be used for other objects like dataframes, lists or vectors; and
  • The output returned is a list (which explains the “l” in the function name), which has the same number of elements as the object passed to it.
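A minimal sketch of the second point, assuming a small data frame `df` created just for this example: a data frame is a list of columns, so lapply() returns one list element per column.

```r
df <- data.frame(Value1 = 1:5, Value2 = 101:105)

# lapply() visits each column in turn and collects the results in a list
col_sums <- lapply(df, sum)

class(col_sums)   # "list"
length(col_sums)  # 2, one element per column
col_sums$Value1   # 15
col_sums$Value2   # 515
```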

Run ?lapply to check the parameters of the function:

# ?lapply

Let's create a list of data frames:

df_a <- data.frame(Value1 = 1:5, Value2 = 101:105)
df_a
df_b <- data.frame(Value1 = 11:15, Value2 = 201:205)
df_c <- data.frame(Value1 = 16:20, Value2 = 301:305)
df_c

lapply(df_a$Value1, sum)
A data.frame: 5 × 2
Value1  Value2
<int>   <int>
1       101
2       102
3       103
4       104
5       105
A data.frame: 5 × 2
Value1  Value2
<int>   <int>
16      301
17      302
18      303
19      304
20      305
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
list_demo <- list(df_a, df_b, df_c)
list_demo
  1. A data.frame: 5 × 2
    Value1  Value2
    <int>   <int>
    1       101
    2       102
    3       103
    4       104
    5       105
  2. A data.frame: 5 × 2
    Value1  Value2
    <int>   <int>
    11      201
    12      202
    13      203
    14      204
    15      205
  3. A data.frame: 5 × 2
    Value1  Value2
    <int>   <int>
    16      301
    17      302
    18      303
    19      304
    20      305
# let's select the 2nd column of each data frame

lapply(list_demo, "[", , 2)
# list_demo - data
# "["       - selection operator
# (empty)   - row index: all rows
# 2         - column index
  1.
    1. 101
    2. 102
    3. 103
    4. 104
    5. 105
  2.
    1. 201
    2. 202
    3. 203
    4. 204
    5. 205
  3.
    1. 301
    2. 302
    3. 303
    4. 304
    5. 305
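The call above works because "[" is an ordinary function in R, and lapply() forwards the remaining arguments to it as the row and column indices. A minimal sketch of the equivalence, using a shorter list named `dfs` that is assumed here for illustration:

```r
dfs <- list(
  data.frame(Value1 = 1:5,   Value2 = 101:105),
  data.frame(Value1 = 11:15, Value2 = 201:205)
)

# "[" is passed as the function; the empty argument stands for "all rows"
by_operator <- lapply(dfs, "[", , 2)

# The same selection written as an explicit anonymous function
by_function <- lapply(dfs, function(df) df[, 2])

identical(by_operator, by_function)  # TRUE
```

The anonymous-function form is longer but may be easier to read when the indexing becomes more involved.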

TASK. It's your turn. Select the 1st row of each data frame:

lapply(list_demo, "[", 1,)
  1. A data.frame: 1 × 2
       Value1  Value2
       <int>   <int>
    1  1       101
  2. A data.frame: 1 × 2
       Value1  Value2
       <int>   <int>
    1  11      201
  3. A data.frame: 1 × 2
       Value1  Value2
       <int>   <int>
    1  16      301

TASK. It's your turn. Select the 1st element (1st row, 1st column) of each data frame:

lapply(list_demo, "[", 1, 1)
  1. 1
  2. 11
  3. 16

You can apply a function to all elements. Let's convert some names to lowercase:

names_list <- list("John", "Jane", "Jake", "Jacob")
lower_names <- lapply(names_list, tolower) 
class(lower_names)
'list'

19.1.3 sapply()

sapply() takes a list, vector or data frame as input and returns the output in vector or matrix form. Let's use sapply() in the previous example and check the result.

sapply(names_list, tolower) 
  1. 'john'
  2. 'jane'
  3. 'jake'
  4. 'jacob'

It tries to simplify the output to the most elementary data structure that is possible. And indeed, sapply() is a ‘wrapper’ function for lapply().
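A short sketch of this simplification, reusing the names example and adding a made-up list of integer vectors. vapply() appears only as a stricter alternative from the same family; the variable names are assumptions of this sketch.

```r
names_list <- list("John", "Jane", "Jake", "Jacob")

# One value per element: the list from lapply() is simplified to a vector
as_list   <- lapply(names_list, tolower)
as_vector <- sapply(names_list, tolower)
identical(as_vector, unlist(as_list))  # TRUE

# Two values per element: sapply() returns a matrix, one column per element
stats <- sapply(list(1:5, 11:15, 16:20), range)
dim(stats)  # 2 3

# vapply() declares the expected result type instead of guessing it
n_chars <- vapply(names_list, nchar, FUN.VALUE = integer(1))
n_chars  # 4 4 4 5
```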

Let's try to get the 1st element of the 2nd row from each data frame in our list_demo:

list_demo
  1. A data.frame: 5 × 2
    Value1  Value2
    <int>   <int>
    1       101
    2       102
    3       103
    4       104
    5       105
  2. A data.frame: 5 × 2
    Value1  Value2
    <int>   <int>
    11      201
    12      202
    13      203
    14      204
    15      205
  3. A data.frame: 5 × 2
    Value1  Value2
    <int>   <int>
    16      301
    17      302
    18      303
    19      304
    20      305
data <- sapply(list_demo, "[", 2,1)
data
class(data)
  1. 2
  2. 12
  3. 17
'integer'
# let's set simplify = FALSE
data <- sapply(list_demo, "[", 2, 1, simplify = FALSE)
data
class(data)
  1. 2
  2. 12
  3. 17
'list'
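With simplify = FALSE, sapply() behaves like lapply() for this input. A minimal check, assuming a two-element list `dfs` built only for this sketch:

```r
dfs <- list(
  data.frame(Value1 = 1:5,   Value2 = 101:105),
  data.frame(Value1 = 11:15, Value2 = 201:205)
)

simplified   <- sapply(dfs, "[", 2, 1)                    # integer vector
unsimplified <- sapply(dfs, "[", 2, 1, simplify = FALSE)  # list

identical(unsimplified, lapply(dfs, "[", 2, 1))  # TRUE
identical(simplified, unlist(unsimplified))      # TRUE
```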

19.1.4 aggregate()

This function comes from the stats package. It is often used for grouping data by some key. It is not part of the apply family, but it works in a similar way, so it is a good idea to discuss it here.

Syntax for data.frame:

aggregate(x,               # R object
          by,              # List of variables (grouping elements)
          FUN,             # Function to be applied for summary statistics
          ...,             # Additional arguments to be passed to FUN
          simplify = TRUE, # Whether to simplify results as much as possible or not
          drop = TRUE)     # Whether to drop unused combinations of grouping values or not

Syntax for formula:

aggregate(formula,             # Input formula
          data,                # List or data frame where the variables are stored
          FUN,                 # Function to be applied for summary statistics
          ...,                 # Additional arguments to be passed to FUN
          subset,              # Observations to be used (optional)
          na.action = na.omit) # How to deal with NA values
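Both syntaxes can be compared on a small example. The sketch below uses the built-in mtcars dataset (an assumption of this example, chosen because it needs no extra package) and computes the mean mpg per number of cylinders in both ways.

```r
# mtcars ships with base R (package datasets), so no extra package is needed

# Data-frame syntax: x holds the values, by holds the grouping variables
by_syntax <- aggregate(x = mtcars["mpg"], by = list(cyl = mtcars$cyl), FUN = mean)

# Formula syntax: "mpg grouped by cyl"
formula_syntax <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

by_syntax
formula_syntax
```

Both calls return a data frame with one row per value of cyl; the formula form is usually shorter when the variables live in the same data frame.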

Let's use our credit_data from one of the previous tasks:

# The Credit dataset comes from the ISLR package; install it first if needed:
# install.packages("ISLR")
credit_data <- ISLR::Credit
head(credit_data)

19.2 Tasks

19.2.1 TASK 1. Calculate average Age by Gender:

# let's use the formula syntax
mean_age <- aggregate(Age ~ Gender, data = credit_data, mean)
mean_age

# rename the aggregated column from "Age" to "Mean Age"
n <- names(mean_age)
n[n == "Age"] <- "Mean Age"
names(mean_age) <- n
mean_age
A data.frame: 2 × 2
Gender  Age
<fct>   <dbl>
Male    55.59585
Female  55.73430
A data.frame: 2 × 2
Gender  Mean Age
<fct>   <dbl>
Male    55.59585
Female  55.73430

19.2.2 TASK 2. Average Age by Gender and Married status at the same time

group_bal <- aggregate(Age ~ Gender + Married, data = credit_data, mean)
group_bal
A data.frame: 4 × 3
Gender  Married  Age
<fct>   <fct>    <dbl>
Male    No       57.13158
Female  No       57.36709
Male    Yes      54.59829
Female  Yes      54.72656

19.2.3 Task 3

Try to get the average Income aggregated by Age and Gender. Order the final data.frame by age and make a plot().

group_inc <- aggregate(Income ~ Age + Gender, data = credit_data, mean)
head(group_inc, 10)
A data.frame: 10 × 3
    Age    Gender  Income
    <int>  <fct>   <dbl>
1   24     Male    25.97400
2   25     Male    29.56700
3   26     Male    16.47900
4   27     Male    39.70500
5   28     Male    33.01700
6   29     Male    17.95850
7   30     Male    35.10467
8   31     Male    43.52567
9   32     Male    33.71150
10  33     Male    39.39733
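# the original level " Male" has a leading space, so rename the levels
levels(group_inc$Gender)
levels(group_inc$Gender) <- c("Male", "Female")

# split the aggregated data by gender
m_data <- group_inc[group_inc$Gender == "Male", ]
nrow(m_data)

f_data <- group_inc[group_inc$Gender == "Female", ]
nrow(f_data)

# draw Male income as a red line, then add Female income as a blue line
with(m_data, plot(Age, Income, type = "l", col = "red"))
with(f_data, lines(Age, Income, type = "l", col = "blue"))
# plot(group_inc$Age, group_inc$Income, type = "b")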
  1. ' Male'
  2. 'Female'
63
62
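The task also asks for the data frame to be ordered by age. One possible way to sort explicitly is order(); the sketch below uses a small hypothetical data frame `toy` with made-up values rather than the real aggregated result.

```r
# Hypothetical aggregated result, deliberately out of order (values are made up)
toy <- data.frame(
  Age    = c(30, 24, 27, 30, 24, 27),
  Gender = c("Male", "Male", "Male", "Female", "Female", "Female"),
  Income = c(35.1, 26.0, 39.7, 40.2, 22.5, 31.8)
)

# order() returns the row permutation that sorts by Age, then by Gender
toy_sorted <- toy[order(toy$Age, toy$Gender), ]

toy_sorted$Age     # 24 24 27 27 30 30
toy_sorted$Gender  # "Female" "Male" "Female" "Male" "Female" "Male"
```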

