14  Vectors [EN]

Author

Юрій Клебан

This chapter is about vectors in R: declaring them, operating on them, naming and accessing their elements, and cleaning missing values.

invisible(Sys.setlocale("LC_ALL", "Ukrainian"))
invisible(options(warn=-1))

14.1 Declaring vectors

A vector is a basic data structure in R that stores a collection of elements of the same type. It is created with c(), or without it when the values form a sequence.

Note. In essence, the function c() allows you to combine several vectors.

Consider for example the usual variable x:

x <- 10

In essence, x in this case is a vector consisting of a single value, 10. We can also write several elements to the variable x:

x <- c(1, 2, 2.5, 3)
x
  1. 1
  2. 2
  3. 2.5
  4. 3

Vector elements can be values of any type (numeric, character, logical, etc.), as long as all elements of one vector share the same type:

v1 <- c(1, 3, 4, 6, 7)
v2 <- c(T, F, F, T, F)
v3 <- c("Hello", "my", "friend", "!")
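Because a vector holds elements of a single type, mixing types inside c() makes R coerce all values to the most general type. The following is a minimal sketch of that behaviour; the names mixed_num and mixed_chr are illustrative and do not appear elsewhere in this chapter.

```r
# class() reports the type of a vector's elements
v1 <- c(1, 3, 4, 6, 7)
v3 <- c("Hello", "my", "friend", "!")
class(v1)  # "numeric"
class(v3)  # "character"

# Mixing types triggers coercion: logical -> numeric -> character
mixed_num <- c(1, TRUE, FALSE)  # TRUE/FALSE become 1/0
mixed_chr <- c(1, "a", TRUE)    # every element becomes a string
class(mixed_num)  # "numeric"
class(mixed_chr)  # "character"
```

This is why a stray text value in a numeric column turns the whole vector into character data.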

Vectors can also be built from sequences created with the functions rep(), seq() and the : operator:

vtr <-  2:7
vtr
vtr <- 7:2
vtr
  1. 2
  2. 3
  3. 4
  4. 5
  5. 6
  6. 7
  1. 7
  2. 6
  3. 5
  4. 4
  5. 3
  6. 2
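The example above covers only the : operator. As a short sketch of the other two functions mentioned, seq() builds a sequence with a chosen step or length, and rep() repeats values; the variable names s1, s2, r1, r2 are chosen only for this illustration.

```r
# seq(): from, to, and either a step (by) or a target length (length.out)
s1 <- seq(from = 1, to = 10, by = 3)         # 1 4 7 10
s2 <- seq(from = 0, to = 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00

# rep(): repeat the whole vector (times) or each element (each)
r1 <- rep(c(1, 2), times = 3)  # 1 2 1 2 1 2
r2 <- rep(c(1, 2), each = 3)   # 1 1 1 2 2 2
```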

If you need to combine several vectors, use the c() function:

x <- 2:3
y <- c(4,6,9)
z <- c(x, y, 10:12, 100)
z
  1. 2
  2. 3
  3. 4
  4. 6
  5. 9
  6. 10
  7. 11
  8. 12
  9. 100

You can view brief descriptive statistics for a vector using the summary() function:

summary(z)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    4.00    9.00   17.44   11.00  100.00 

14.2 Operations on vectors

The advantage of using vectors over writing each value to a separate variable is the ability to perform a single operation on all elements of a vector, or on several vectors simultaneously, for example the arithmetic operations of addition or multiplication.

v1 <- c(1, 3, 5)
v1
v1 * 10
  1. 1
  2. 3
  3. 5
  1. 10
  2. 30
  3. 50

In the same way, the addition of two vectors is an element-wise sum: the 1st element of the vector v1 is added to the 1st element of the vector v2 (for example, 1 + 2), and so on. Thus, the resulting vector has the same length as the vectors v1 and v2.
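A minimal sketch of element-wise addition of two vectors of equal length. The values of v2 are an assumption made for this illustration (only its first element, 2, is implied by the text).

```r
v1 <- c(1, 3, 5)
v2 <- c(2, 4, 6)  # assumed values, chosen only for illustration

total <- v1 + v2  # 3 7 11: (1 + 2), (3 + 4), (5 + 6)
total
length(total)     # 3, the same length as v1 and v2
```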

However, there may be a situation when one of the vectors has a shorter length or even consists of 1 element:

v1 <- c(1, 3, 5, 7)
v2 <- c(2, 4)
v1 + v2
  1. 3
  2. 7
  3. 7
  4. 11

In this case, the shorter vector v2 is recycled: it behaves like c(2, 4, 2, 4), i.e. its values are repeated up to the length of the vector v1, and then the element-wise addition is performed. If v2 consisted of a single element, for example 2, that number would be added to each element of v1. Thus, the resulting vector has the length of the longest of the vectors.

Consider a more complex case where there are vectors with different numbers of elements other than 1:

v1 <- c(2, 3)
v2 <- c(4, 5, 6, 7)
v3 <- c(1, 8, 9)
v1 + v2 + v3
  1. 7
  2. 16
  3. 17
  4. 11

First, note that the interpreter issues a warning here, because the longest length (4) is not a multiple of the length of v3 (3). If the vectors had lengths 2, 4 and 8, there would be no warning.

If you extend each vector to the length of the longest one, repeating the elements cyclically, you get the following set (added elements are marked with *):

v1 <- c(2, 3,*2,*3)
v2 <- c(4, 5, 6, 7)
v3 <- c(1, 8, 9,*1)
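The cyclic extension shown above can be reproduced explicitly with rep_len(), which repeats a vector up to a requested length. This is only a sketch of what recycling does conceptually; the names n, e1, e2, e3, manual and recycled are illustrative.

```r
v1 <- c(2, 3)
v2 <- c(4, 5, 6, 7)
v3 <- c(1, 8, 9)

n <- max(length(v1), length(v2), length(v3))  # 4

# rep_len() repeats a vector cyclically up to the requested length
e1 <- rep_len(v1, n)  # 2 3 2 3
e2 <- rep_len(v2, n)  # 4 5 6 7
e3 <- rep_len(v3, n)  # 1 8 9 1

manual <- e1 + e2 + e3                      # 7 16 17 11
recycled <- suppressWarnings(v1 + v2 + v3)  # same values; the warning is silenced here
identical(manual, recycled)                 # TRUE
```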

Subtraction (-), division (/) and multiplication (*) operations are performed similarly.

Relational and logical operators also act on vectors element by element, but the result is a vector of values of type logical (TRUE/FALSE).

Consider an example of finding all elements of the vector v1 that are greater than the elements of the vector v2 at the same index:

v1 <- c(2, 4, 7, 9, 12)
v2 <- c(6, 4, 6, 7, 1)
v1 > v2
  1. FALSE
  2. FALSE
  3. TRUE
  4. TRUE
  5. TRUE

In essence, each pair of elements of the two vectors is compared: 2 > 6, 4 > 4, 7 > 6, 9 > 7, 12 > 1.

Therefore, the previously studied operators (arithmetic, logical, relational) can be used to work with vectors as well.
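As a small sketch of how relational and logical operators combine on vectors, the code below reuses v1 and v2 from the comparison example. The names greater and both are illustrative; sum(), any() and all() are base R functions that summarise a logical vector.

```r
v1 <- c(2, 4, 7, 9, 12)
v2 <- c(6, 4, 6, 7, 1)

greater <- v1 > v2                   # FALSE FALSE TRUE TRUE TRUE
both <- (v1 > v2) & (v1 %% 2 == 0)   # greater AND even: only the last pair

sum(greater)  # 3, because TRUE counts as 1
any(both)     # TRUE
all(greater)  # FALSE
v1[greater]   # 7 9 12: a logical vector can be used to select elements
```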

14.3 Naming vector elements

To make it clear what the values of a vector mean and what data they describe, analysts often need to label the data.

We will record the number of daily user visits to the site during the week as follows:

# Count of unique site visits from Monday to Sunday
data <- c(1245, 2112, 1321, 1231, 2342, 1718, 1980)

Next, assign the days of the week as names of the values using the names() function:

names(data) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
print(data)
   Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
     1245      2112      1321      1231      2342      1718      1980 

Alternatively, this code could be written as follows:

data <- c(1245, 2112, 1321, 1231, 2342, 1718, 1980)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
names(data) <- days
data
   Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
     1245      2112      1321      1231      2342      1718      1980 

If we need information about the names of the elements, for example to look up the name of the 4th element of the vector, we can call names() without an assignment:

names(data)
  1. 'Monday'
  2. 'Tuesday'
  3. 'Wednesday'
  4. 'Thursday'
  5. 'Friday'
  6. 'Saturday'
  7. 'Sunday'

The names() function thus allows you not only to set the names of vector elements, but also to read them.
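Since names(data) is itself a character vector, a single name can be read by index. A minimal sketch, assuming the same data vector as above; the variable fourth is illustrative.

```r
data <- c(1245, 2112, 1321, 1231, 2342, 1718, 1980)
names(data) <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")

fourth <- names(data)[4]  # name of the 4th element
fourth                    # "Thursday"

# unname() returns the values without their names
unname(data)
```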


14.4 Access to vector elements

Elements inside a vector are indexed from 1 to n, where n is the number of elements of the vector.

  Note. In R, the indexing of arrays, vectors, and all other collection types begins with 1, not with 0.

Consider the previous example:

data <- c(1245, 2112, 1321, 1231, 2342, 1718, 1980)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
names(data) <- days

To get the number of site visitors on Wednesday only, use the [] operator and specify the index of the element in the vector (or a logical condition on its name):

data[3]
data[names(data) == 'Wednesday']
Wednesday: 1321
Wednesday: 1321

If you need several elements of the vector that are not adjacent, you can do it like this:

some_days <- data[c(1, 2, 5)]
some_days
 Monday Tuesday  Friday 
   1245    2112    2342 

The example above shows that the indices passed to the vector data are themselves a vector, c(1, 2, 5), so they can be declared as a separate variable:

indexes <- c(1, 2, 5)
some_days <- data[indexes]
some_days
 Monday Tuesday  Friday 
   1245    2112    2342 

If you need several consecutive elements, it is more convenient (especially when the vector consists, for example, of 1000+ elements) to use the : operator, for example:

working_days <- data[1:5]
working_days
   Monday   Tuesday Wednesday  Thursday    Friday 
     1245      2112      1321      1231      2342 

Thus, all working days of the week are selected for the working_days vector.
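Besides positive positions, the [] operator in R also accepts names, negative indices and logical conditions. The sketch below shows these forms on the same data vector; the names weekend_by_name, weekend_by_drop and busy_days, and the threshold of 2000, are chosen only for illustration.

```r
data <- c(1245, 2112, 1321, 1231, 2342, 1718, 1980)
names(data) <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")

# Select by name
weekend_by_name <- data[c("Saturday", "Sunday")]

# Negative indices drop elements: everything except the first five
weekend_by_drop <- data[-(1:5)]

# A logical condition keeps the elements for which it is TRUE
busy_days <- data[data > 2000]  # Tuesday and Friday

weekend_by_name
busy_days
```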


14.5 Useful functions

Let’s take a look at some useful functions that simplify working with vectors. For the calculations below we will use two vectors, A and B:

A <- c(3, 5, 8, 2, 5, 4, 2)
B <- c(3, NA, 1, NA, 6, 4, 5)
A
B
  1. 3
  2. 5
  3. 8
  4. 2
  5. 5
  6. 4
  7. 2
  1. 3
  2. <NA>
  3. 1
  4. <NA>
  5. 6
  6. 4
  7. 5

Function sum(). This function is used to find the sum of the elements of the collection:

sum(A)
sum(B)
29
<NA>

An interesting point is that when the data contain gaps (NA values), the sum cannot be calculated and the result is NA. In this case, the function can take the additional parameter na.rm = T, where T is an abbreviation of TRUE, which indicates that gaps must be removed from the data before the operation is performed.

Note. Check the documentation to see whether a function has such a parameter. If it does not, the data must be cleaned in another way before use.

sum(B, na.rm = T)
19
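When a function has no na.rm parameter, one common way to clean the data first is to filter with is.na(). A minimal sketch, assuming the vector B from above; B_clean is an illustrative name.

```r
B <- c(3, NA, 1, NA, 6, 4, 5)

# Keep only the elements that are not NA
B_clean <- B[!is.na(B)]  # 3 1 6 4 5

sum(B_clean)     # 19, the same as sum(B, na.rm = TRUE)
length(B_clean)  # 5 observed values
sum(is.na(B))    # 2 gaps in the original vector
```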

The mean() function is used to find the arithmetic mean of numbers:

mean(A)
mean(B, na.rm = T)
4.14285714285714
3.8

The min() and max() functions allow you to find the minimum and maximum values, respectively:

min(A)
max(A)
2
8

R also has a large number of built-in functions for statistical, econometric and other research in economics and beyond. Try the sd(), cov(), cor() functions.
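As a starting point for experimenting with these functions, here is a short sketch. The vector A2 is hypothetical: it is built as a linear function of A so that the correlation is known in advance to be 1. The use = "complete.obs" argument of cor() drops pairs that contain NA.

```r
A <- c(3, 5, 8, 2, 5, 4, 2)
B <- c(3, NA, 1, NA, 6, 4, 5)
A2 <- 2 * A + 1  # hypothetical vector, linearly related to A

sd(A)       # sample standard deviation, equal to sqrt(var(A))
cov(A, A2)  # covariance, here equal to 2 * var(A)
cor(A, A2)  # correlation of a perfect linear relation: 1

# With gaps in the data, restrict the calculation to complete pairs
cor(A, B, use = "complete.obs")
```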

The length() function determines the “length” of a vector, i.e. the number of its elements (NA values are counted too):

length(A)
length(B)
7
7

The unique() function returns the unique elements of a vector:

A
unique(A)

print("---")

B
unique(B)
  1. 3
  2. 5
  3. 8
  4. 2
  5. 5
  6. 4
  7. 2
  1. 3
  2. 5
  3. 8
  4. 2
  5. 4
[1] "---"
  1. 3
  2. <NA>
  3. 1
  4. <NA>
  5. 6
  6. 4
  7. 5
  1. 3
  2. <NA>
  3. 1
  4. 6
  5. 4
  6. 5

The intersect() function allows you to find the common elements of two vectors, so for vectors A and B the common values are 3, 4 and 5:

A
B
intersect(A, B)
  1. 3
  2. 5
  3. 8
  4. 2
  5. 5
  6. 4
  7. 2
  1. 3
  2. <NA>
  3. 1
  4. <NA>
  5. 6
  6. 4
  7. 5
  1. 3
  2. 5
  3. 4

Conversely, the union() function allows you to combine the elements of both sets / vectors:

A
B
union(A, B)
  1. 3
  2. 5
  3. 8
  4. 2
  5. 5
  6. 4
  7. 2
  1. 3
  2. <NA>
  3. 1
  4. <NA>
  5. 6
  6. 4
  7. 5
  1. 3
  2. 5
  3. 8
  4. 2
  5. 4
  6. <NA>
  7. 1
  8. 6

Try to understand the operation of the functions setdiff(), setequal(), is.element().

I recommend reading the short materials here: https://stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html.
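For a first look at these three functions, here is a brief sketch using the same vectors A and B. The names only_in_A and only_in_B are illustrative.

```r
A <- c(3, 5, 8, 2, 5, 4, 2)
B <- c(3, NA, 1, NA, 6, 4, 5)

# setdiff(x, y): unique elements of x that are not in y
only_in_A <- setdiff(A, B)  # 8 2
only_in_B <- setdiff(B, A)  # NA 1 6

# setequal(x, y): TRUE if both contain the same set of values
setequal(A, B)                 # FALSE
setequal(c(1, 2, 2), c(2, 1))  # TRUE: order and duplicates are ignored

# is.element(x, y): is x among the values of y (same as x %in% y)
is.element(8, A)  # TRUE
is.element(8, B)  # FALSE
```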


14.6 Correction of data (NA, NaN, Inf)

When working with data, problems arise with reading it correctly, converting it and performing operations on it. For example, a malformed entry in a numeric field (such as "10a" instead of 10) produces NA on conversion, 0/0 produces NaN, and division of a nonzero number by 0 produces Inf.

Before numerical and other data are used, they usually go through a cleaning stage in which such values are replaced, depending on the goals of the program or study. R has the following special values for missing or undefined data:

  • NA - Not Available.
  • NaN - Not a Number.
  • Inf - Infinity (can carry a sign: + or -).

Let’s start with vector:

vtr <- c(1, -2, NA, NaN, Inf, 1223, -Inf, NA, 21) 
vtr
  1. 1
  2. -2
  3. <NA>
  4. NaN
  5. Inf
  6. 1223
  7. -Inf
  8. <NA>
  9. 21

You can check each element of a vector for these values with the functions is.na(), is.nan(), is.infinite(), is.finite().

is.na(vtr)
is.nan(vtr)
is.infinite(vtr)
is.finite(vtr) # if infinite == TRUE => finite == FALSE :)
  1. FALSE
  2. FALSE
  3. TRUE
  4. TRUE
  5. FALSE
  6. FALSE
  7. FALSE
  8. TRUE
  9. FALSE
  1. FALSE
  2. FALSE
  3. FALSE
  4. TRUE
  5. FALSE
  6. FALSE
  7. FALSE
  8. FALSE
  9. FALSE
  1. FALSE
  2. FALSE
  3. FALSE
  4. FALSE
  5. TRUE
  6. FALSE
  7. TRUE
  8. FALSE
  9. FALSE
  1. TRUE
  2. TRUE
  3. FALSE
  4. FALSE
  5. FALSE
  6. TRUE
  7. FALSE
  8. FALSE
  9. TRUE
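The logical vectors returned by these checks are often summarised rather than printed in full. A small sketch using base R helpers (sum(), which(), anyNA()) on the same vtr; finite_only is an illustrative name.

```r
vtr <- c(1, -2, NA, NaN, Inf, 1223, -Inf, NA, 21)

sum(is.na(vtr))          # 3: two NA plus one NaN
which(is.nan(vtr))       # 4, the position of NaN
which(is.infinite(vtr))  # 5 7, the positions of Inf and -Inf
anyNA(vtr)               # TRUE if at least one NA/NaN is present

# Keep only ordinary numbers
finite_only <- vtr[is.finite(vtr)]  # 1 -2 1223 21
```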

The values can then be replaced as follows (we will replace all NA and NaN with 1000):

vtr[is.na(vtr)] <- 1000
vtr

## NaN is also replaced, because is.na() returns TRUE for NaN!
  1. 1
  2. -2
  3. 1000
  4. 1000
  5. Inf
  6. 1223
  7. -Inf
  8. 1000
  9. 21

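If NaN has to be handled differently from NA, replace it first using is.nan() and only then apply is.na(). Here vtr no longer contains NaN or NA after the previous step, so both outputs below are unchanged: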

vtr[is.nan(vtr)] <- 500
vtr

vtr[is.na(vtr)] <- 1000
vtr

## NaN is also replaced, because is.na() returns TRUE for NaN!
  1. 1
  2. -2
  3. 1000
  4. 1000
  5. Inf
  6. 1223
  7. -Inf
  8. 1000
  9. 21
  1. 1
  2. -2
  3. 1000
  4. 1000
  5. Inf
  6. 1223
  7. -Inf
  8. 1000
  9. 21

And then replace Inf with the maximum value in the vector, and -Inf with the minimum:

vtr <- c(1, -2, NA, NaN, Inf, 1223, -Inf, NA, 21) 
vtr

is.infinite(vtr)
!is.infinite(vtr)
vtr[!is.infinite(vtr)]
max(vtr[!is.infinite(vtr)], na.rm = T)

max(vtr, na.rm = T)
min(vtr, na.rm = T)

finite_max <- max(vtr[is.finite(vtr)]) # largest finite value: 1223
finite_min <- min(vtr[is.finite(vtr)]) # smallest finite value: -2
vtr[which(vtr == Inf)] <- finite_max   # which() skips the NA comparisons
vtr[which(vtr == -Inf)] <- finite_min
vtr
  1. 1
  2. -2
  3. <NA>
  4. NaN
  5. Inf
  6. 1223
  7. -Inf
  8. <NA>
  9. 21
  1. FALSE
  2. FALSE
  3. FALSE
  4. FALSE
  5. TRUE
  6. FALSE
  7. TRUE
  8. FALSE
  9. FALSE
  1. TRUE
  2. TRUE
  3. TRUE
  4. TRUE
  5. FALSE
  6. TRUE
  7. FALSE
  8. TRUE
  9. TRUE
  1. 1
  2. -2
  3. <NA>
  4. NaN
  5. 1223
  6. <NA>
  7. 21
1223
Inf
-Inf
  1. 1
  2. -2
  3. <NA>
  4. NaN
  5. 1223
  6. 1223
  7. -2
  8. <NA>
  9. 21

If you want to replace infinite values regardless of their sign, you can use is.infinite().
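A minimal sketch of that sign-independent replacement. The replacement value 0 is an arbitrary placeholder chosen for this illustration, not a recommendation.

```r
vtr <- c(1, -2, NA, NaN, Inf, 1223, -Inf, NA, 21)

# is.infinite() is TRUE for both Inf and -Inf, and FALSE for NA and NaN
vtr[is.infinite(vtr)] <- 0  # 0 is an arbitrary placeholder
vtr  # 1 -2 NA NaN 0 1223 0 NA 21
```

Unlike a comparison such as vtr == Inf, is.infinite() never returns NA, so it is safe to use as an index even when the vector contains gaps.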


14.7 Tasks

14.7.1 Task 1

  1. Create a vector of 10 random numbers in the range [10; 100]
  2. Replace all odd numbers with NA
  3. Replace all NA with the average value

Solution

x <- sample(10:100, size = 10)
x
  1. 74
  2. 42
  3. 38
  4. 20
  5. 28
  6. 97
  7. 44
  8. 87
  9. 70
  10. 40
x[x %% 2 != 0] <- NA
x
  1. 74
  2. 42
  3. 38
  4. 20
  5. 28
  6. <NA>
  7. 44
  8. <NA>
  9. 70
  10. 40
x[is.na(x)] <- mean(x, na.rm = T)
x
  1. 74
  2. 42
  3. 38
  4. 20
  5. 28
  6. 44.5
  7. 44
  8. 44.5
  9. 70
  10. 40
