11  Exploring data with dplyr


author: Юрій Клебан


11.1 Basic funtions and dataset explore

There are most popular functions in dplyr is listed in table.

dplyr Function Description Equivalent SQL
select() Selecting columns (variables) SELECT
filter() Filter (subset) rows. WHERE
group_by() Group the data GROUP BY
summarise() Summarise (or aggregate) data -
arrange() Sort the data ORDER BY
join() Joining data frames (tables) JOIN
mutate() Creating New Variables COLUMN ALIAS

For the next sample we are going to use gapminder dataset. Go to gapminder dataset description

The gapminder data frame include six variables:

variable meaning
country -
continent -
year -
lifeExp life expectancy at birth
pop total population
gdpPercap per-capita GDP

Per-capita GDP (Gross domestic product) is given in units of international dollars, a hypothetical unit of currency that has the same purchasing power parity that the U.S. dollar had in the United States at a given point in time – 2005, in this case.

The gapminder data frame is a special kind of data frame: a tibble.

library(dplyr) # for demos
#install.packages("gapminder")
library(gapminder)  # load package and dataset
class(gapminder)
  1. 'tbl_df'
  2. 'tbl'
  3. 'data.frame'

Let’s preview it with functions str(), glimpse(), head(), tail(), summary().

str(gapminder)
tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, ~
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, ~
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8~
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12~
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, ~
head(gapminder) #shows first n-rows, 6 by default
A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
Afghanistan Asia 1952 28.801 8425333 779.4453
Afghanistan Asia 1957 30.332 9240934 820.8530
Afghanistan Asia 1962 31.997 10267083 853.1007
Afghanistan Asia 1967 34.020 11537966 836.1971
Afghanistan Asia 1972 36.088 13079460 739.9811
Afghanistan Asia 1977 38.438 14880372 786.1134
tail(gapminder) #shows last n-rows, 6 by default
A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
Zimbabwe Africa 1982 60.363 7636524 788.8550
Zimbabwe Africa 1987 62.351 9216418 706.1573
Zimbabwe Africa 1992 60.377 10704340 693.4208
Zimbabwe Africa 1997 46.809 11404948 792.4500
Zimbabwe Africa 2002 39.989 11926563 672.0386
Zimbabwe Africa 2007 43.487 12311143 469.7093
summary(gapminder)
        country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
 Australia  :  12                  Max.   :2007   Max.   :82.60  
 (Other)    :1632                                                
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1  
                                       

11.2 filter() function

austria <- filter(gapminder, country == "Austria")
austria
A tibble: 12 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
Austria Europe 1952 66.800 6927772 6137.076
Austria Europe 1957 67.480 6965860 8842.598
Austria Europe 1962 69.540 7129864 10750.721
Austria Europe 1967 70.140 7376998 12834.602
Austria Europe 1972 70.630 7544201 16661.626
Austria Europe 1977 72.170 7568430 19749.422
Austria Europe 1982 73.180 7574613 21597.084
Austria Europe 1987 74.940 7578903 23687.826
Austria Europe 1992 76.040 7914969 27042.019
Austria Europe 1997 77.510 8069876 29095.921
Austria Europe 2002 78.980 8148312 32417.608
Austria Europe 2007 79.829 8199783 36126.493

filter() takes logical expressions and returns the rows for which all are TRUE.

# task: select rows with lifeExp less than 31
filter(gapminder, lifeExp < 31)
A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
Afghanistan Asia 1952 28.801 8425333 779.4453
Afghanistan Asia 1957 30.332 9240934 820.8530
Angola Africa 1952 30.015 4232095 3520.6103
Gambia Africa 1952 30.000 284320 485.2307
Rwanda Africa 1992 23.599 7290203 737.0686
Sierra Leone Africa 1952 30.331 2143249 879.7877
# task: select Austria only and year after 1980
filter(gapminder, country == "Austria", year > 1980)
A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
Austria Europe 1982 73.180 7574613 21597.08
Austria Europe 1987 74.940 7578903 23687.83
Austria Europe 1992 76.040 7914969 27042.02
Austria Europe 1997 77.510 8069876 29095.92
Austria Europe 2002 78.980 8148312 32417.61
Austria Europe 2007 79.829 8199783 36126.49
# task: select Austria and Belgium
filter(gapminder, country %in% c("Austria", "Belgium"))
A tibble: 24 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
Austria Europe 1952 66.800 6927772 6137.076
Austria Europe 1957 67.480 6965860 8842.598
Austria Europe 1962 69.540 7129864 10750.721
Austria Europe 1967 70.140 7376998 12834.602
Austria Europe 1972 70.630 7544201 16661.626
Austria Europe 1977 72.170 7568430 19749.422
Austria Europe 1982 73.180 7574613 21597.084
Austria Europe 1987 74.940 7578903 23687.826
Austria Europe 1992 76.040 7914969 27042.019
Austria Europe 1997 77.510 8069876 29095.921
Austria Europe 2002 78.980 8148312 32417.608
Austria Europe 2007 79.829 8199783 36126.493
Belgium Europe 1952 68.000 8730405 8343.105
Belgium Europe 1957 69.240 8989111 9714.961
Belgium Europe 1962 70.250 9218400 10991.207
Belgium Europe 1967 70.940 9556500 13149.041
Belgium Europe 1972 71.440 9709100 16672.144
Belgium Europe 1977 72.800 9821800 19117.974
Belgium Europe 1982 73.930 9856303 20979.846
Belgium Europe 1987 75.350 9870200 22525.563
Belgium Europe 1992 76.460 10045622 25575.571
Belgium Europe 1997 77.530 10199787 27561.197
Belgium Europe 2002 78.320 10311970 30485.884
Belgium Europe 2007 79.441 10392226 33692.605

Lets rewrite initial code and record it to the variable/data.frame:

11.3 Pipe (%>%/|>) operator

%>% is pipe operator. The pipe operator takes the thing on the left-hand-side and pipes it into the function call on the right-hand-side – literally, drops it in as the first argument.

head() function without pipe and top 4 items:

In R version before 4.1.0 pipe %>% operator is not a language build-in and you should install magrittr package:

Pipe opertor in R 4.1+ |>, using this is preferable

#install.packages("magrittr") # for pipe %>% operator
library(magrittr)
head(gapminder, n = 4)
A tibble: 4 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
Afghanistan Asia 1952 28.801 8425333 779.4453
Afghanistan Asia 1957 30.332 9240934 820.8530
Afghanistan Asia 1962 31.997 10267083 853.1007
Afghanistan Asia 1967 34.020 11537966 836.1971

head() function with pipe and top 4 items:

gapminder %>% head(4)
A tibble: 4 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
Afghanistan Asia 1952 28.801 8425333 779.4453
Afghanistan Asia 1957 30.332 9240934 820.8530
Afghanistan Asia 1962 31.997 10267083 853.1007
Afghanistan Asia 1967 34.020 11537966 836.1971

Output is the same. So, let’s rewrire filtering for Austria with pipe:

austria <- gapminder |> filter(country == "Austria")
austria
A tibble: 12 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
Austria Europe 1952 66.800 6927772 6137.076
Austria Europe 1957 67.480 6965860 8842.598
Austria Europe 1962 69.540 7129864 10750.721
Austria Europe 1967 70.140 7376998 12834.602
Austria Europe 1972 70.630 7544201 16661.626
Austria Europe 1977 72.170 7568430 19749.422
Austria Europe 1982 73.180 7574613 21597.084
Austria Europe 1987 74.940 7578903 23687.826
Austria Europe 1992 76.040 7914969 27042.019
Austria Europe 1997 77.510 8069876 29095.921
Austria Europe 2002 78.980 8148312 32417.608
Austria Europe 2007 79.829 8199783 36126.493
# add more conditions in filter
austria <- gapminder |> filter(country == "Austria", year > 2000)
austria
A tibble: 2 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
Austria Europe 2002 78.980 8148312 32417.61
Austria Europe 2007 79.829 8199783 36126.49

11.4 select() function

Use select() to subset the data on variables/columns by names or index. You also can define order of columns with select().

gapminder |> 
select(year, country, pop) |>
slice(1: 10)
A tibble: 10 × 3
year country pop
<int> <fct> <int>
1952 Afghanistan 8425333
1957 Afghanistan 9240934
1962 Afghanistan 10267083
1967 Afghanistan 11537966
1972 Afghanistan 13079460
1977 Afghanistan 14880372
1982 Afghanistan 12881816
1987 Afghanistan 13867957
1992 Afghanistan 16317921
1997 Afghanistan 22227415

Lets combine few functions with pipe (%>%):

Finally, lest extend our filtering:

# compare dplyr syntax with base R call
gapminder[gapminder$country == "Austria", c("year", "pop", "lifeExp")]

gapminder |> 
    filter(country == "Austria") |>
    select(year, pop, lifeExp)
A tibble: 12 × 3
year pop lifeExp
<int> <int> <dbl>
1952 6927772 66.800
1957 6965860 67.480
1962 7129864 69.540
1967 7376998 70.140
1972 7544201 70.630
1977 7568430 72.170
1982 7574613 73.180
1987 7578903 74.940
1992 7914969 76.040
1997 8069876 77.510
2002 8148312 78.980
2007 8199783 79.829
A tibble: 12 × 3
year pop lifeExp
<int> <int> <dbl>
1952 6927772 66.800
1957 6965860 67.480
1962 7129864 69.540
1967 7376998 70.140
1972 7544201 70.630
1977 7568430 72.170
1982 7574613 73.180
1987 7578903 74.940
1992 7914969 76.040
1997 8069876 77.510
2002 8148312 78.980
2007 8199783 79.829

You can remove some columns using minus(operator) and add few filter conditions:

austria <- gapminder |> 
                filter(country == "Austria", year > 2000) |>
                select(-continent, -gdpPercap) |>
                head()
austria
A tibble: 2 × 4
country year lifeExp pop
<fct> <int> <dbl> <int>
Austria 2002 78.980 8148312
Austria 2007 79.829 8199783

You can insert different conditions about columns you need to select.

gapminder |>
    select(!where(is.numeric)) |>  # its 1704 records, because of repeating some records
    slice(1:5)
A tibble: 5 × 2
country continent
<fct> <fct>
Afghanistan Asia
Afghanistan Asia
Afghanistan Asia
Afghanistan Asia
Afghanistan Asia

Let’s output all unique pairs continent -> country with distinct() function:

gapminder |>
    select(country) |>
    distinct() # its 142 records now
A tibble: 142 × 1
country
<fct>
Afghanistan
Albania
Algeria
Angola
Argentina
Australia
Austria
Bahrain
Bangladesh
Belgium
Benin
Bolivia
Bosnia and Herzegovina
Botswana
Brazil
Bulgaria
Burkina Faso
Burundi
Cambodia
Cameroon
Canada
Central African Republic
Chad
Chile
China
Colombia
Comoros
Congo, Dem. Rep.
Congo, Rep.
Costa Rica
Sierra Leone
Singapore
Slovak Republic
Slovenia
Somalia
South Africa
Spain
Sri Lanka
Sudan
Swaziland
Sweden
Switzerland
Syria
Taiwan
Tanzania
Thailand
Togo
Trinidad and Tobago
Tunisia
Turkey
Uganda
United Kingdom
United States
Uruguay
Venezuela
Vietnam
West Bank and Gaza
Yemen, Rep.
Zambia
Zimbabwe

11.5 Selecting random \(N\) rows

The sample_n() function selects random rows from a data frame

gapminder |> sample_n(5)
A tibble: 5 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
Norway Europe 1967 74.080 3786019 16361.8765
Central African Republic Africa 2002 43.308 4048013 738.6906
Uruguay Americas 2007 76.384 3447496 10611.4630
Togo Africa 1997 58.390 4320890 982.2869
Paraguay Americas 2002 70.755 5884491 3783.6742

If you want make pseudo-random generation reprodusable use set.seed(). Seed is start point of random generation. Different seeds give different output.

set.seed(2023) # example, seed = 2023

The sample_frac() function selects random fraction rows from a data frame. Let’s select \(1\%\) of data

set.seed(2023) # output not changing, uncomment it 
gapminder %>% sample_frac(0.1)
A tibble: 170 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
Switzerland Europe 2007 81.701 7554661 37506.4191
Djibouti Africa 2002 53.373 447416 1908.2609
Slovenia Europe 1972 69.820 1694510 12383.4862
Sao Tome and Principe Africa 1997 63.306 145608 1339.0760
Turkey Europe 1987 63.108 52881328 5089.0437
Lebanon Asia 1957 59.489 1647412 6089.7869
Eritrea Africa 1972 44.142 2260187 514.3242
Philippines Asia 1972 58.065 40850141 1989.3741
Tunisia Africa 1972 55.602 5303507 2753.2860
Uganda Africa 1952 39.978 5824797 734.7535
Oman Asia 1972 52.143 829050 10618.0385
Australia Oceania 2007 81.235 20434176 34435.3674
Mali Africa 2002 51.818 10580176 951.4098
Equatorial Guinea Africa 1957 35.983 232922 426.0964
South Africa Africa 1982 58.161 31140029 8568.2662
Burundi Africa 1962 42.045 2961915 355.2032
Angola Africa 1992 40.647 8735988 2627.8457
Yemen, Rep. Asia 1952 32.548 4963829 781.7176
Croatia Europe 2007 75.748 4493312 14619.2227
Oman Asia 1992 71.197 1915208 18616.7069
Thailand Asia 1962 56.061 29263397 1002.1992
Comoros Africa 1952 40.715 153936 1102.9909
Eritrea Africa 1957 38.047 1542611 344.1619
Zambia Africa 2002 39.193 10595811 1071.6139
Cote d'Ivoire Africa 1987 54.655 10761098 2156.9561
South Africa Africa 1957 47.985 16151549 5487.1042
Paraguay Americas 1957 63.196 1770902 2046.1547
Kuwait Asia 1952 55.565 160000 108382.3529
Brazil Americas 1952 50.917 56602560 2108.9444
Canada Americas 1957 69.960 17010154 12489.9501
Swaziland Africa 1997 54.289 1054486 3876.7685
Myanmar Asia 2002 59.908 45598081 611.0000
Sao Tome and Principe Africa 1987 61.728 110812 1516.5255
Ghana Africa 1977 51.756 10538093 993.2240
Guinea-Bissau Africa 1997 44.873 1193708 796.6645
Guinea Africa 1992 48.576 6990574 794.3484
Haiti Americas 1957 40.696 3507701 1726.8879
Sao Tome and Principe Africa 2007 65.528 199579 1598.4351
Comoros Africa 1997 60.660 527982 1173.6182
Equatorial Guinea Africa 1972 40.516 277603 672.4123
Oman Asia 1982 62.728 1301048 12954.7910
Namibia Africa 1977 56.437 977026 3876.4860
Congo, Dem. Rep. Africa 1952 39.143 14100005 780.5423
Hong Kong, China Asia 1977 73.600 4583700 11186.1413
Bolivia Americas 1997 62.050 7693188 3326.1432
Panama Americas 2002 74.712 2990875 7356.0319
Nigeria Africa 1952 36.324 33119096 1077.2819
Malaysia Asia 2007 74.241 24821286 12451.6558
Japan Asia 1952 63.030 86459025 3216.9563
Albania Europe 1967 66.220 1984060 2760.1969
Portugal Europe 1997 75.970 10156415 17641.0316
Uruguay Americas 1952 66.071 2252965 5716.7667
Afghanistan Asia 1972 36.088 13079460 739.9811
Syria Asia 1987 66.974 11242847 3116.7743
Libya Africa 2002 72.737 5368585 9534.6775
Mauritania Africa 1962 44.248 1146757 1055.8960
Trinidad and Tobago Americas 1992 69.862 1183669 7370.9909
Netherlands Europe 1962 73.230 11805689 12790.8496
Reunion Africa 2007 76.442 798094 7670.1226
Honduras Americas 1957 44.665 1770390 2220.4877

11.6 Refences

  1. dplyr: A Grammar of Data Manipulation on https://cran.r-project.org/.
  2. Data Transformation with splyr::cheat sheet.
  3. DPLYR TUTORIAL : DATA MANIPULATION (50 EXAMPLES) by Deepanshu Bhalla.
  4. Dplyr Intro by Stat 545. 6.R Dplyr Tutorial: Data Manipulation(Join) & Cleaning(Spread). Introduction to Data Analysis
  5. Loan Default Prediction. Beginners data set for financial analytics Kaggle