library(dplyr) # for demos
#install.packages("gapminder")
library(gapminder) # load package and dataset
class(gapminder)- 'tbl_df'
- 'tbl'
- 'data.frame'
dplyrauthor: Юрій Клебан
There are most popular functions in dplyr is listed in table.
| dplyr Function | Description | Equivalent SQL |
|---|---|---|
| select() | Selecting columns (variables) | SELECT |
| filter() | Filter (subset) rows. | WHERE |
| group_by() | Group the data | GROUP BY |
| summarise() | Summarise (or aggregate) data | - |
| arrange() | Sort the data | ORDER BY |
| join() | Joining data frames (tables) | JOIN |
| mutate() | Creating New Variables | COLUMN ALIAS |
For the next sample we are going to use gapminder dataset. Go to gapminder dataset description
The gapminder data frame include six variables:
| variable | meaning |
|---|---|
| country | - |
| continent | - |
| year | - |
| lifeExp | life expectancy at birth |
| pop | total population |
| gdpPercap | per-capita GDP |
Per-capita GDP (Gross domestic product) is given in units of international dollars, a hypothetical unit of currency that has the same purchasing power parity that the U.S. dollar had in the United States at a given point in time – 2005, in this case.
The gapminder data frame is a special kind of data frame: a tibble.
library(dplyr) # for demos
#install.packages("gapminder")
library(gapminder) # load package and dataset
class(gapminder)Let’s preview it with functions str(), glimpse(), head(), tail(), summary().
tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
$ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, ~
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, ~
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8~
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12~
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, ~
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| <fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| <fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
| Zimbabwe | Africa | 1982 | 60.363 | 7636524 | 788.8550 |
| Zimbabwe | Africa | 1987 | 62.351 | 9216418 | 706.1573 |
| Zimbabwe | Africa | 1992 | 60.377 | 10704340 | 693.4208 |
| Zimbabwe | Africa | 1997 | 46.809 | 11404948 | 792.4500 |
| Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.0386 |
| Zimbabwe | Africa | 2007 | 43.487 | 12311143 | 469.7093 |
country continent year lifeExp
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
Algeria : 12 Asia :396 Median :1980 Median :60.71
Angola : 12 Europe :360 Mean :1980 Mean :59.47
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
Australia : 12 Max. :2007 Max. :82.60
(Other) :1632
pop gdpPercap
Min. :6.001e+04 Min. : 241.2
1st Qu.:2.794e+06 1st Qu.: 1202.1
Median :7.024e+06 Median : 3531.8
Mean :2.960e+07 Mean : 7215.3
3rd Qu.:1.959e+07 3rd Qu.: 9325.5
Max. :1.319e+09 Max. :113523.1
filter() function| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| <fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
| Austria | Europe | 1952 | 66.800 | 6927772 | 6137.076 |
| Austria | Europe | 1957 | 67.480 | 6965860 | 8842.598 |
| Austria | Europe | 1962 | 69.540 | 7129864 | 10750.721 |
| Austria | Europe | 1967 | 70.140 | 7376998 | 12834.602 |
| Austria | Europe | 1972 | 70.630 | 7544201 | 16661.626 |
| Austria | Europe | 1977 | 72.170 | 7568430 | 19749.422 |
| Austria | Europe | 1982 | 73.180 | 7574613 | 21597.084 |
| Austria | Europe | 1987 | 74.940 | 7578903 | 23687.826 |
| Austria | Europe | 1992 | 76.040 | 7914969 | 27042.019 |
| Austria | Europe | 1997 | 77.510 | 8069876 | 29095.921 |
| Austria | Europe | 2002 | 78.980 | 8148312 | 32417.608 |
| Austria | Europe | 2007 | 79.829 | 8199783 | 36126.493 |
filter() takes logical expressions and returns the rows for which all are TRUE.
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| <fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Angola | Africa | 1952 | 30.015 | 4232095 | 3520.6103 |
| Gambia | Africa | 1952 | 30.000 | 284320 | 485.2307 |
| Rwanda | Africa | 1992 | 23.599 | 7290203 | 737.0686 |
| Sierra Leone | Africa | 1952 | 30.331 | 2143249 | 879.7877 |
# task: select Austria only and year after 1980
filter(gapminder, country == "Austria", year > 1980)| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| <fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
| Austria | Europe | 1982 | 73.180 | 7574613 | 21597.08 |
| Austria | Europe | 1987 | 74.940 | 7578903 | 23687.83 |
| Austria | Europe | 1992 | 76.040 | 7914969 | 27042.02 |
| Austria | Europe | 1997 | 77.510 | 8069876 | 29095.92 |
| Austria | Europe | 2002 | 78.980 | 8148312 | 32417.61 |
| Austria | Europe | 2007 | 79.829 | 8199783 | 36126.49 |
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| <fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
| Austria | Europe | 1952 | 66.800 | 6927772 | 6137.076 |
| Austria | Europe | 1957 | 67.480 | 6965860 | 8842.598 |
| Austria | Europe | 1962 | 69.540 | 7129864 | 10750.721 |
| Austria | Europe | 1967 | 70.140 | 7376998 | 12834.602 |
| Austria | Europe | 1972 | 70.630 | 7544201 | 16661.626 |
| Austria | Europe | 1977 | 72.170 | 7568430 | 19749.422 |
| Austria | Europe | 1982 | 73.180 | 7574613 | 21597.084 |
| Austria | Europe | 1987 | 74.940 | 7578903 | 23687.826 |
| Austria | Europe | 1992 | 76.040 | 7914969 | 27042.019 |
| Austria | Europe | 1997 | 77.510 | 8069876 | 29095.921 |
| Austria | Europe | 2002 | 78.980 | 8148312 | 32417.608 |
| Austria | Europe | 2007 | 79.829 | 8199783 | 36126.493 |
| Belgium | Europe | 1952 | 68.000 | 8730405 | 8343.105 |
| Belgium | Europe | 1957 | 69.240 | 8989111 | 9714.961 |
| Belgium | Europe | 1962 | 70.250 | 9218400 | 10991.207 |
| Belgium | Europe | 1967 | 70.940 | 9556500 | 13149.041 |
| Belgium | Europe | 1972 | 71.440 | 9709100 | 16672.144 |
| Belgium | Europe | 1977 | 72.800 | 9821800 | 19117.974 |
| Belgium | Europe | 1982 | 73.930 | 9856303 | 20979.846 |
| Belgium | Europe | 1987 | 75.350 | 9870200 | 22525.563 |
| Belgium | Europe | 1992 | 76.460 | 10045622 | 25575.571 |
| Belgium | Europe | 1997 | 77.530 | 10199787 | 27561.197 |
| Belgium | Europe | 2002 | 78.320 | 10311970 | 30485.884 |
| Belgium | Europe | 2007 | 79.441 | 10392226 | 33692.605 |
Lets rewrite initial code and record it to the variable/data.frame:
%>%/|>) operator%>% is pipe operator. The pipe operator takes the thing on the left-hand-side and pipes it into the function call on the right-hand-side – literally, drops it in as the first argument.
head() function without pipe and top 4 items:
In R version before 4.1.0
pipe%>%operator is not a language build-in and you should installmagrittrpackage:
Pipe opertor in R 4.1+
|>, using this is preferable
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| <fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
head() function with pipe and top 4 items:
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| <fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
Output is the same. So, let’s rewrire filtering for Austria with pipe:
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| <fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
| Austria | Europe | 1952 | 66.800 | 6927772 | 6137.076 |
| Austria | Europe | 1957 | 67.480 | 6965860 | 8842.598 |
| Austria | Europe | 1962 | 69.540 | 7129864 | 10750.721 |
| Austria | Europe | 1967 | 70.140 | 7376998 | 12834.602 |
| Austria | Europe | 1972 | 70.630 | 7544201 | 16661.626 |
| Austria | Europe | 1977 | 72.170 | 7568430 | 19749.422 |
| Austria | Europe | 1982 | 73.180 | 7574613 | 21597.084 |
| Austria | Europe | 1987 | 74.940 | 7578903 | 23687.826 |
| Austria | Europe | 1992 | 76.040 | 7914969 | 27042.019 |
| Austria | Europe | 1997 | 77.510 | 8069876 | 29095.921 |
| Austria | Europe | 2002 | 78.980 | 8148312 | 32417.608 |
| Austria | Europe | 2007 | 79.829 | 8199783 | 36126.493 |
# add more conditions in filter
austria <- gapminder |> filter(country == "Austria", year > 2000)
austria| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| <fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
| Austria | Europe | 2002 | 78.980 | 8148312 | 32417.61 |
| Austria | Europe | 2007 | 79.829 | 8199783 | 36126.49 |
select() functionUse select() to subset the data on variables/columns by names or index. You also can define order of columns with select().
| year | country | pop |
|---|---|---|
| <int> | <fct> | <int> |
| 1952 | Afghanistan | 8425333 |
| 1957 | Afghanistan | 9240934 |
| 1962 | Afghanistan | 10267083 |
| 1967 | Afghanistan | 11537966 |
| 1972 | Afghanistan | 13079460 |
| 1977 | Afghanistan | 14880372 |
| 1982 | Afghanistan | 12881816 |
| 1987 | Afghanistan | 13867957 |
| 1992 | Afghanistan | 16317921 |
| 1997 | Afghanistan | 22227415 |
Lets combine few functions with pipe (%>%):
Finally, lest extend our filtering:
# compare dplyr syntax with base R call
gapminder[gapminder$country == "Austria", c("year", "pop", "lifeExp")]
gapminder |>
filter(country == "Austria") |>
select(year, pop, lifeExp)| year | pop | lifeExp |
|---|---|---|
| <int> | <int> | <dbl> |
| 1952 | 6927772 | 66.800 |
| 1957 | 6965860 | 67.480 |
| 1962 | 7129864 | 69.540 |
| 1967 | 7376998 | 70.140 |
| 1972 | 7544201 | 70.630 |
| 1977 | 7568430 | 72.170 |
| 1982 | 7574613 | 73.180 |
| 1987 | 7578903 | 74.940 |
| 1992 | 7914969 | 76.040 |
| 1997 | 8069876 | 77.510 |
| 2002 | 8148312 | 78.980 |
| 2007 | 8199783 | 79.829 |
| year | pop | lifeExp |
|---|---|---|
| <int> | <int> | <dbl> |
| 1952 | 6927772 | 66.800 |
| 1957 | 6965860 | 67.480 |
| 1962 | 7129864 | 69.540 |
| 1967 | 7376998 | 70.140 |
| 1972 | 7544201 | 70.630 |
| 1977 | 7568430 | 72.170 |
| 1982 | 7574613 | 73.180 |
| 1987 | 7578903 | 74.940 |
| 1992 | 7914969 | 76.040 |
| 1997 | 8069876 | 77.510 |
| 2002 | 8148312 | 78.980 |
| 2007 | 8199783 | 79.829 |
You can remove some columns using minus(operator) and add few filter conditions:
austria <- gapminder |>
filter(country == "Austria", year > 2000) |>
select(-continent, -gdpPercap) |>
head()
austria| country | year | lifeExp | pop |
|---|---|---|---|
| <fct> | <int> | <dbl> | <int> |
| Austria | 2002 | 78.980 | 8148312 |
| Austria | 2007 | 79.829 | 8199783 |
You can insert different conditions about columns you need to select.
gapminder |>
select(!where(is.numeric)) |> # its 1704 records, because of repeating some records
slice(1:5)| country | continent |
|---|---|
| <fct> | <fct> |
| Afghanistan | Asia |
| Afghanistan | Asia |
| Afghanistan | Asia |
| Afghanistan | Asia |
| Afghanistan | Asia |
Let’s output all unique pairs continent -> country with distinct() function:
| country |
|---|
| <fct> |
| Afghanistan |
| Albania |
| Algeria |
| Angola |
| Argentina |
| Australia |
| Austria |
| Bahrain |
| Bangladesh |
| Belgium |
| Benin |
| Bolivia |
| Bosnia and Herzegovina |
| Botswana |
| Brazil |
| Bulgaria |
| Burkina Faso |
| Burundi |
| Cambodia |
| Cameroon |
| Canada |
| Central African Republic |
| Chad |
| Chile |
| China |
| Colombia |
| Comoros |
| Congo, Dem. Rep. |
| Congo, Rep. |
| Costa Rica |
| ⋮ |
| Sierra Leone |
| Singapore |
| Slovak Republic |
| Slovenia |
| Somalia |
| South Africa |
| Spain |
| Sri Lanka |
| Sudan |
| Swaziland |
| Sweden |
| Switzerland |
| Syria |
| Taiwan |
| Tanzania |
| Thailand |
| Togo |
| Trinidad and Tobago |
| Tunisia |
| Turkey |
| Uganda |
| United Kingdom |
| United States |
| Uruguay |
| Venezuela |
| Vietnam |
| West Bank and Gaza |
| Yemen, Rep. |
| Zambia |
| Zimbabwe |
The sample_n() function selects random rows from a data frame
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| <fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
| Norway | Europe | 1967 | 74.080 | 3786019 | 16361.8765 |
| Central African Republic | Africa | 2002 | 43.308 | 4048013 | 738.6906 |
| Uruguay | Americas | 2007 | 76.384 | 3447496 | 10611.4630 |
| Togo | Africa | 1997 | 58.390 | 4320890 | 982.2869 |
| Paraguay | Americas | 2002 | 70.755 | 5884491 | 3783.6742 |
If you want make pseudo-random generation reprodusable use set.seed(). Seed is start point of random generation. Different seeds give different output.
The sample_frac() function selects random fraction rows from a data frame. Let’s select \(1\%\) of data
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| <fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
| Switzerland | Europe | 2007 | 81.701 | 7554661 | 37506.4191 |
| Djibouti | Africa | 2002 | 53.373 | 447416 | 1908.2609 |
| Slovenia | Europe | 1972 | 69.820 | 1694510 | 12383.4862 |
| Sao Tome and Principe | Africa | 1997 | 63.306 | 145608 | 1339.0760 |
| Turkey | Europe | 1987 | 63.108 | 52881328 | 5089.0437 |
| Lebanon | Asia | 1957 | 59.489 | 1647412 | 6089.7869 |
| Eritrea | Africa | 1972 | 44.142 | 2260187 | 514.3242 |
| Philippines | Asia | 1972 | 58.065 | 40850141 | 1989.3741 |
| Tunisia | Africa | 1972 | 55.602 | 5303507 | 2753.2860 |
| Uganda | Africa | 1952 | 39.978 | 5824797 | 734.7535 |
| Oman | Asia | 1972 | 52.143 | 829050 | 10618.0385 |
| Australia | Oceania | 2007 | 81.235 | 20434176 | 34435.3674 |
| Mali | Africa | 2002 | 51.818 | 10580176 | 951.4098 |
| Equatorial Guinea | Africa | 1957 | 35.983 | 232922 | 426.0964 |
| South Africa | Africa | 1982 | 58.161 | 31140029 | 8568.2662 |
| Burundi | Africa | 1962 | 42.045 | 2961915 | 355.2032 |
| Angola | Africa | 1992 | 40.647 | 8735988 | 2627.8457 |
| Yemen, Rep. | Asia | 1952 | 32.548 | 4963829 | 781.7176 |
| Croatia | Europe | 2007 | 75.748 | 4493312 | 14619.2227 |
| Oman | Asia | 1992 | 71.197 | 1915208 | 18616.7069 |
| Thailand | Asia | 1962 | 56.061 | 29263397 | 1002.1992 |
| Comoros | Africa | 1952 | 40.715 | 153936 | 1102.9909 |
| Eritrea | Africa | 1957 | 38.047 | 1542611 | 344.1619 |
| Zambia | Africa | 2002 | 39.193 | 10595811 | 1071.6139 |
| Cote d'Ivoire | Africa | 1987 | 54.655 | 10761098 | 2156.9561 |
| South Africa | Africa | 1957 | 47.985 | 16151549 | 5487.1042 |
| Paraguay | Americas | 1957 | 63.196 | 1770902 | 2046.1547 |
| Kuwait | Asia | 1952 | 55.565 | 160000 | 108382.3529 |
| Brazil | Americas | 1952 | 50.917 | 56602560 | 2108.9444 |
| Canada | Americas | 1957 | 69.960 | 17010154 | 12489.9501 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| Swaziland | Africa | 1997 | 54.289 | 1054486 | 3876.7685 |
| Myanmar | Asia | 2002 | 59.908 | 45598081 | 611.0000 |
| Sao Tome and Principe | Africa | 1987 | 61.728 | 110812 | 1516.5255 |
| Ghana | Africa | 1977 | 51.756 | 10538093 | 993.2240 |
| Guinea-Bissau | Africa | 1997 | 44.873 | 1193708 | 796.6645 |
| Guinea | Africa | 1992 | 48.576 | 6990574 | 794.3484 |
| Haiti | Americas | 1957 | 40.696 | 3507701 | 1726.8879 |
| Sao Tome and Principe | Africa | 2007 | 65.528 | 199579 | 1598.4351 |
| Comoros | Africa | 1997 | 60.660 | 527982 | 1173.6182 |
| Equatorial Guinea | Africa | 1972 | 40.516 | 277603 | 672.4123 |
| Oman | Asia | 1982 | 62.728 | 1301048 | 12954.7910 |
| Namibia | Africa | 1977 | 56.437 | 977026 | 3876.4860 |
| Congo, Dem. Rep. | Africa | 1952 | 39.143 | 14100005 | 780.5423 |
| Hong Kong, China | Asia | 1977 | 73.600 | 4583700 | 11186.1413 |
| Bolivia | Americas | 1997 | 62.050 | 7693188 | 3326.1432 |
| Panama | Americas | 2002 | 74.712 | 2990875 | 7356.0319 |
| Nigeria | Africa | 1952 | 36.324 | 33119096 | 1077.2819 |
| Malaysia | Asia | 2007 | 74.241 | 24821286 | 12451.6558 |
| Japan | Asia | 1952 | 63.030 | 86459025 | 3216.9563 |
| Albania | Europe | 1967 | 66.220 | 1984060 | 2760.1969 |
| Portugal | Europe | 1997 | 75.970 | 10156415 | 17641.0316 |
| Uruguay | Americas | 1952 | 66.071 | 2252965 | 5716.7667 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
| Syria | Asia | 1987 | 66.974 | 11242847 | 3116.7743 |
| Libya | Africa | 2002 | 72.737 | 5368585 | 9534.6775 |
| Mauritania | Africa | 1962 | 44.248 | 1146757 | 1055.8960 |
| Trinidad and Tobago | Americas | 1992 | 69.862 | 1183669 | 7370.9909 |
| Netherlands | Europe | 1962 | 73.230 | 11805689 | 12790.8496 |
| Reunion | Africa | 2007 | 76.442 | 798094 | 7670.1226 |
| Honduras | Americas | 1957 | 44.665 | 1770390 | 2220.4877 |