11 Exploring data with dplyr – Основи роботи з даними в R

11.1 Basic funtions and dataset explore

There are most popular functions in dplyr is listed in table.

dplyr Function	Description	Equivalent SQL
select()	Selecting columns (variables)	SELECT
filter()	Filter (subset) rows.	WHERE
group_by()	Group the data	GROUP BY
summarise()	Summarise (or aggregate) data	-
arrange()	Sort the data	ORDER BY
join()	Joining data frames (tables)	JOIN
mutate()	Creating New Variables	COLUMN ALIAS

For the next sample we are going to use gapminder dataset. Go to gapminder dataset description

The gapminder data frame include six variables:

variable	meaning
country	-
continent	-
year	-
lifeExp	life expectancy at birth
pop	total population
gdpPercap	per-capita GDP

Per-capita GDP (Gross domestic product) is given in units of international dollars, a hypothetical unit of currency that has the same purchasing power parity that the U.S. dollar had in the United States at a given point in time – 2005, in this case.

The gapminder data frame is a special kind of data frame: a tibble.

library(dplyr) # for demos
#install.packages("gapminder")
library(gapminder)  # load package and dataset
class(gapminder)

'tbl_df'
'tbl'
'data.frame'

Let’s preview it with functions str(), glimpse(), head(), tail(), summary().

str(gapminder)

tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

glimpse(gapminder)

Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, ~
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, ~
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8~
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12~
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, ~

head(gapminder) #shows first n-rows, 6 by default

A tibble: 6 × 6
country	continent	year	lifeExp	pop	gdpPercap
<fct>	<fct>	<int>	<dbl>	<int>	<dbl>
Afghanistan	Asia	1952	28.801	8425333	779.4453
Afghanistan	Asia	1957	30.332	9240934	820.8530
Afghanistan	Asia	1962	31.997	10267083	853.1007
Afghanistan	Asia	1967	34.020	11537966	836.1971
Afghanistan	Asia	1972	36.088	13079460	739.9811
Afghanistan	Asia	1977	38.438	14880372	786.1134

tail(gapminder) #shows last n-rows, 6 by default

A tibble: 6 × 6
country	continent	year	lifeExp	pop	gdpPercap
<fct>	<fct>	<int>	<dbl>	<int>	<dbl>
Zimbabwe	Africa	1982	60.363	7636524	788.8550
Zimbabwe	Africa	1987	62.351	9216418	706.1573
Zimbabwe	Africa	1992	60.377	10704340	693.4208
Zimbabwe	Africa	1997	46.809	11404948	792.4500
Zimbabwe	Africa	2002	39.989	11926563	672.0386
Zimbabwe	Africa	2007	43.487	12311143	469.7093

summary(gapminder)

        country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
 Australia  :  12                  Max.   :2007   Max.   :82.60  
 (Other)    :1632                                                
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1

11.2 `filter()` function

austria <- filter(gapminder, country == "Austria")
austria

A tibble: 12 × 6
country	continent	year	lifeExp	pop	gdpPercap
<fct>	<fct>	<int>	<dbl>	<int>	<dbl>
Austria	Europe	1952	66.800	6927772	6137.076
Austria	Europe	1957	67.480	6965860	8842.598
Austria	Europe	1962	69.540	7129864	10750.721
Austria	Europe	1967	70.140	7376998	12834.602
Austria	Europe	1972	70.630	7544201	16661.626
Austria	Europe	1977	72.170	7568430	19749.422
Austria	Europe	1982	73.180	7574613	21597.084
Austria	Europe	1987	74.940	7578903	23687.826
Austria	Europe	1992	76.040	7914969	27042.019
Austria	Europe	1997	77.510	8069876	29095.921
Austria	Europe	2002	78.980	8148312	32417.608
Austria	Europe	2007	79.829	8199783	36126.493

filter() takes logical expressions and returns the rows for which all are TRUE.

# task: select rows with lifeExp less than 31
filter(gapminder, lifeExp < 31)

A tibble: 6 × 6
country	continent	year	lifeExp	pop	gdpPercap
<fct>	<fct>	<int>	<dbl>	<int>	<dbl>
Afghanistan	Asia	1952	28.801	8425333	779.4453
Afghanistan	Asia	1957	30.332	9240934	820.8530
Angola	Africa	1952	30.015	4232095	3520.6103
Gambia	Africa	1952	30.000	284320	485.2307
Rwanda	Africa	1992	23.599	7290203	737.0686
Sierra Leone	Africa	1952	30.331	2143249	879.7877

# task: select Austria only and year after 1980
filter(gapminder, country == "Austria", year > 1980)

A tibble: 6 × 6
country	continent	year	lifeExp	pop	gdpPercap
<fct>	<fct>	<int>	<dbl>	<int>	<dbl>
Austria	Europe	1982	73.180	7574613	21597.08
Austria	Europe	1987	74.940	7578903	23687.83
Austria	Europe	1992	76.040	7914969	27042.02
Austria	Europe	1997	77.510	8069876	29095.92
Austria	Europe	2002	78.980	8148312	32417.61
Austria	Europe	2007	79.829	8199783	36126.49

# task: select Austria and Belgium
filter(gapminder, country %in% c("Austria", "Belgium"))

A tibble: 24 × 6
country	continent	year	lifeExp	pop	gdpPercap
<fct>	<fct>	<int>	<dbl>	<int>	<dbl>
Austria	Europe	1952	66.800	6927772	6137.076
Austria	Europe	1957	67.480	6965860	8842.598
Austria	Europe	1962	69.540	7129864	10750.721
Austria	Europe	1967	70.140	7376998	12834.602
Austria	Europe	1972	70.630	7544201	16661.626
Austria	Europe	1977	72.170	7568430	19749.422
Austria	Europe	1982	73.180	7574613	21597.084
Austria	Europe	1987	74.940	7578903	23687.826
Austria	Europe	1992	76.040	7914969	27042.019
Austria	Europe	1997	77.510	8069876	29095.921
Austria	Europe	2002	78.980	8148312	32417.608
Austria	Europe	2007	79.829	8199783	36126.493
Belgium	Europe	1952	68.000	8730405	8343.105
Belgium	Europe	1957	69.240	8989111	9714.961
Belgium	Europe	1962	70.250	9218400	10991.207
Belgium	Europe	1967	70.940	9556500	13149.041
Belgium	Europe	1972	71.440	9709100	16672.144
Belgium	Europe	1977	72.800	9821800	19117.974
Belgium	Europe	1982	73.930	9856303	20979.846
Belgium	Europe	1987	75.350	9870200	22525.563
Belgium	Europe	1992	76.460	10045622	25575.571
Belgium	Europe	1997	77.530	10199787	27561.197
Belgium	Europe	2002	78.320	10311970	30485.884
Belgium	Europe	2007	79.441	10392226	33692.605

Lets rewrite initial code and record it to the variable/data.frame:

11.3 Pipe (`%>%`/`|>`) operator

%>% is pipe operator. The pipe operator takes the thing on the left-hand-side and pipes it into the function call on the right-hand-side – literally, drops it in as the first argument.

head() function without pipe and top 4 items:

In R version before 4.1.0 pipe %>% operator is not a language build-in and you should install magrittr package:

Pipe opertor in R 4.1+ |>, using this is preferable

#install.packages("magrittr") # for pipe %>% operator
library(magrittr)

head(gapminder, n = 4)

A tibble: 4 × 6
country	continent	year	lifeExp	pop	gdpPercap
<fct>	<fct>	<int>	<dbl>	<int>	<dbl>
Afghanistan	Asia	1952	28.801	8425333	779.4453
Afghanistan	Asia	1957	30.332	9240934	820.8530
Afghanistan	Asia	1962	31.997	10267083	853.1007
Afghanistan	Asia	1967	34.020	11537966	836.1971

head() function with pipe and top 4 items:

gapminder %>% head(4)

A tibble: 4 × 6
country	continent	year	lifeExp	pop	gdpPercap
<fct>	<fct>	<int>	<dbl>	<int>	<dbl>
Afghanistan	Asia	1952	28.801	8425333	779.4453
Afghanistan	Asia	1957	30.332	9240934	820.8530
Afghanistan	Asia	1962	31.997	10267083	853.1007
Afghanistan	Asia	1967	34.020	11537966	836.1971

Output is the same. So, let’s rewrire filtering for Austria with pipe:

austria <- gapminder |> filter(country == "Austria")
austria

A tibble: 12 × 6
country	continent	year	lifeExp	pop	gdpPercap
<fct>	<fct>	<int>	<dbl>	<int>	<dbl>
Austria	Europe	1952	66.800	6927772	6137.076
Austria	Europe	1957	67.480	6965860	8842.598
Austria	Europe	1962	69.540	7129864	10750.721
Austria	Europe	1967	70.140	7376998	12834.602
Austria	Europe	1972	70.630	7544201	16661.626
Austria	Europe	1977	72.170	7568430	19749.422
Austria	Europe	1982	73.180	7574613	21597.084
Austria	Europe	1987	74.940	7578903	23687.826
Austria	Europe	1992	76.040	7914969	27042.019
Austria	Europe	1997	77.510	8069876	29095.921
Austria	Europe	2002	78.980	8148312	32417.608
Austria	Europe	2007	79.829	8199783	36126.493

# add more conditions in filter
austria <- gapminder |> filter(country == "Austria", year > 2000)
austria

A tibble: 2 × 6
country	continent	year	lifeExp	pop	gdpPercap
<fct>	<fct>	<int>	<dbl>	<int>	<dbl>
Austria	Europe	2002	78.980	8148312	32417.61
Austria	Europe	2007	79.829	8199783	36126.49

11.4 `select()` function

Use select() to subset the data on variables/columns by names or index. You also can define order of columns with select().

gapminder |> 
select(year, country, pop) |>
slice(1: 10)

A tibble: 10 × 3
year	country	pop
<int>	<fct>	<int>
1952	Afghanistan	8425333
1957	Afghanistan	9240934
1962	Afghanistan	10267083
1967	Afghanistan	11537966
1972	Afghanistan	13079460
1977	Afghanistan	14880372
1982	Afghanistan	12881816
1987	Afghanistan	13867957
1992	Afghanistan	16317921
1997	Afghanistan	22227415

Lets combine few functions with pipe (%>%):

Finally, lest extend our filtering:

# compare dplyr syntax with base R call
gapminder[gapminder$country == "Austria", c("year", "pop", "lifeExp")]

gapminder |> 
    filter(country == "Austria") |>
    select(year, pop, lifeExp)

A tibble: 12 × 3
year	pop	lifeExp
<int>	<int>	<dbl>
1952	6927772	66.800
1957	6965860	67.480
1962	7129864	69.540
1967	7376998	70.140
1972	7544201	70.630
1977	7568430	72.170
1982	7574613	73.180
1987	7578903	74.940
1992	7914969	76.040
1997	8069876	77.510
2002	8148312	78.980
2007	8199783	79.829

A tibble: 12 × 3
year	pop	lifeExp
<int>	<int>	<dbl>
1952	6927772	66.800
1957	6965860	67.480
1962	7129864	69.540
1967	7376998	70.140
1972	7544201	70.630
1977	7568430	72.170
1982	7574613	73.180
1987	7578903	74.940
1992	7914969	76.040
1997	8069876	77.510
2002	8148312	78.980
2007	8199783	79.829

You can remove some columns using minus(operator) and add few filter conditions:

austria <- gapminder |> 
                filter(country == "Austria", year > 2000) |>
                select(-continent, -gdpPercap) |>
                head()
austria

A tibble: 2 × 4
country	year	lifeExp	pop
<fct>	<int>	<dbl>	<int>
Austria	2002	78.980	8148312
Austria	2007	79.829	8199783

You can insert different conditions about columns you need to select.

gapminder |>
    select(!where(is.numeric)) |>  # its 1704 records, because of repeating some records
    slice(1:5)

A tibble: 5 × 2
country	continent
<fct>	<fct>
Afghanistan	Asia
Afghanistan	Asia
Afghanistan	Asia
Afghanistan	Asia
Afghanistan	Asia

Let’s output all unique pairs continent -> country with distinct() function:

gapminder |>
    select(country) |>
    distinct() # its 142 records now

A tibble: 142 × 1
country
<fct>
Afghanistan
Albania
Algeria
Angola
Argentina
Australia
Austria
Bahrain
Bangladesh
Belgium
Benin
Bolivia
Bosnia and Herzegovina
Botswana
Brazil
Bulgaria
Burkina Faso
Burundi
Cambodia
Cameroon
Canada
Central African Republic
Chad
Chile
China
Colombia
Comoros
Congo, Dem. Rep.
Congo, Rep.
Costa Rica
⋮
Sierra Leone
Singapore
Slovak Republic
Slovenia
Somalia
South Africa
Spain
Sri Lanka
Sudan
Swaziland
Sweden
Switzerland
Syria
Taiwan
Tanzania
Thailand
Togo
Trinidad and Tobago
Tunisia
Turkey
Uganda
United Kingdom
United States
Uruguay
Venezuela
Vietnam
West Bank and Gaza
Yemen, Rep.
Zambia
Zimbabwe

11.5 Selecting random \(N\) rows

The sample_n() function selects random rows from a data frame

gapminder |> sample_n(5)

A tibble: 5 × 6
country	continent	year	lifeExp	pop	gdpPercap
<fct>	<fct>	<int>	<dbl>	<int>	<dbl>
Norway	Europe	1967	74.080	3786019	16361.8765
Central African Republic	Africa	2002	43.308	4048013	738.6906
Uruguay	Americas	2007	76.384	3447496	10611.4630
Togo	Africa	1997	58.390	4320890	982.2869
Paraguay	Americas	2002	70.755	5884491	3783.6742

If you want make pseudo-random generation reprodusable use set.seed(). Seed is start point of random generation. Different seeds give different output.

set.seed(2023) # example, seed = 2023

The sample_frac() function selects random fraction rows from a data frame. Let’s select \(1\%\) of data

set.seed(2023) # output not changing, uncomment it 
gapminder %>% sample_frac(0.1)

A tibble: 170 × 6
country	continent	year	lifeExp	pop	gdpPercap
<fct>	<fct>	<int>	<dbl>	<int>	<dbl>
Switzerland	Europe	2007	81.701	7554661	37506.4191
Djibouti	Africa	2002	53.373	447416	1908.2609
Slovenia	Europe	1972	69.820	1694510	12383.4862
Sao Tome and Principe	Africa	1997	63.306	145608	1339.0760
Turkey	Europe	1987	63.108	52881328	5089.0437
Lebanon	Asia	1957	59.489	1647412	6089.7869
Eritrea	Africa	1972	44.142	2260187	514.3242
Philippines	Asia	1972	58.065	40850141	1989.3741
Tunisia	Africa	1972	55.602	5303507	2753.2860
Uganda	Africa	1952	39.978	5824797	734.7535
Oman	Asia	1972	52.143	829050	10618.0385
Australia	Oceania	2007	81.235	20434176	34435.3674
Mali	Africa	2002	51.818	10580176	951.4098
Equatorial Guinea	Africa	1957	35.983	232922	426.0964
South Africa	Africa	1982	58.161	31140029	8568.2662
Burundi	Africa	1962	42.045	2961915	355.2032
Angola	Africa	1992	40.647	8735988	2627.8457
Yemen, Rep.	Asia	1952	32.548	4963829	781.7176
Croatia	Europe	2007	75.748	4493312	14619.2227
Oman	Asia	1992	71.197	1915208	18616.7069
Thailand	Asia	1962	56.061	29263397	1002.1992
Comoros	Africa	1952	40.715	153936	1102.9909
Eritrea	Africa	1957	38.047	1542611	344.1619
Zambia	Africa	2002	39.193	10595811	1071.6139
Cote d'Ivoire	Africa	1987	54.655	10761098	2156.9561
South Africa	Africa	1957	47.985	16151549	5487.1042
Paraguay	Americas	1957	63.196	1770902	2046.1547
Kuwait	Asia	1952	55.565	160000	108382.3529
Brazil	Americas	1952	50.917	56602560	2108.9444
Canada	Americas	1957	69.960	17010154	12489.9501
⋮	⋮	⋮	⋮	⋮	⋮
Swaziland	Africa	1997	54.289	1054486	3876.7685
Myanmar	Asia	2002	59.908	45598081	611.0000
Sao Tome and Principe	Africa	1987	61.728	110812	1516.5255
Ghana	Africa	1977	51.756	10538093	993.2240
Guinea-Bissau	Africa	1997	44.873	1193708	796.6645
Guinea	Africa	1992	48.576	6990574	794.3484
Haiti	Americas	1957	40.696	3507701	1726.8879
Sao Tome and Principe	Africa	2007	65.528	199579	1598.4351
Comoros	Africa	1997	60.660	527982	1173.6182
Equatorial Guinea	Africa	1972	40.516	277603	672.4123
Oman	Asia	1982	62.728	1301048	12954.7910
Namibia	Africa	1977	56.437	977026	3876.4860
Congo, Dem. Rep.	Africa	1952	39.143	14100005	780.5423
Hong Kong, China	Asia	1977	73.600	4583700	11186.1413
Bolivia	Americas	1997	62.050	7693188	3326.1432
Panama	Americas	2002	74.712	2990875	7356.0319
Nigeria	Africa	1952	36.324	33119096	1077.2819
Malaysia	Asia	2007	74.241	24821286	12451.6558
Japan	Asia	1952	63.030	86459025	3216.9563
Albania	Europe	1967	66.220	1984060	2760.1969
Portugal	Europe	1997	75.970	10156415	17641.0316
Uruguay	Americas	1952	66.071	2252965	5716.7667
Afghanistan	Asia	1972	36.088	13079460	739.9811
Syria	Asia	1987	66.974	11242847	3116.7743
Libya	Africa	2002	72.737	5368585	9534.6775
Mauritania	Africa	1962	44.248	1146757	1055.8960
Trinidad and Tobago	Americas	1992	69.862	1183669	7370.9909
Netherlands	Europe	1962	73.230	11805689	12790.8496
Reunion	Africa	2007	76.442	798094	7670.1226
Honduras	Americas	1957	44.665	1770390	2220.4877

11.6 Refences

dplyr: A Grammar of Data Manipulation on https://cran.r-project.org/.
Data Transformation with splyr::cheat sheet.
DPLYR TUTORIAL : DATA MANIPULATION (50 EXAMPLES) by Deepanshu Bhalla.
Dplyr Intro by Stat 545. 6.R Dplyr Tutorial: Data Manipulation(Join) & Cleaning(Spread). Introduction to Data Analysis
Loan Default Prediction. Beginners data set for financial analytics Kaggle

11.1 Basic funtions and dataset explore

11.2 filter() function

11.3 Pipe (%>%/|>) operator

11.4 select() function

11.5 Selecting random \(N\) rows

11.6 Refences

11.2 `filter()` function

11.3 Pipe (`%>%`/`|>`) operator

11.4 `select()` function