19 Wide-to-long tables
author: Юрій Клебан
Before start load packages
Some times your data is not in tidy format. Peole can collect data year by year in each column. It’s problem to use such data for feature engeniering and building prediction models. Let’s generate such data sample (quaterly salary of some people).
not_good_data <- data.frame(Name = c("Nick", "Jake", "Anna", "Jane", "Dina"),
q1_2021 = c(12442, 22131, 21343, 22111, 14123),
q2_2021 = c(13442, 22871, 20343, 22222, 14456),
q3_2021 = c(15482, 22031, 22456, 22444, 14533),
q4_2021 = c(14511, 20031, 21741, 22333, 14511))
not_good_data
Name | q1_2021 | q2_2021 | q3_2021 | q4_2021 |
---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <dbl> |
Nick | 12442 | 13442 | 15482 | 14511 |
Jake | 22131 | 22871 | 22031 | 20031 |
Anna | 21343 | 20343 | 22456 | 21741 |
Jane | 22111 | 22222 | 22444 | 22333 |
Dina | 14123 | 14456 | 14533 | 14511 |
better_data <- not_good_data %>%
gather(quater, salary, 2:5)
# gather(quater, salary, q1_2021:q4_2021) possible code too
better_data
Name | quater | salary |
---|---|---|
<chr> | <chr> | <dbl> |
Nick | q1_2021 | 12442 |
Jake | q1_2021 | 22131 |
Anna | q1_2021 | 21343 |
Jane | q1_2021 | 22111 |
Dina | q1_2021 | 14123 |
Nick | q2_2021 | 13442 |
Jake | q2_2021 | 22871 |
Anna | q2_2021 | 20343 |
Jane | q2_2021 | 22222 |
Dina | q2_2021 | 14456 |
Nick | q3_2021 | 15482 |
Jake | q3_2021 | 22031 |
Anna | q3_2021 | 22456 |
Jane | q3_2021 | 22444 |
Dina | q3_2021 | 14533 |
Nick | q4_2021 | 14511 |
Jake | q4_2021 | 20031 |
Anna | q4_2021 | 21741 |
Jane | q4_2021 | 22333 |
Dina | q4_2021 | 14511 |
To make our data tidier separate()
can split quater column into 2 (quater
and year
):
best_data <- better_data %>%
separate(quater, c("quater", "year"), sep = "_") %>% # separate
mutate(year = as.integer(year), # convert year to integer
quater = substr(better_data$quater, 2,2), # trim `q` from start
quater = as.integer(quater), # convert quater to integer
) %>%
head(10)
best_data
Name | quater | year | salary | |
---|---|---|---|---|
<chr> | <int> | <int> | <dbl> | |
1 | Nick | 1 | 2021 | 12442 |
2 | Jake | 1 | 2021 | 22131 |
3 | Anna | 1 | 2021 | 21343 |
4 | Jane | 1 | 2021 | 22111 |
5 | Dina | 1 | 2021 | 14123 |
6 | Nick | 2 | 2021 | 13442 |
7 | Jake | 2 | 2021 | 22871 |
8 | Anna | 2 | 2021 | 20343 |
9 | Jane | 2 | 2021 | 22222 |
10 | Dina | 2 | 2021 | 14456 |
The unite()
function concanates two columns into one:
Name | Qt | salary | |
---|---|---|---|
<chr> | <chr> | <dbl> | |
1 | Nick | 1#2021 | 12442 |
2 | Jake | 1#2021 | 22131 |
3 | Anna | 1#2021 | 21343 |
4 | Jane | 1#2021 | 22111 |
5 | Dina | 1#2021 | 14123 |
6 | Nick | 2#2021 | 13442 |
7 | Jake | 2#2021 | 22871 |
8 | Anna | 2#2021 | 20343 |
9 | Jane | 2#2021 | 22222 |
10 | Dina | 2#2021 | 14456 |
# if dont want remove old columns use remove param
united_data <- best_data %>%
unite(Qt, quater, year, sep = "#", remove = F)
united_data
Name | Qt | quater | year | salary | |
---|---|---|---|---|---|
<chr> | <chr> | <int> | <int> | <dbl> | |
1 | Nick | 1#2021 | 1 | 2021 | 12442 |
2 | Jake | 1#2021 | 1 | 2021 | 22131 |
3 | Anna | 1#2021 | 1 | 2021 | 21343 |
4 | Jane | 1#2021 | 1 | 2021 | 22111 |
5 | Dina | 1#2021 | 1 | 2021 | 14123 |
6 | Nick | 2#2021 | 2 | 2021 | 13442 |
7 | Jake | 2#2021 | 2 | 2021 | 22871 |
8 | Anna | 2#2021 | 2 | 2021 | 20343 |
9 | Jane | 2#2021 | 2 | 2021 | 22222 |
10 | Dina | 2#2021 | 2 | 2021 | 14456 |
If you need to make table like initial use spread()
function:
Name | q1_2021 | q2_2021 | q3_2021 | q4_2021 |
---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <dbl> |
Anna | 21343 | 20343 | 22456 | 21741 |
Dina | 14123 | 14456 | 14533 | 14511 |
Jake | 22131 | 22871 | 22031 | 20031 |
Jane | 22111 | 22222 | 22444 | 22333 |
Nick | 12442 | 13442 | 15482 | 14511 |
Let’s try to spread()
feild pop
of gapminder
by year:
gapminder %>% select(country, pop, year) %>%
spread(year, pop) %>%
head() # for shorter code
# now you can easy send data to your director in excel :)
country | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
<fct> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> |
Afghanistan | 8425333 | 9240934 | 10267083 | 11537966 | 13079460 | 14880372 | 12881816 | 13867957 | 16317921 | 22227415 | 25268405 | 31889923 |
Albania | 1282697 | 1476505 | 1728137 | 1984060 | 2263554 | 2509048 | 2780097 | 3075321 | 3326498 | 3428038 | 3508512 | 3600523 |
Algeria | 9279525 | 10270856 | 11000948 | 12760499 | 14760787 | 17152804 | 20033753 | 23254956 | 26298373 | 29072015 | 31287142 | 33333216 |
Angola | 4232095 | 4561361 | 4826015 | 5247469 | 5894858 | 6162675 | 7016384 | 7874230 | 8735988 | 9875024 | 10866106 | 12420476 |
Argentina | 17876956 | 19610538 | 21283783 | 22934225 | 24779799 | 26983828 | 29341374 | 31620918 | 33958947 | 36203463 | 38331121 | 40301927 |
Australia | 8691212 | 9712569 | 10794968 | 11872264 | 13177000 | 14074100 | 15184200 | 16257249 | 17481977 | 18565243 | 19546792 | 20434176 |
Functions table:
Function | Objectives | Arguments |
---|---|---|
gather() |
Transform the data from wide to long | (data, key, value, na.rm = FALSE) |
spread() |
Transform the data from long to wide | (data, key, value) |
separate() |
Split one variables into two | (data, col, into, sep= ““, remove = TRUE) |
unite() |
Unite two variables into one | (data, col, conc ,sep= ““, remove = TRUE) |
19.1 Refences
- dplyr: A Grammar of Data Manipulation on https://cran.r-project.org/.
- Data Transformation with splyr::cheat sheet.
- DPLYR TUTORIAL : DATA MANIPULATION (50 EXAMPLES) by Deepanshu Bhalla.
- Dplyr Intro by Stat 545. 6.R Dplyr Tutorial: Data Manipulation(Join) & Cleaning(Spread). Introduction to Data Analysis
- Loan Default Prediction. Beginners data set for financial analytics Kaggle