9  Web-pages (HTML)

Автор

Юрій Клебан

Sometimes decision making needs scrap data from web sources and pages.

Let’s try to parse data from Wikipedia as table.

#install.packages("rvest")
library(rvest) # Parsing of HTML/XML files

Go to web page https://en.wikipedia.org/wiki/List_of_largest_banks and check it.

# fix URL
url <- "https://en.wikipedia.org/wiki/List_of_largest_banks"
#url <- "data/List of largest banks - Wikipedia_.html"
# read html content of the page
page <- read_html(url)
page
{html_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...
# read all yables on page
tables <- html_nodes(page, "table")
tables
{xml_nodeset (4)}
[1] <table class="box-Missing_information plainlinks metadata ambox ambox-con ...
[2] <table class="wikitable sortable mw-collapsible"><tbody>\n<tr>\n<th data- ...
[3] <table class="wikitable sortable mw-collapsible">\n<caption>Number of ban ...
[4] <table class="wikitable sortable mw-collapsible"><tbody>\n<tr>\n<th data- ...

For now, let’s read a table of Total Assets in US Billion

# with pipe operator
#tables[2] %>% 
 #   html_table(fill = TRUE) %>% 
 #   as.data.frame()
#without pipe operator
assets_table <- as.data.frame(html_table(tables[2], fill = TRUE))   
head(assets_table)
A data.frame: 6 × 3
Rank Bank.name Total.assets.2020..US..billion.
<int> <chr> <chr>
1 1 Industrial and Commercial Bank of China 5,518.00
2 2 China Construction Bank 4,400.00
3 3 Agricultural Bank of China 4,300.00
4 4 Bank of China 4,200.00
5 5 JPMorgan Chase 3,831.65
6 6 Mitsubishi UFJ Financial Group 3,175.21

Next is reading data of market capitalization table (4th):

capital_table <- as.data.frame(html_table(tables[4], fill = TRUE))   
head(capital_table)
A data.frame: 6 × 3
Rank Bank.name Market.cap.US..billion.
<int> <chr> <dbl>
1 1 JPMorgan Chase 368.78
2 2 Industrial and Commercial Bank of China 295.65
3 3 Bank of America 279.73
4 4 Wells Fargo 214.34
5 5 China Construction Bank 207.98
6 6 Agricultural Bank of China 181.49

And now let’s merge() this two datasets:

merged_data <- merge(assets_table, capital_table, by = "Bank.name")
head(merged_data)
A data.frame: 6 × 5
Bank.name Rank.x Total.assets.2020..US..billion. Rank.y Market.cap.US..billion.
<chr> <int> <chr> <int> <dbl>
1 Agricultural Bank of China 3 4,300.00 6 181.49
2 Australia and New Zealand Banking Group 48 661.72 26 54.88
3 Banco Bilbao Vizcaya Argentaria 42 782.16 37 37.42
4 Banco Bradesco 79 345.21 18 74.67
5 Banco Santander 16 1,702.61 17 75.47
6 Bank of America 8 2,434.08 3 279.73

9.1 Task 3

From a page https://en.wikipedia.org/wiki/List_of_largest_banks read and merge by country named tables:

Solution

library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_largest_banks" # got to url in other tab
#url <- "data/List of largest banks - Wikipedia_.html"
page_data <- read_html(url) # read html content

tables <- html_nodes(page_data, "table")
html_table(tables[1]) #its not needed table
  1. A tibble: 1 2
    X1 X2
    <lgl> <chr>
    NA This article is missing information about Revenue and Employment. Please expand the article to include this information. Further details may exist on the talk page. (September 2020)
html_table(tables[3]) # thats solution for "Number of banks in the top 100 by total assets"
#check the end of table. There are NA record
# lets remove it
  1. A tibble: 26 3
    Rank Country Number
    <int> <chr> <int>
    1 China 19
    2 United States 11
    3 Japan 8
    4 United Kingdom 6
    4 France 6
    4 South Korea 6
    5 Canada 5
    5 Germany 5
    6 Australia 4
    6 Brazil 4
    6 Spain 4
    7 Netherlands 3
    7 Singapore 3
    7 Sweden 3
    7 Switzerland 3
    8 Italy 2
    9 India 1
    9 Austria 1
    9 Belgium 1
    9 Denmark 1
    9 Finland 1
    9 Norway 1
    9 Russia 1
    9 Qatar 1
    9 NA NA
    9 NA NA
table1 <- as.data.frame(html_table(tables[3]))
table1 <- table1[!is.na(table1$Country), ]
table1 # now it OK!
A data.frame: 24 × 3
Rank Country Number
<int> <chr> <int>
1 1 China 19
2 2 United States 11
3 3 Japan 8
4 4 United Kingdom 6
5 4 France 6
6 4 South Korea 6
7 5 Canada 5
8 5 Germany 5
9 6 Australia 4
10 6 Brazil 4
11 6 Spain 4
12 7 Netherlands 3
13 7 Singapore 3
14 7 Sweden 3
15 7 Switzerland 3
16 8 Italy 2
17 9 India 1
18 9 Austria 1
19 9 Belgium 1
20 9 Denmark 1
21 9 Finland 1
22 9 Norway 1
23 9 Russia 1
24 9 Qatar 1
# SOlution for "Total market capital (US$ billion) across the top 70 banks by country"
# compare this with table on a given page
table2 <- as.data.frame(html_table(tables[4]))
table2 # now it OK!
A data.frame: 50 × 3
Rank Bank.name Market.cap.US..billion.
<int> <chr> <dbl>
1 JPMorgan Chase 368.78
2 Industrial and Commercial Bank of China 295.65
3 Bank of America 279.73
4 Wells Fargo 214.34
5 China Construction Bank 207.98
6 Agricultural Bank of China 181.49
7 HSBC Holdings PLC 169.47
8 Citigroup Inc. 163.58
9 Bank of China 151.15
10 China Merchants Bank 133.37
11 Royal Bank of Canada 113.80
12 Toronto-Dominion Bank 106.61
13 Commonwealth Bank 99.77
14 HDFC Bank 105.90
15 U.S. Bancorp 84.40
16 Goldman Sachs 78.70
17 Banco Santander 75.47
18 Banco Bradesco 74.67
19 Morgan Stanley 73.93
20 Westpac 67.84
21 Mitsubishi UFJ Financial Group 66.20
22 Scotiabank 65.48
23 PNC Financial Services 63.11
24 Bank of Communications 61.85
25 BNP Paribas 59.36
26 Australia and New Zealand Banking Group 54.88
27 National Australia Bank 51.68
28 Lloyds Banking Group 51.19
29 Sumitomo Mitsui Financial Group 49.85
30 Bank of Montreal 48.12
31 UBS 45.92
32 ING Group 44.97
33 Capital One 43.22
34 The Bank of New York Mellon 42.58
35 China Minsheng Bank 39.13
36 China CITIC Bank 38.55
37 Banco Bilbao Vizcaya Argentaria 37.42
38 Mizuho Financial Group 36.95
39 Intesa Sanpaolo 36.90
40 Credit Agricole 34.89
41 Canadian Imperial Bank of Commerce 34.87
42 Royal Bank of Scotland 33.95
43 Barclays 33.26
44 Credit Suisse 30.75
45 Nordea 29.59
46 Standard Chartered 29.37
47 KBC Bank 27.40
48 UniCredit 26.88
49 Societe Generale 21.27
50 Deutsche Bank 15.77

9.2 Набори даних

  1. https://github.com/kleban/r-book-published/tree/main/datasets/telecom_users.csv
  2. https://github.com/kleban/r-book-published/tree/main/datasets/telecom_sers.xlsx
  3. https://github.com/kleban/r-book-published/tree/main/datasets/Default_Fin.csv
  4. https://github.com/kleban/r-book-published/tree/main/datasets/employes.xml

9.3 References

  1. SQLite in R. Datacamp
  2. Tidyverse googlesheets4 0.2.0
  3. Binanace spot Api Docs
  4. Web Scraping in R: rvest Tutorial by Arvid Kingl