9 Web-pages (HTML)

Автор

Юрій Клебан

Sometimes decision making needs scrap data from web sources and pages.

Let’s try to parse data from Wikipedia as table.

#install.packages("rvest")
library(rvest) # Parsing of HTML/XML files

Go to web page https://en.wikipedia.org/wiki/List_of_largest_banks and check it.

# fix URL
url <- "https://en.wikipedia.org/wiki/List_of_largest_banks"
#url <- "data/List of largest banks - Wikipedia_.html"

# read html content of the page
page <- read_html(url)
page

{html_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...

# read all yables on page
tables <- html_nodes(page, "table")
tables

{xml_nodeset (4)}
[1] <table class="box-Missing_information plainlinks metadata ambox ambox-con ...
[2] <table class="wikitable sortable mw-collapsible"><tbody>\n<tr>\n<th data- ...
[3] <table class="wikitable sortable mw-collapsible">\n<caption>Number of ban ...
[4] <table class="wikitable sortable mw-collapsible"><tbody>\n<tr>\n<th data- ...

For now, let’s read a table of Total Assets in US Billion

# with pipe operator
#tables[2] %>% 
 #   html_table(fill = TRUE) %>% 
 #   as.data.frame()
#without pipe operator
assets_table <- as.data.frame(html_table(tables[2], fill = TRUE))   
head(assets_table)

A data.frame: 6 × 3
	Rank	Bank.name	Total.assets.2020..US..billion.
	<int>	<chr>	<chr>
1	1	Industrial and Commercial Bank of China	5,518.00
2	2	China Construction Bank	4,400.00
3	3	Agricultural Bank of China	4,300.00
4	4	Bank of China	4,200.00
5	5	JPMorgan Chase	3,831.65
6	6	Mitsubishi UFJ Financial Group	3,175.21

Next is reading data of market capitalization table (4th):

capital_table <- as.data.frame(html_table(tables[4], fill = TRUE))   
head(capital_table)

A data.frame: 6 × 3
	Rank	Bank.name	Market.cap.US..billion.
	<int>	<chr>	<dbl>
1	1	JPMorgan Chase	368.78
2	2	Industrial and Commercial Bank of China	295.65
3	3	Bank of America	279.73
4	4	Wells Fargo	214.34
5	5	China Construction Bank	207.98
6	6	Agricultural Bank of China	181.49

And now let’s merge() this two datasets:

merged_data <- merge(assets_table, capital_table, by = "Bank.name")
head(merged_data)

A data.frame: 6 × 5
	Bank.name	Rank.x	Total.assets.2020..US..billion.	Rank.y	Market.cap.US..billion.
	<chr>	<int>	<chr>	<int>	<dbl>
1	Agricultural Bank of China	3	4,300.00	6	181.49
2	Australia and New Zealand Banking Group	48	661.72	26	54.88
3	Banco Bilbao Vizcaya Argentaria	42	782.16	37	37.42
4	Banco Bradesco	79	345.21	18	74.67
5	Banco Santander	16	1,702.61	17	75.47
6	Bank of America	8	2,434.08	3	279.73

9.1 Task 3

From a page https://en.wikipedia.org/wiki/List_of_largest_banks read and merge by country named tables:

Number of banks in the top 100 by total assets
Total market capital (US$ billion) across the top 70 banks by country

Solution

library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_largest_banks" # got to url in other tab
#url <- "data/List of largest banks - Wikipedia_.html"
page_data <- read_html(url) # read html content

tables <- html_nodes(page_data, "table")
html_table(tables[1]) #its not needed table

A tibble: 1 2
X1	X2
<lgl>	<chr>
NA	This article is missing information about Revenue and Employment. Please expand the article to include this information. Further details may exist on the talk page. (September 2020)

html_table(tables[3]) # thats solution for "Number of banks in the top 100 by total assets"
#check the end of table. There are NA record
# lets remove it

A tibble: 26 3
Rank	Country	Number
<int>	<chr>	<int>
1	China	19
2	United States	11
3	Japan	8
4	United Kingdom	6
4	France	6
4	South Korea	6
5	Canada	5
5	Germany	5
6	Australia	4
6	Brazil	4
6	Spain	4
7	Netherlands	3
7	Singapore	3
7	Sweden	3
7	Switzerland	3
8	Italy	2
9	India	1
9	Austria	1
9	Belgium	1
9	Denmark	1
9	Finland	1
9	Norway	1
9	Russia	1
9	Qatar	1
9	NA	NA
9	NA	NA

table1 <- as.data.frame(html_table(tables[3]))
table1 <- table1[!is.na(table1$Country), ]
table1 # now it OK!

A data.frame: 24 × 3
	Rank	Country	Number
	<int>	<chr>	<int>
1	1	China	19
2	2	United States	11
3	3	Japan	8
4	4	United Kingdom	6
5	4	France	6
6	4	South Korea	6
7	5	Canada	5
8	5	Germany	5
9	6	Australia	4
10	6	Brazil	4
11	6	Spain	4
12	7	Netherlands	3
13	7	Singapore	3
14	7	Sweden	3
15	7	Switzerland	3
16	8	Italy	2
17	9	India	1
18	9	Austria	1
19	9	Belgium	1
20	9	Denmark	1
21	9	Finland	1
22	9	Norway	1
23	9	Russia	1
24	9	Qatar	1

# SOlution for "Total market capital (US$ billion) across the top 70 banks by country"
# compare this with table on a given page
table2 <- as.data.frame(html_table(tables[4]))
table2 # now it OK!

A data.frame: 50 × 3
Rank	Bank.name	Market.cap.US..billion.
<int>	<chr>	<dbl>
1	JPMorgan Chase	368.78
2	Industrial and Commercial Bank of China	295.65
3	Bank of America	279.73
4	Wells Fargo	214.34
5	China Construction Bank	207.98
6	Agricultural Bank of China	181.49
7	HSBC Holdings PLC	169.47
8	Citigroup Inc.	163.58
9	Bank of China	151.15
10	China Merchants Bank	133.37
11	Royal Bank of Canada	113.80
12	Toronto-Dominion Bank	106.61
13	Commonwealth Bank	99.77
14	HDFC Bank	105.90
15	U.S. Bancorp	84.40
16	Goldman Sachs	78.70
17	Banco Santander	75.47
18	Banco Bradesco	74.67
19	Morgan Stanley	73.93
20	Westpac	67.84
21	Mitsubishi UFJ Financial Group	66.20
22	Scotiabank	65.48
23	PNC Financial Services	63.11
24	Bank of Communications	61.85
25	BNP Paribas	59.36
26	Australia and New Zealand Banking Group	54.88
27	National Australia Bank	51.68
28	Lloyds Banking Group	51.19
29	Sumitomo Mitsui Financial Group	49.85
30	Bank of Montreal	48.12
31	UBS	45.92
32	ING Group	44.97
33	Capital One	43.22
34	The Bank of New York Mellon	42.58
35	China Minsheng Bank	39.13
36	China CITIC Bank	38.55
37	Banco Bilbao Vizcaya Argentaria	37.42
38	Mizuho Financial Group	36.95
39	Intesa Sanpaolo	36.90
40	Credit Agricole	34.89
41	Canadian Imperial Bank of Commerce	34.87
42	Royal Bank of Scotland	33.95
43	Barclays	33.26
44	Credit Suisse	30.75
45	Nordea	29.59
46	Standard Chartered	29.37
47	KBC Bank	27.40
48	UniCredit	26.88
49	Societe Generale	21.27
50	Deutsche Bank	15.77

9.2 Набори даних

https://github.com/kleban/r-book-published/tree/main/datasets/telecom_users.csv
https://github.com/kleban/r-book-published/tree/main/datasets/telecom_sers.xlsx
https://github.com/kleban/r-book-published/tree/main/datasets/Default_Fin.csv
https://github.com/kleban/r-book-published/tree/main/datasets/employes.xml

9.3 References

SQLite in R. Datacamp
Tidyverse googlesheets4 0.2.0
Binanace spot Api Docs
Web Scraping in R: rvest Tutorial by Arvid Kingl