9 Web-pages (HTML)
Sometimes decision making needs scrap data from web sources and pages.
Let’s try to parse data from Wikipedia
as table.
Go to web page https://en.wikipedia.org/wiki/List_of_largest_banks and check it.
{html_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
{xml_nodeset (4)}
[1] <table class="box-Missing_information plainlinks metadata ambox ambox-con ...
[2] <table class="wikitable sortable mw-collapsible"><tbody>\n<tr>\n<th data- ...
[3] <table class="wikitable sortable mw-collapsible">\n<caption>Number of ban ...
[4] <table class="wikitable sortable mw-collapsible"><tbody>\n<tr>\n<th data- ...
For now, let’s read a table of Total Assets in US Billion
# with pipe operator
#tables[2] %>%
# html_table(fill = TRUE) %>%
# as.data.frame()
#without pipe operator
assets_table <- as.data.frame(html_table(tables[2], fill = TRUE))
head(assets_table)
Rank | Bank.name | Total.assets.2020..US..billion. | |
---|---|---|---|
<int> | <chr> | <chr> | |
1 | 1 | Industrial and Commercial Bank of China | 5,518.00 |
2 | 2 | China Construction Bank | 4,400.00 |
3 | 3 | Agricultural Bank of China | 4,300.00 |
4 | 4 | Bank of China | 4,200.00 |
5 | 5 | JPMorgan Chase | 3,831.65 |
6 | 6 | Mitsubishi UFJ Financial Group | 3,175.21 |
Next is reading data of market capitalization table (4th):
Rank | Bank.name | Market.cap.US..billion. | |
---|---|---|---|
<int> | <chr> | <dbl> | |
1 | 1 | JPMorgan Chase | 368.78 |
2 | 2 | Industrial and Commercial Bank of China | 295.65 |
3 | 3 | Bank of America | 279.73 |
4 | 4 | Wells Fargo | 214.34 |
5 | 5 | China Construction Bank | 207.98 |
6 | 6 | Agricultural Bank of China | 181.49 |
And now let’s merge()
this two datasets:
Bank.name | Rank.x | Total.assets.2020..US..billion. | Rank.y | Market.cap.US..billion. | |
---|---|---|---|---|---|
<chr> | <int> | <chr> | <int> | <dbl> | |
1 | Agricultural Bank of China | 3 | 4,300.00 | 6 | 181.49 |
2 | Australia and New Zealand Banking Group | 48 | 661.72 | 26 | 54.88 |
3 | Banco Bilbao Vizcaya Argentaria | 42 | 782.16 | 37 | 37.42 |
4 | Banco Bradesco | 79 | 345.21 | 18 | 74.67 |
5 | Banco Santander | 16 | 1,702.61 | 17 | 75.47 |
6 | Bank of America | 8 | 2,434.08 | 3 | 279.73 |
9.1 Task 3
From a page https://en.wikipedia.org/wiki/List_of_largest_banks read and merge by country
named tables:
Solution
library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_largest_banks" # got to url in other tab
#url <- "data/List of largest banks - Wikipedia_.html"
page_data <- read_html(url) # read html content
tables <- html_nodes(page_data, "table")
html_table(tables[1]) #its not needed table
-
A tibble: 1 2 X1 X2 <lgl> <chr> NA This article is missing information about Revenue and Employment. Please expand the article to include this information. Further details may exist on the talk page. (September 2020)
html_table(tables[3]) # thats solution for "Number of banks in the top 100 by total assets"
#check the end of table. There are NA record
# lets remove it
-
A tibble: 26 3 Rank Country Number <int> <chr> <int> 1 China 19 2 United States 11 3 Japan 8 4 United Kingdom 6 4 France 6 4 South Korea 6 5 Canada 5 5 Germany 5 6 Australia 4 6 Brazil 4 6 Spain 4 7 Netherlands 3 7 Singapore 3 7 Sweden 3 7 Switzerland 3 8 Italy 2 9 India 1 9 Austria 1 9 Belgium 1 9 Denmark 1 9 Finland 1 9 Norway 1 9 Russia 1 9 Qatar 1 9 NA NA 9 NA NA
table1 <- as.data.frame(html_table(tables[3]))
table1 <- table1[!is.na(table1$Country), ]
table1 # now it OK!
Rank | Country | Number | |
---|---|---|---|
<int> | <chr> | <int> | |
1 | 1 | China | 19 |
2 | 2 | United States | 11 |
3 | 3 | Japan | 8 |
4 | 4 | United Kingdom | 6 |
5 | 4 | France | 6 |
6 | 4 | South Korea | 6 |
7 | 5 | Canada | 5 |
8 | 5 | Germany | 5 |
9 | 6 | Australia | 4 |
10 | 6 | Brazil | 4 |
11 | 6 | Spain | 4 |
12 | 7 | Netherlands | 3 |
13 | 7 | Singapore | 3 |
14 | 7 | Sweden | 3 |
15 | 7 | Switzerland | 3 |
16 | 8 | Italy | 2 |
17 | 9 | India | 1 |
18 | 9 | Austria | 1 |
19 | 9 | Belgium | 1 |
20 | 9 | Denmark | 1 |
21 | 9 | Finland | 1 |
22 | 9 | Norway | 1 |
23 | 9 | Russia | 1 |
24 | 9 | Qatar | 1 |
# SOlution for "Total market capital (US$ billion) across the top 70 banks by country"
# compare this with table on a given page
table2 <- as.data.frame(html_table(tables[4]))
table2 # now it OK!
Rank | Bank.name | Market.cap.US..billion. |
---|---|---|
<int> | <chr> | <dbl> |
1 | JPMorgan Chase | 368.78 |
2 | Industrial and Commercial Bank of China | 295.65 |
3 | Bank of America | 279.73 |
4 | Wells Fargo | 214.34 |
5 | China Construction Bank | 207.98 |
6 | Agricultural Bank of China | 181.49 |
7 | HSBC Holdings PLC | 169.47 |
8 | Citigroup Inc. | 163.58 |
9 | Bank of China | 151.15 |
10 | China Merchants Bank | 133.37 |
11 | Royal Bank of Canada | 113.80 |
12 | Toronto-Dominion Bank | 106.61 |
13 | Commonwealth Bank | 99.77 |
14 | HDFC Bank | 105.90 |
15 | U.S. Bancorp | 84.40 |
16 | Goldman Sachs | 78.70 |
17 | Banco Santander | 75.47 |
18 | Banco Bradesco | 74.67 |
19 | Morgan Stanley | 73.93 |
20 | Westpac | 67.84 |
21 | Mitsubishi UFJ Financial Group | 66.20 |
22 | Scotiabank | 65.48 |
23 | PNC Financial Services | 63.11 |
24 | Bank of Communications | 61.85 |
25 | BNP Paribas | 59.36 |
26 | Australia and New Zealand Banking Group | 54.88 |
27 | National Australia Bank | 51.68 |
28 | Lloyds Banking Group | 51.19 |
29 | Sumitomo Mitsui Financial Group | 49.85 |
30 | Bank of Montreal | 48.12 |
31 | UBS | 45.92 |
32 | ING Group | 44.97 |
33 | Capital One | 43.22 |
34 | The Bank of New York Mellon | 42.58 |
35 | China Minsheng Bank | 39.13 |
36 | China CITIC Bank | 38.55 |
37 | Banco Bilbao Vizcaya Argentaria | 37.42 |
38 | Mizuho Financial Group | 36.95 |
39 | Intesa Sanpaolo | 36.90 |
40 | Credit Agricole | 34.89 |
41 | Canadian Imperial Bank of Commerce | 34.87 |
42 | Royal Bank of Scotland | 33.95 |
43 | Barclays | 33.26 |
44 | Credit Suisse | 30.75 |
45 | Nordea | 29.59 |
46 | Standard Chartered | 29.37 |
47 | KBC Bank | 27.40 |
48 | UniCredit | 26.88 |
49 | Societe Generale | 21.27 |
50 | Deutsche Bank | 15.77 |
9.2 Набори даних
- https://github.com/kleban/r-book-published/tree/main/datasets/telecom_users.csv
- https://github.com/kleban/r-book-published/tree/main/datasets/telecom_sers.xlsx
- https://github.com/kleban/r-book-published/tree/main/datasets/Default_Fin.csv
- https://github.com/kleban/r-book-published/tree/main/datasets/employes.xml