How to extract contents from multiple tables on a website with only month and year in the URL
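This is a follow-up to a question I asked previously here:
The page I am trying to extract data from (the data sits between div tags) is on this site: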
http://bigbashboard.com/rankings/batsmen
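This is a different page from the one in my previous question, although it is still the same site. The main difference is that the date in the URL appears only as year/month, like this: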
http://bigbashboard.com/rankings/batsmen/2020/10
whereas the pages in my previous question used year/month/day:
http://bigbashboard.com/rankings/bbl/batsmen/2020/01/08
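I still want to extract the same data from the left-hand side of the page, which appears between div tags, like this:
Batsmen
1 Lokesh Rahul 167
2 Ravija Sandaruwan 150
3 David Warner 143
I also need the data from the table on the right-hand side, bound together with the rankings and with the date the page refers to, so that the result looks like this:
Date    Rank  Name               Points  Dates                 I     R   HS    Ave      SR   4s  6s  100s  50s
Oct-20     1  Lokesh Rahul          167  Nov 2018 - Oct 2020  47  1910  132  50.26  141.38  171  76     2   17
Oct-20     2  Ravija Sandaruwan     150  Jan 2019 - Feb 2020  15   577  103  44.38  165.80   52  36     1    4
Oct-20     3  David Warner          143  Jan 2019 - Sep 2020  33  1475  100  61.46  138.89  128  39     2   16
I tried to use the code that was given as the solution in my previous post: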
library(rvest)
library(xml2)
library(dplyr)
library(furrr)

batsmen <- function(x) {
  x <- html_nodes(x, "div.cf.rankings-page div div ol li a")
  xml_remove(html_nodes(x, "span.rank small, span[class^='pos'] em"))
  score <- html_text(html_nodes(x, "span.rank"))
  rank <- html_text(html_nodes(x, "span[class^='pos']"), trim = TRUE)
  xml_remove(html_nodes(x, "span"))
  tibble(Rank = rank, Name = html_text(x), Points = score)
}

stats_table <- function(x) {
  as_tibble(html_table(x)[[1L]])
}

read_rankings <- function(url) {
  ymd <- as.Date(paste0(tail(strsplit(url, "/")[[1L]], 3L), collapse = "-"))
  read_html(url) %>% {bind_cols(Date = ymd, batsmen(.), stats_table(.))}
}

mas_url <- "http://bigbashboard.com/rankings/batsmen"
timeline <-
  read_html(mas_url) %>%
  html_nodes("div.timeline span a") %>%
  html_attr("href") %>%
  url_absolute(mas_url)

# Use parallel processing for speed.
plan(multiprocess)
future_map_dfr(timeline[1:100], read_rankings) # I only scrape a few links for test.
However, this produces an error:
Error in charToDate(x) :
  character string is not in a standard unambiguous format
I do not understand why this happens or how to fix it. I assume it is because the date format is different.
The code below works for all three cases:
library(rvest)
library(xml2)
library(dplyr)
library(furrr)
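
# Scrape the ranking list(s). A page may hold several lists (e.g. men and
# women); the name attribute of each anchor is used as the list's title.
batsmen <- function(x) {
  nms <- html_attr(html_nodes(x, "div.cf > a"), "name")
  x <- html_nodes(x, "div.cf.rankings-page")
  # Drop the <small> and <em> children of the score and position spans.
  xml_remove(html_nodes(x, "li span.rank small, li span[class^='pos'] em"))
  x <- Map(function(i, nm) {
    i <- html_nodes(i, "li a")
    score <- html_text(html_nodes(i, "span.rank"))
    rank <- html_text(html_nodes(i, "span[class^='pos']"), trim = TRUE)
    # Remove the spans so that only the player name is left as text.
    xml_remove(html_nodes(i, "span"))
    tibble(Title = nm, Rank = rank, Name = html_text(i), Points = score)
  }, x, nms)
  bind_rows(x)
}

# Stack every stats table on the page; make.unique() renames repeated
# column names so that each column can be told apart.
stats_table <- function(x) {
  as_tibble(bind_rows(
    lapply(html_table(x), function(df) setNames(df, make.unique(names(df))))
  ))
}

# Return the timeline links, named by the date label shown on the site.
timeline <- function(mas_url) {
  links <- read_html(mas_url) %>% html_nodes("div.timeline span a")
  out <- links %>% html_attr("href") %>% url_absolute(mas_url)
  setNames(out, html_text(links))
}

# Read one page and prepend the date label taken from the timeline.
read_rankings <- function(url, time) {
  read_html(url) %>% {bind_cols(Date = time, batsmen(.), stats_table(.))}
}

# Use parallel processing for speed.
# plan(multiprocess) is deprecated in recent versions of future; multisession replaces it.
plan(multisession)
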
Case 1: the page has only the men's rankings
# men only
future_imap_dfr(timeline("http://bigbashboard.com/rankings/bbl/batsmen")[1:10], ~read_rankings(.x, .y))
Output
# A tibble: 996 x 15
   Date      Title Rank  Name           Points Dates                         I     R    HS   Ave    SR  `4s`  `6s` `100s` `50s`
   <chr>     <chr> <chr> <chr>          <chr>  <chr>                     <int> <int> <int> <dbl> <dbl> <int> <int>  <int> <int>
 1 8 Feb '20 men   1     Matthew Wade   125    22 Dec 2018 - 30 Jan 2020    23   943   130  44.9  155.    78    36      1     9
 2 8 Feb '20 men   2     Marcus Stoinis 120    21 Dec 2018 - 08 Feb 2020    30  1238   147  53.8  134.   111    39      1    10
 3 8 Feb '20 men   3     D'Arcy Short   116    22 Dec 2018 - 30 Jan 2020    24   994   103  49.7  137.    93    36      1     9
 4 8 Feb '20 men   4     Alex Hales     115    17 Dec 2019 - 06 Feb 2020    17   576    85  38.4  147.    59    23      0     6
 5 8 Feb '20 men   5     Aaron Finch    89     07 Jan 2019 - 27 Jan 2020    17   583   109  36.4  130.    41    24      1     4
 6 8 Feb '20 men   6     Josh Inglis    87     26 Dec 2018 - 26 Jan 2020    18   517    73  28.7  149.    53    19      0     5
 7 8 Feb '20 men   7     Travis Head    87     11 Jan 2019 - 01 Feb 2020    10   291    79  29.1  132.    22    13      0     1
 8 8 Feb '20 men   8     Josh Philippe  84     22 Dec 2018 - 08 Feb 2020    31   791    86  34.4  140.    76    23      0     7
 9 8 Feb '20 men   9     Shaun Marsh    82     24 Jan 2019 - 21 Jan 2020    15   547    96  39.1  128.    45    19      0     4
10 8 Feb '20 men   10    Chris Lynn     78     19 Dec 2018 - 27 Jan 2020    27   772    94  32.2  137.    64    44      0     6
# ... with 986 more rows
Case 2: men's and women's rankings on the same page
# men and women
future_imap_dfr(timeline("http://bigbashboard.com/rankings/batsmen")[1:10], ~read_rankings(.x, .y))
# A tibble: 2,000 x 15
   Date    Title Rank  Name              Points Dates                   I     R    HS   Ave    SR  `4s`  `6s` `100s` `50s`
   <chr>   <chr> <chr> <chr>             <chr>  <chr>               <int> <int> <int> <dbl> <dbl> <int> <int>  <int> <int>
 1 Oct '20 men   1     Lokesh Rahul      167    Nov 2018 - Oct 2020    47  1910   132  50.3  141.   171    76      2    17
 2 Oct '20 men   2     Ravija Sandaruwan 150    Jan 2019 - Feb 2020    15   577   103  44.4  166.    52    36      1     4
 3 Oct '20 men   3     David Warner      143    Jan 2019 - Sep 2020    33  1475   100  61.5  139.   128    39      2    16
 4 Oct '20 men   4     Kamran Khan       135    Jan 2019 - Feb 2020    21   630    88  31.5  135.    50    39      0     5
 5 Oct '20 men   5     Devdutt Padikkal  135    Nov 2019 - Sep 2020    15   691   122  57.6  167.    72    35      1     7
 6 Oct '20 men   6     Devon Conway      121    Dec 2018 - Jan 2020    20   906   105  56.6  145.   113    19      2     5
 7 Oct '20 men   7     Jos Buttler       121    Oct 2018 - Oct 2020    23   817    89  37.1  145.    93    32      0     8
 8 Oct '20 men   8     Virat Kohli       119    Nov 2018 - Sep 2020    35  1174   100  40.5  141.    90    43      1     8
 9 Oct '20 men   9     Kevin O'Brien     119    Oct 2018 - Sep 2020    38  1145   124  31.0  158.   107    59      1     5
10 Oct '20 men   10    Eoin Morgan       118    Oct 2018 - Oct 2020    34  1008    91  38.8  165.    69    66      0     8
# ... with 1,990 more rows
Case 3: all-rounders
# all-rounders
future_imap_dfr(timeline("http://bigbashboard.com/rankings/bbl/all-rounders")[1:10], ~read_rankings(.x, .y))
# A tibble: 547 x 13
   Date      Title Rank  Name             Points Dates                         M     R   Ave    SR     W  Econ Ave.1
   <chr>     <chr> <chr> <chr>            <chr>  <chr>                     <int> <int> <dbl> <dbl> <int> <dbl> <dbl>
 1 8 Feb '20 men   1     D'Arcy Short     70     22 Dec 2018 - 30 Jan 2020    24   994  49.7  137.    16  8.61  29.1
 2 8 Feb '20 men   2     Travis Head      49     11 Jan 2019 - 01 Feb 2020    11   291  29.1  132.     4  8.08  24.2
 3 8 Feb '20 men   3     Mohammad Nabi    40     20 Dec 2018 - 27 Jan 2020    20   388  29.8  129.    13  7.9   30.4
 4 8 Feb '20 men   4     Chris Morris     38     21 Dec 2019 - 06 Feb 2020    15   112  12.4  147.    22  8.01  19.4
 5 8 Feb '20 men   5     Glenn Maxwell    37     21 Dec 2018 - 08 Feb 2020    30   729  36.4  146.    13  7.36  31.2
 6 8 Feb '20 men   6     Daniel Sams      35     21 Dec 2018 - 06 Feb 2020    31   230   9.2  119.    45  8.19  17.3
 7 8 Feb '20 men   7     Ben Cutting      33     19 Dec 2018 - 27 Jan 2020    28   466  24.5  137.    23  8.92  27.5
 8 8 Feb '20 men   8     Mitchell Marsh   28     20 Dec 2018 - 26 Jan 2020    21   504  31.5  132.     6  9.56  43
 9 8 Feb '20 men   9     Daniel Christian 27     20 Dec 2018 - 27 Jan 2020    30   382  21.2  124.    20  8.02  27.2
10 8 Feb '20 men   10    Rashid Khan      26     19 Dec 2018 - 01 Feb 2020    29   217  14.5  158.    38  6.65  19.5
# ... with 537 more rows
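The Ave.1 column in the all-rounders output comes from make.unique() in stats_table: the stats table there appears to carry two columns both headed Ave, and duplicate names are awkward to bind and to refer to. A small base R illustration follows; the header vector is inferred from the printed column names, and the single data row reuses the rounded values printed in the first row of the output rather than anything scraped.

```r
# Header as inferred from the printed output: "Ave" occurs twice.
header <- c("Dates", "M", "R", "Ave", "SR", "W", "Econ", "Ave")

# make.unique() keeps the first occurrence and suffixes later ones.
unique_header <- make.unique(header)
print(unique_header)

# Toy one-row data frame (values as printed, i.e. rounded), renamed the
# same way stats_table() renames each scraped table.
df <- data.frame("22 Dec 2018 - 30 Jan 2020", 24L, 994L, 49.7, 137, 16L, 8.61, 29.1)
df <- setNames(df, unique_header)
print(names(df))
```

If more descriptive names are preferred, the two columns could be renamed afterwards, for example to a batting and a bowling average, assuming that is what they represent on the site.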
Q&A
How does the date work?
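The new code scrapes both the link and the date from the same timeline on the site. The link is the href attribute of each timeline entry; the date is its text. See the timeline function. This way I avoid deriving the date from the URL.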
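To make the failure concrete, and to show how the timeline labels could be turned into real Date values if that is needed later, here is a minimal base R sketch. The helper parse_timeline_label is hypothetical (it is not part of the code above), the two label formats are taken from the Date column in the outputs, and matching %b assumes an English locale.

```r
# Why the URL approach broke: for a year/month URL, the last three path
# segments include the word "batsmen", which is not part of a date.
url <- "http://bigbashboard.com/rankings/batsmen/2020/10"
parts <- tail(strsplit(url, "/")[[1L]], 3L)
date_string <- paste0(parts, collapse = "-")
print(date_string)  # "batsmen-2020-10"

# as.Date() cannot guess a format for that string and raises an error,
# which is caught here and turned into NA.
url_date <- tryCatch(as.Date(date_string), error = function(e) NA)

# Hypothetical helper: convert a timeline label to a Date.
# Labels without a day ("Oct '20") are pinned to the first of the month.
# Assumes an English locale so that %b matches "Oct", "Feb", and so on.
parse_timeline_label <- function(label) {
  has_day <- grepl("^[0-9]", label)
  full <- ifelse(has_day, label, paste("1", label))
  as.Date(full, format = "%d %b '%y")
}

parsed <- parse_timeline_label(c("Oct '20", "8 Feb '20"))
print(parsed)
```

Keeping the label as text, as read_rankings does, sidesteps the locale question entirely; converting is only worthwhile if the dates need to be sorted or filtered.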
Why did I encounter this Error: Can't recycle 'Date' (size 200) to match '..3' (size 190)?
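Because some pages (see also this link) contain tables in which the ranking list and the stats table differ in length.
That contradicts your description, in which the ranking list and the stats table always have the same number of rows.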
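One way to make such pages visible rather than fatal is to compare the row counts before binding. The sketch below uses base R and toy data frames in place of batsmen(.) and stats_table(.); safe_bind is a hypothetical helper, and skipping a mismatched page with a warning is only one possible policy.

```r
# Hypothetical helper: bind ranking rows to stats rows only when the row
# counts agree; otherwise warn and return NULL so the page can be skipped.
safe_bind <- function(date, ranks, stats) {
  if (nrow(ranks) != nrow(stats)) {
    warning(sprintf("%s: %d ranking rows but %d stats rows; page skipped",
                    date, nrow(ranks), nrow(stats)))
    return(NULL)
  }
  cbind(Date = date, ranks, stats)
}

# Toy data standing in for the scraped ranking list and stats table.
ranks <- data.frame(Rank = c("1", "2"), Name = c("A", "B"))
stats_ok <- data.frame(I = c(47L, 15L))
stats_short <- data.frame(I = 47L)

good <- safe_bind("Oct '20", ranks, stats_ok)
bad <- suppressWarnings(safe_bind("Oct '20", ranks, stats_short))
print(good)
print(is.null(bad))
```

Used inside read_rankings, a NULL result should simply be left out when the per-page results are row-bound, so the remaining pages still come through and the warnings list the pages that need a closer look.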