Scraping a table over multiple pages with rvest
I am trying to scrape a table from a website. I managed to write minimal code that gets the data from the table. See the code below:
library(rvest)   # html_node(), html_table(), and the %>% pipe
library(dplyr)   # as_tibble()
start_date <- "1947-01-01"
end_date <- "2020-12-28"
css_selector <- ".datatable"
url <- paste0("https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=&BeginDate=", start_date,"&EndDate=", end_date, "&ILChkBx=yes&InjuriesChkBx=yes&PersonalChkBx=yes&Submit=Search&start=0")
webpage <- xml2::read_html(url)
data <- webpage %>%
  rvest::html_node(css = css_selector) %>%
  rvest::html_table() %>%
  as_tibble()
# the first row of the parsed table holds the column names
colnames(data) <- as.character(unlist(data[1, ]))
data <- data[-1, ]
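However, the table is split over multiple pages, and each page shows only 25 rows.
I checked a related post, but the difference is that for the table I am working with, the link changes by the starting row number (not by the page number).
Any ideas on how to solve this would be much appreciated.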
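You can step through the results page by page using the last parameter in the URL, &start=. The search results render 25 items per page, so the start values advance in steps of 25: 0, 25, 50, 75, 100, ...
We will fetch the first 5 pages of results, 125 transactions in total. Since the first page begins at &start=0, we assign a vector, startRows, holding the starting row of each page.
We then use that vector to drive lapply() with an anonymous function that reads each page and removes the header row from it.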
library(rvest)
library(dplyr)
start_date <- "1947-01-01"
end_date <- "2020-12-28"
css_selector <- ".datatable"
# starting row of each of the first five result pages
startRows <- c(0,25,50,75,100)
pages <- lapply(startRows,function(x){
  # x is the row offset passed in the &start= parameter
  url <- paste0("https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=&BeginDate=", start_date,"&EndDate=", end_date,
                "&ILChkBx=yes&InjuriesChkBx=yes&PersonalChkBx=yes&Submit=Search&start=",x)
  webpage <- xml2::read_html(url)
  data <- webpage %>%
    rvest::html_node(css = css_selector) %>%
    rvest::html_table() %>%
    as_tibble()
  # the first row of each page holds the column names
  colnames(data) <- as.character(unlist(data[1, ]))
  # drop the header row and return the remaining 25 rows
  data[-1, ]
})
# stack the per-page tibbles into a single table
data <- do.call(rbind,pages)
head(data,n=10)
...and the output:
> head(data,n=10)
# A tibble: 10 x 5
Date Team Acquired Relinquished Notes
<chr> <chr> <chr> <chr> <chr>
1 1947-08… Bombers … "" "• Jack Underman" fractured legs (in auto accide…
2 1948-02… Bullets … "• Harry Jeannette… "" broken rib (DTD) (date approxi…
3 1949-03… Capitols "" "• Horace McKinney /… personal reasons (DTD)
4 1949-11… Capitols "" "• Fred Scolari" fractured right cheekbone (out…
5 1949-12… Knicks "" "• Vince Boryla" mumps (out ~2 weeks)
6 1950-01… Knicks "• Vince Boryla" "" returned to lineup (date appro…
7 1950-10… Knicks "" "• Goebel Ritter / T… bruised ligaments in left ankl…
8 1950-11… Warriors "" "• Andy Phillip" lacerated foot (DTD)
9 1950-12… Celtics "" "• Andy Duncan (a)" fractured kneecap (out indefin…
10 1951-12… Bullets "" "• Don Barksdale" placed on IL
>
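The hard-coded startRows vector and the repeated URL string could also be generated rather than typed out. The following is a minimal sketch, not part of the answer above: build_search_url(), page_size and n_pages are hypothetical names introduced here, and it assumes the site keeps serving 25 rows per page with the same query parameters.

```r
# Hypothetical helper (an assumption, not from the answer above): builds the
# search URL for one results page, given the starting row offset.
build_search_url <- function(start, start_date = "1947-01-01",
                             end_date = "2020-12-28") {
  paste0(
    "https://www.prosportstransactions.com/basketball/Search/SearchResults.php",
    "?Player=&Team=&BeginDate=", start_date,
    "&EndDate=", end_date,
    "&ILChkBx=yes&InjuriesChkBx=yes&PersonalChkBx=yes&Submit=Search",
    # format as an integer so large offsets never print in scientific notation
    "&start=", sprintf("%d", as.integer(start))
  )
}

page_size <- 25  # assumption: rows per results page, as observed on the site
n_pages   <- 5   # number of pages to fetch

# Offsets 0, 25, 50, 75, 100: the same values as the hard-coded startRows
startRows <- seq(from = 0, by = page_size, length.out = n_pages)

# One URL per page; no request is made at this point
urls <- vapply(startRows, build_search_url, character(1))
```

Each element of urls can then be passed to xml2::read_html() inside the lapply() call in place of the paste0() expression.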
Verifying the results
We can verify the results by printing the first and last rows of each page, starting with the last observation on page 1.
data[c(25,26,50,51,75,76,100,101,125),]
...and the output, which matches what is rendered on pages 1 - 5 of the search results when navigating the site manually.
> data[c(25,26,50,51,75,76,100,101,125),]
# A tibble: 9 x 5
Date Team Acquired Relinquished Notes
<chr> <chr> <chr> <chr> <chr>
1 1960-01-… Celtics "" "• Bill Sharma… sprained Achilles tendon (date approxima…
2 1960-01-… Celtics "" "• Jim Loscuto… sore back and legs (out indefinitely) (d…
3 1964-10-… Knicks "• Art Heyma… "" returned to lineup
4 1964-12-… Hawks "• Bob Petti… "" returned to lineup (date approximate)
5 1968-11-… Nets (ABA) "" "• Levern Tart" fractured right cheekbone (out indefinit…
6 1968-12-… Pipers (AB… "" "• Jim Harding" took leave of absence as head coach for …
7 1970-08-… Lakers "" "• Earnie Kill… dislocated left foot (out indefinitely)
8 1970-10-… Lakers "" "• Elgin Baylo… torn Achilles tendon (out for season) (d…
9 1972-01-… Cavaliers "• Austin Ca… "" returned to lineup
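The index vector used for this check can be computed instead of typed by hand, which helps when more than five pages are fetched. This is a small sketch under the assumption that every page contributes exactly 25 rows; boundary_rows() is a hypothetical helper name introduced here.

```r
# Hypothetical helper: row indices at the page boundaries of the combined
# data, i.e. the last row of each page and the first row of the next page.
boundary_rows <- function(n_pages, page_size = 25) {
  last_rows  <- seq_len(n_pages) * page_size  # 25, 50, ..., n_pages * page_size
  first_rows <- last_rows[-n_pages] + 1       # 26, 51, ...; no page follows the last
  sort(c(last_rows, first_rows))
}

idx <- boundary_rows(5)
idx
# data[idx, ]  # run this once the scraping code above has built `data`
```

For five pages, idx holds the same nine row numbers as the vector used above.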
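If we look at the last page of the table, we find that the maximum value in the page series is &start=61475. The R code to generate the entire sequence of pages (2460, matching the number of pages listed in the site's search results) is: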
# generate entire sequence of pages
pages <- c(0,seq(from=25,to=61475,by=25))
...and the output:
> head(pages)
[1] 0 25 50 75 100 125
> tail(pages)
[1] 61350 61375 61400 61425 61450 61475
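Fetching all 2460 pages takes a while, and a single failed request inside lapply() would abort the whole run. One possible way to handle this is sketched below: a loop that pauses between requests and records the pages that fail. The names scrape_pages(), fetch_page and pause are hypothetical and not part of the answer above; fetch_page stands in for the anonymous function used earlier, and the demonstration uses a stub so that it runs without network access.

```r
# Hypothetical wrapper (an assumption, not from the answer above): calls
# fetch_page() for each start offset, pauses between requests, and skips
# pages that raise an error instead of stopping.
scrape_pages <- function(start_rows, fetch_page, pause = 1) {
  results <- vector("list", length(start_rows))  # all entries start as NULL
  for (i in seq_along(start_rows)) {
    page <- tryCatch(
      fetch_page(start_rows[i]),
      error = function(e) {
        message("start=", start_rows[i], " failed: ", conditionMessage(e))
        NULL
      }
    )
    # assign only on success, so a failed page stays NULL in the list
    if (!is.null(page)) results[[i]] <- page
    Sys.sleep(pause)  # wait between requests to avoid hammering the server
  }
  ok <- !vapply(results, is.null, logical(1))
  list(
    data   = do.call(rbind, results[ok]),  # combined rows of successful pages
    failed = start_rows[!ok]               # offsets that can be retried later
  )
}

# Offline demonstration with a stub in place of the real page reader
fake_fetch <- function(start) {
  if (start == 50) stop("simulated network error")
  data.frame(start = start, row = seq_len(25))
}

out <- scrape_pages(seq(0, 100, by = 25), fake_fetch, pause = 0)
nrow(out$data)  # 100: four successful pages of 25 rows
out$failed      # 50
```

For the real site, fetch_page would contain the body of the anonymous function shown earlier, and start_rows would be the full pages sequence. The offsets in out$failed can be passed to scrape_pages() again to retry them.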