A website has a list of URLs; I need to write a loop that accesses each URL and scrapes two tables
I am ultimately trying to scrape tables from several different URLs (all within the same parent site) in R.
First, I assume I have to scrape the individual game links under "Playoff Series" on https://www.basketball-reference.com/playoffs/NBA_2017.html; the xpath of that table of links is //*[@id="all_all_playoffs"].
Then, from each individual game link (which looks like this: https://www.basketball-reference.com/boxscores/201705170BOS.html), I want to scrape the basic box score table for each team.
(I plan to repeat this for different years, so typing in the URLs each time, as I do below, is not efficient.)
So far I have only figured out how to scrape the tables from one url (one game) at a time:
library(rvest)

games <- c("201705170BOS","201705190BOS","201705210CLE","201705230CLE","201705250BOS")
urls <- paste0("https://www.basketball-reference.com/boxscores/", games, ".html")
get_table <- function(url) {
  url %>%
    read_html() %>%
    # a single xpath union grabs both teams' basic box score tables;
    # chaining two html_nodes() calls keeps only the second match
    html_nodes(xpath = '//*[@id="div_box_cle_basic"]/table[1] | //*[@id="div_box_bos_basic"]/table[1]') %>%
    html_table()
}
results <- sapply(urls, get_table)
Do you want to automatically parse the game IDs for every game on the site? If so, you will need to build a separate scraper to collect the game IDs before feeding them into your table parser.
Here is how I would do it:
Select a start date, then iterate day by day, pinging the site for each date. You can use readLines to pull back the html string for each date from:
https://www.basketball-reference.com/boxscores/?month=11&day=4&year=2017
So just loop over the month, day, and year in that link.
From that link, find the items under the hyperlink Final, i.e. the HTML text <a href="/boxscores/201711040DEN.html">Final</a>.
You can use a regular expression to parse each line and search for something like:
grep('.*<a href=\"/boxscores/.*.html\">Final</a>.*', [object], value = TRUE) %>%
  gsub('.*<a href=\"(/boxscores/.*.html)\">Final</a>.*', '\\1', .)
That will build the game links, which you can then feed into the parser above.
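A minimal sketch of that loop, under the assumptions above; the date range and the gameLinks name are illustrative, and the regex is the one from this answer:

library(magrittr)  # for %>%

# illustrative date range; widen it to cover a full season
dates <- seq(as.Date('2017-11-04'), as.Date('2017-11-06'), by = 'day')

gameLinks <- c()
for (d in as.list(dates)) {  # as.list() keeps each element a Date
  url <- sprintf('https://www.basketball-reference.com/boxscores/?month=%d&day=%d&year=%d',
                 as.integer(format(d, '%m')),
                 as.integer(format(d, '%d')),
                 as.integer(format(d, '%Y')))
  html <- readLines(url, warn = FALSE)
  # keep only lines containing a "Final" boxscore link, then extract the path
  links <- grep('.*<a href=\"/boxscores/.*.html\">Final</a>.*', html, value = TRUE) %>%
    gsub('.*<a href=\"(/boxscores/.*.html)\">Final</a>.*', '\\1', .)
  gameLinks <- c(gameLinks, links)
}

Each entry of gameLinks is a site-relative path like /boxscores/201711040DEN.html, so prepending https://www.basketball-reference.com gives the full URLs for the table parser.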
This works for me, give it a try!
library(rvest)

page <- read_html('https://www.basketball-reference.com/playoffs/NBA_2017.html')

#get all links in the playoff section
playoffs <- page %>%
  html_node('#div_all_playoffs') %>%
  html_nodes('a') %>%
  html_attr('href')

#limit to those that are actually links to boxscores
playoffs <- playoffs[grep('boxscore', playoffs)]

#loop to scrape each game
allGames <- list()
for(j in seq_along(playoffs)){
  #playoffs[j] already starts with "/", so paste it onto the bare domain
  box <- read_html(paste0('https://www.basketball-reference.com', playoffs[j]))

  #tables are named after the teams playing; get all div ids to find the ones we want
  atrs <- box %>%
    html_nodes('div') %>%
    html_attr('id')

  #limit to ids that include both "basic" and "all" (e.g. "all_box_cle_basic")
  basicIds <- atrs[grep('basic', atrs)] %>%
    .[grep('all', .)]

  #loop to scrape both tables (1 for each team)
  teams <- list()
  for(i in seq_along(basicIds)){
    #grab table for team
    table <- box %>%
      html_node(paste0('#', basicIds[i])) %>%
      html_node('.stats_table') %>%
      html_table()

    #split table into starters and reserves tables
    startReserve <- which(table[,1] == 'Reserves')
    starters <- table[2:(startReserve - 1),]
    colnames(starters) <- table[1,]
    reserves <- table[(startReserve + 1):nrow(table),]
    colnames(reserves) <- table[startReserve,]

    #extract team name from the div id (the backreference must be written "\\1" in R)
    team <- gsub('all_box_(.+)_basic', '\\1', basicIds[i])

    #make named list using team name
    assign(team, setNames(list(starters, reserves), c('starters', 'reserves')))
    teams[[i]] <- team
  }

  #find game identifier
  game <- gsub('/boxscores/(.+)\\.html', '\\1', playoffs[j])

  #make list of both teams, name list using game identifier
  assign(paste0('game_', game),
         setNames(list(eval(parse(text = teams[[1]])), eval(parse(text = teams[[2]]))),
                  c(teams[[1]], teams[[2]])))

  #add to allGames
  allGames <- append(allGames,
                     setNames(list(eval(parse(text = paste0('game_', game)))),
                              paste0('game_', game)))
}

#clean up everything but allGames
rm(list = ls()[-grep('allGames', ls())])
The output is a list of lists. That is not great, but the data you want is inherently hierarchical: each game has 2 teams, and each team has 2 tables (starters and reserves). So the final object looks like:
-allGames
----Game1
--------Team1
------------starters
------------reserves
--------Team2
------------starters
------------reserves
----Game2
...
For example, to show the table of Cleveland's starters in the last game of the Finals:
> allGames$game_201706120GSW$cle$starters
Starters MP FG FGA FG% 3P 3PA 3P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS +/-
2 LeBron James 46:13 19 30 .633 2 5 .400 1 4 .250 2 11 13 8 2 1 2 3 41 -13
3 Kyrie Irving 41:47 9 22 .409 1 2 .500 7 7 1.000 1 1 2 6 2 0 4 3 26 +4
4 J.R. Smith 40:49 9 11 .818 7 8 .875 0 1 .000 0 3 3 1 0 2 0 2 25 -2
5 Kevin Love 29:55 2 8 .250 0 3 .000 2 5 .400 3 7 10 2 0 1 0 2 6 -23
6 Tristan Thompson 29:52 6 8 .750 0 0 3 4 .750 4 4 8 3 1 1 3 1 15 -7
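If you would rather end up with one flat data frame than navigate nested lists, here is a minimal sketch of flattening allGames; the flatten_games helper and the Player/game/team/role column names are illustrative, not part of the scraper above, and it assumes every table carries the same stat columns:

# walk the nested list and row-bind every table, tagging each row
# with its game, team, and role (starters/reserves)
flatten_games <- function(allGames) {
  rows <- list()
  for (game in names(allGames)) {
    for (team in names(allGames[[game]])) {
      for (role in names(allGames[[game]][[team]])) {
        tbl <- allGames[[game]][[team]][[role]]
        names(tbl)[1] <- 'Player'  # first header differs ('Starters' vs 'Reserves'); unify it
        tbl$game <- game
        tbl$team <- team
        tbl$role <- role
        rows[[length(rows) + 1]] <- tbl
      }
    }
  }
  do.call(rbind, rows)
}

boxScores <- flatten_games(allGames)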