How to get table using rvest()

Question

I want to grab some data from the Pro Football Reference website using the rvest package. First, let's grab the results of all games played in 2015 from this URL: http://www.pro-football-reference.com/years/2015/games.htm

library("rvest")
library("dplyr")

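# Grab the table of game results
url <- "http://www.pro-football-reference.com/years/2015/games.htm"
urlHtml <- url %>% read_html()
# html_table() returns a list of tables; keep the first one as a tibble
# (as_tibble() replaces the deprecated as_data_frame())
dat <- urlHtml %>% html_table(header = TRUE) %>% .[[1]] %>% as_tibble()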

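Is this how you would have done it? :)

dat could be cleaned up a bit. Two of the variables seem to have blank names. Also, the header row is repeated between each week.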

colnames(dat) <- c("week", "day", "date", "winner", "at", "loser", 
                   "box", "ptsW", "ptsL", "ydsW", "toW", "ydsL", "toL")

dat2 <- dat %>% filter(!(box == ""))
head(dat2)

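Looks good!

Next, let's look at an individual game. On the page above, click "Boxscore" in the first row of the table: the September 10 game between New England and Pittsburgh. That takes us here: http://www.pro-football-reference.com/boxscores/201509100nwe.htm.

I want to grab the individual snap counts for each player (about halfway down the page). I'm pretty sure these will be our first two lines of code: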

gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- gameUrl %>% read_html()

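But now I can't figure out how to grab the specific table I want. I used SelectorGadget to highlight the Patriots snap counts table. I did this by clicking on the table in several places, then 'unclicking' the other tables that were highlighted. I ended up with this path: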

#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left

Every one of these attempts returns {xml_nodeset (0)}

gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right")
gameHtml %>% html_nodes("#home_snap_counts")

Maybe let's try using xpath. All of these attempts also return {xml_nodeset (0)}

gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "right", " " ))] | //*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "tooltip", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]')

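How can I grab that table? I'll also point out that when I do "View Page Source" in Google Chrome, the tables I want almost seem to be commented out. That is, they are shown in green instead of the usual red/black/blue color scheme. That is not the case for the game results table we pulled first: in "View Page Source", that table has the usual red/black/blue color scheme. Does the green indicate what is preventing me from grabbing this snap count table?

Thanks!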

Answer 1

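The information you are looking for is displayed programmatically at run time. One solution is to use RSelenium.

When you look at the page source, the data for these tables is present in the markup but hidden, because the tables are stored as comments. Here is my solution: I remove the comment markers and then reprocess the page normally.

I save the file to the working directory and then read it back in with the readLines function.
Next, I search for the HTML begin and end comment markers and remove them. I then save the file a second time (minus the comment markers) so it can be re-read and processed for the selected tables.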
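To make the "stored as comments" point concrete, here is a small sketch on a toy page. The markup is hypothetical and only mimics the situation described above: one live table and one table wrapped in an HTML comment. It shows that a CSS selector cannot see the commented-out table, while the same markup is still reachable as the text of a comment node.

```r
library(rvest)

# Toy page (hypothetical markup): one live table, one table inside a comment
page <- read_html('<html><body>
<table id="scoring"><tr><th>Team</th></tr><tr><td>NWE</td></tr></table>
<!--
<table id="home_snap_counts"><tr><th>Player</th><th>Num</th></tr><tr><td>Player A</td><td>10</td></tr></table>
-->
</body></html>')

# The commented-out table is not an element, so the selector matches nothing
n_live <- length(html_nodes(page, "#home_snap_counts"))

# The markup is still there, as the text of a comment node
comments <- html_nodes(page, xpath = "//comment()")
in_comment <- grepl("home_snap_counts", html_text(comments), fixed = TRUE)
```

This would be consistent with the empty {xml_nodeset (0)} results in the question, assuming the real page wraps the snap count tables in the same way.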

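library(rvest)
library(xml2)  # provides write_xml()

gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- gameUrl %>% read_html()
# Only the tables that are not commented out show up here
gameHtml %>% html_nodes("tbody")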

#Only save and work with the body
body<-html_node(gameHtml,"body")
write_xml(body, "nfl.xml")

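#Find and remove the lines that hold the comment markers
lines <- readLines("nfl.xml")
# grepl() keeps every line when nothing matches; lines[-grep(...)] would drop them all
lines <- lines[!grepl("<!--", lines, fixed = TRUE)]
lines <- lines[!grepl("-->", lines, fixed = TRUE)]
writeLines(lines, "nfl2.xml")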

#Read the file back in and process normally
body<-read_html("nfl2.xml")
html_table(html_nodes(body, "table")[29])

#extract each table's id attribute by name rather than by position
ids <- html_attr(html_nodes(body, "table"), "id")

#find the tables of interest.
homesnap <- which(ids == "home_snap_counts")
html_table(html_nodes(body, "table")[homesnap])

visitsnap <- which(ids == "vis_snap_counts")
html_table(html_nodes(body, "table")[visitsnap])
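As a possible alternative that avoids the intermediate files, the comment nodes can be pulled out with XPath and their text parsed as a second HTML document. This is a minimal sketch, not the method used above: read_commented_html is a hypothetical helper name, and the toy page and its player rows are made-up placeholders so the example runs without network access.

```r
library(rvest)

# Hypothetical helper: parse whatever is stored inside HTML comments
# as a document of its own
read_commented_html <- function(page) {
  comment_text <- html_text(html_nodes(page, xpath = "//comment()"))
  read_html(paste(comment_text, collapse = ""))
}

# Toy stand-in for the box score page (placeholder data, not real snap counts)
page <- read_html('<html><body>
<!--
<table id="home_snap_counts">
<tr><th>Player</th><th>Num</th></tr>
<tr><td>Player A</td><td>10</td></tr>
<tr><td>Player B</td><td>7</td></tr>
</table>
-->
</body></html>')

hidden <- read_commented_html(page)

# Select the table by id, exactly as one would for a live table
home_snaps <- html_table(html_node(hidden, "#home_snap_counts"), header = TRUE)
```

For the real page, passing gameHtml from the question in place of the toy page should work the same way, assuming the site still ships these tables inside comments. The helper would fail on a page with no comments at all, since there would be nothing to parse.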

