R webscraping, unsure how to proceed

As a side project, I'm trying to collect NFL player statistics that are relevant to fantasy football. I found a URL that has the data I want: https://www.cbssports.com/fantasy/football/stats/QB/2020/ytd/stats/ppr/

I'm trying to scrape it in R, but without success. I've tried a lot of things, and the closest I've gotten is:

Test1 <- read_html("https://www.cbssports.com/fantasy/football/stats/QB/2020/season/projections/ppr/") %>% html_nodes('.TableBase-bodyTr')

That's the code I have so far, and this is the result:

Test1
{xml_nodeset (69)}
 [1] <tr class="TableBase-bodyTr">\n<td class="TableBase-bodyTd \n                \n                \n                \n                ">\n                    <span class="CellPlayerName--sho ...
 [2] <tr class="TableBase-bodyTr">\n<td class="TableBase-bodyTd \n                \n                \n                \n                ">\n                    <span class="CellPlayerName--sho ...
 [3] <tr class="TableBase-bodyTr">\n<td class="TableBase-bodyTd \n                \n                \n                \n                ">\n                    <span class="CellPlayerName--sho ...
 [4] <tr class="TableBase-bodyTr">\n<td class="TableBase-bodyTd \n                \n                \n                \n                ">\n                    <span class="CellPlayerName--sho ...

I tried piping it into html_text(), and the result was:

[65] "\n                    \n                        \n                        \n            \n                                                                                                    \n            J. Eason\n    \n                                        \n                                    \n                        QB\n                    \n                    \n                                    \n                        IND\n                    \n                                \n                \n                \n                            \n        \n        \n            

The relevant information is embedded in there, but it's a mess. I also tried using html_table() on it, but that produced an error.

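Now, if I use the View function on "Test1", I can drill down through several layers of data and find what I'm looking for, but what I'd like to figure out is how to get at that data directly.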

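I'm not quite sure where to go from here. If anyone could give me some pointers, I'd greatly appreciate it. My familiarity with HTML is very low. I'm trying to read more about it and understand it, but from inspecting the page, what I was able to gather is that the data is stored in the class "TableBase-bodyTr", which is why I pointed the nodes there.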

There is something odd about the table's format that causes html_table() to fail. I'm not quite sure how to correct that.
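On the messy html_text() output from the question: the text itself is intact, it is just padded with the whitespace used to indent the HTML. Here is a minimal sketch of collapsing that whitespace with base R. The HTML fragment is a made-up stand-in that imitates the padded markup, not the real page.

```r
library(rvest)

# Hypothetical row fragment imitating the padded markup shown in the question.
html <- '<table><tr class="TableBase-bodyTr"><td>
            <span>
                J. Eason
            </span>
            <span>
                QB
            </span>
            <span>
                IND
            </span>
        </td></tr></table>'

# Same selector as in the question, applied to the stand-in fragment.
raw_text <- read_html(html) %>%
  html_nodes(".TableBase-bodyTr") %>%
  html_text()

# Collapse every run of whitespace (spaces, newlines) into a single space,
# then trim the leading and trailing space.
clean_text <- trimws(gsub("\\s+", " ", raw_text))
```

This gives one readable string per row, but the fields are still glued together, so selecting the individual elements is usually the more robust route. Newer versions of rvest also provide html_text2(), which does a similar cleanup.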

Here is an alternative approach that grabs the contents of the rows and then builds a data frame.

library(rvest)
page <- read_html("https://www.cbssports.com/fantasy/football/stats/QB/2020/season/projections/ppr/") 

# find the rows of the table
rows <- page %>% html_nodes('tr')

# the first 2 rows are header information, so skip them
# get the player name (both the short and long versions)
playername <- rows[-c(1, 2)] %>% html_nodes('td span span a') %>% html_text() %>% trimws()
playername <- matrix(playername, ncol = 2, byrow = TRUE)

# get the position and team
position <- rows[-c(1, 2)] %>% html_nodes('span.CellPlayerName-position') %>% html_text() %>% trimws()
team <- rows[-c(1, 2)] %>% html_nodes('span.CellPlayerName-team') %>% html_text() %>% trimws()

# get the stats from the table (16 cells per row; the first is the player cell)
cols <- rows[-c(1, 2)] %>% html_nodes('td') %>% html_text() %>% trimws()
stats <- matrix(cols, ncol = 16, byrow = TRUE)

# make the final answer, dropping the player cell since it was parsed above
answer <- data.frame(playername, position, team, stats[, -1])
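# still need to rename the columns
# ATT, YDS and TD each appear twice; the PASS_/RUSH_ prefixes keep the names
# unique and assume the page lists passing stats first, then rushing
statnames <- c("Name_s", "Name_l", "position", "team", 'GP', 'PASS_ATT', 'CMP', 'PASS_YDS', 'YDS/G', 'PASS_TD', 'INT', 'RATE', 'RUSH_ATT', 'RUSH_YDS', 'AVG', 'RUSH_TD', 'FL', 'FPTS', 'FPPG')
names(answer) <- statnames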

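This will get you 95% of the way there. I didn't attempt to retrieve the column names automatically from the web page; manually copying, pasting, and assigning them was easier.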
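If you do want to pull the column names from the page, one possible approach is to read the last header row and de-duplicate the labels. This is only a sketch run against a small made-up table: the `thead`/`th` selectors and the two-row header layout are assumptions, so inspect the real page and adjust them before relying on it.

```r
library(rvest)

# Hypothetical stand-in for the real page: a table with a two-row header
# (a group row, then the column labels) and repeated label names.
html <- '
<table>
  <thead>
    <tr><th></th><th colspan="2">Passing</th><th colspan="2">Rushing</th></tr>
    <tr><th>Player</th><th> ATT </th><th> YDS </th><th> ATT </th><th> YDS </th></tr>
  </thead>
  <tbody>
    <tr><td>A. Player</td><td>500</td><td>4000</td><td>40</td><td>150</td></tr>
    <tr><td>B. Player</td><td>450</td><td>3500</td><td>90</td><td>600</td></tr>
  </tbody>
</table>'
page <- read_html(html)

# Take the last header row, which holds one label per column.
header_rows <- page %>% html_nodes("thead tr")
labels <- header_rows[[length(header_rows)]] %>%
  html_nodes("th") %>%
  html_text() %>%
  trimws()

# Flatten the body cells and reshape them, one table row per matrix row.
cells <- page %>% html_nodes("tbody td") %>% html_text() %>% trimws()
stats <- matrix(cells, ncol = length(labels), byrow = TRUE)

# Repeated labels (ATT, YDS) get a numeric suffix so every name is unique.
result <- as.data.frame(stats, stringsAsFactors = FALSE)
names(result) <- make.unique(labels, sep = "_")
```

Using `length(labels)` for the column count avoids hard-coding the number of columns, at the cost of depending on the header and body rows having the same number of cells.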