R web scraping, unsure how to proceed
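As a side project, I am trying to collect NFL player statistics relevant to fantasy football. I found a URL that has the data I want:
https://www.cbssports.com/fantasy/football/stats/QB/2020/ytd/stats/ppr/
I am trying to scrape it in R, without success. I have tried many things, and the closest I have gotten is: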
Test1 <- read_html("https://www.cbssports.com/fantasy/football/stats/QB/2020/season/projections/ppr/") %>% html_nodes('.TableBase-bodyTr')
That is the code I have so far, and this is the result:
Test1
{xml_nodeset (69)}
[1] <tr class="TableBase-bodyTr">\n<td class="TableBase-bodyTd \n \n \n \n ">\n <span class="CellPlayerName--sho ...
[2] <tr class="TableBase-bodyTr">\n<td class="TableBase-bodyTd \n \n \n \n ">\n <span class="CellPlayerName--sho ...
[3] <tr class="TableBase-bodyTr">\n<td class="TableBase-bodyTd \n \n \n \n ">\n <span class="CellPlayerName--sho ...
[4] <tr class="TableBase-bodyTr">\n<td class="TableBase-bodyTd \n \n \n \n ">\n <span class="CellPlayerName--sho ...
I tried piping that into html_text(), and the result is:
[65] "\n \n \n \n \n \n J. Eason\n \n \n \n QB\n \n \n \n IND\n \n \n \n \n \n \n \n
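The relevant information is embedded in there, but it is a mess. I also tried using html_table() on it, but got an error.
Now, if I use the View function on "Test1", I can drill down through several levels of data and find what I am looking for, but what I want to figure out is how to get at that data directly.
I am not sure where to go from here. If anyone could give me some pointers, I would greatly appreciate it. My familiarity with HTML is very low. I am trying to read more about it and understand it, but from inspecting the page, what I could gather is that the data is stored in the class "TableBase-bodyTr", which is why I pointed the nodes there.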
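Something about the table's formatting is odd, which causes html_table() to fail. I am not sure how to correct that.
Here is an alternative approach that scrapes the contents of the rows and then builds a data frame from them.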
library(rvest)
page <- read_html("https://www.cbssports.com/fantasy/football/stats/QB/2020/season/projections/ppr/")
# find the rows of the table
rows <- page %>% html_nodes('tr')
# the first 2 rows are header information, so skip them
# get the player name (both the short and the long version)
playername <- rows[-c(1, 2)] %>% html_nodes('td span span a') %>% html_text() %>% trimws()
playername <- matrix(playername, ncol = 2, byrow = TRUE)
# get the team and position
position <- rows[-c(1, 2)] %>% html_nodes('span.CellPlayerName-position') %>% html_text() %>% trimws()
team <- rows[-c(1, 2)] %>% html_nodes('span.CellPlayerName-team') %>% html_text() %>% trimws()
# get the stats from the table (16 cells per row; the first cell is the player cell)
cols <- rows[-c(1, 2)] %>% html_nodes('td') %>% html_text() %>% trimws()
stats <- matrix(cols, ncol = 16, byrow = TRUE)
# make the final answer, dropping the player cell that was already parsed above
answer <- data.frame(playername, position, team, stats[, -1])
# still need to rename the columns
# note: ATT, YDS and TD each appear twice in this list
statnames <- c("Name_s", "Name_l", "position", "team", 'GP', 'ATT', 'CMP', 'YDS', 'YDS/G', "TD", 'INT', 'RATE', 'ATT', 'YDS', 'AVG', 'TD', 'FL', 'FPTS', "FPPG")
names(answer) <- statnames
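This gets you 95% of the way there. I did not try to retrieve the column names from the web page automatically; copying and pasting them by hand and assigning them is easier.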
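If you do want to pull the column names from the page rather than typing them in, something along these lines might work. This is only a minimal sketch run against an inline, made-up table: the markup, the player names, and the numbers are all hypothetical stand-ins, and it assumes the column labels sit in a single row of `<th>` cells, which may not hold for the real page (it appears to have two header rows). `make.unique()` is used so that repeated labels such as YDS and TD do not collide.

```r
library(rvest)

# Hypothetical, simplified stand-in for the page's table. The real markup
# is more deeply nested and may differ; the values here are made up.
html <- '
<table>
  <tr><th>Player</th><th>YDS</th><th>TD</th><th>YDS</th><th>TD</th></tr>
  <tr><td><a>Player A</a></td><td> 10 </td><td> 1 </td><td> 5 </td><td> 0 </td></tr>
  <tr><td><a>Player B</a></td><td> 20 </td><td> 2 </td><td> 7 </td><td> 1 </td></tr>
</table>'

page <- read_html(html)
rows <- page %>% html_nodes("tr")

# Column labels: the text of the <th> cells, with surrounding whitespace removed
header <- rows %>% html_nodes("th") %>% html_text() %>% trimws()

# Data cells: the <td> cells, reshaped to one row per player
cells <- rows %>% html_nodes("td") %>% html_text() %>% trimws()
stats <- matrix(cells, ncol = length(header), byrow = TRUE)

answer <- as.data.frame(stats, stringsAsFactors = FALSE)

# Repeated labels get a numeric suffix so every column name is distinct
names(answer) <- make.unique(header)
```

All columns come back as character here, so numeric columns would still need converting (for example with as.numeric()) before doing any arithmetic on them.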