Same webscrape code works on one page, not another using rvest

Question

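I built a simple scraper to get a data frame containing the 2020 NFL draft results. I intend to map this code over several years of results, but for some reason, when I change the single-page scrape to any year other than 2020, I get the error shown at the bottom.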

library(tidyverse)
library(rvest)
library(httr)
library(curl)

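This scrape for 2020 works flawlessly, although the column names end up in row 1. That is not a big deal to me since I can handle it later (mentioning it in case it is related to the problem):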

x <- "https://www.pro-football-reference.com/years/2020/draft.htm"
df <- read_html(curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>% 
        html_nodes("table") %>% 
        html_table() %>%
        as.data.frame()

Below, the URL has been changed from 2020 to 2019, which is a live page with a table in the same format. For some reason, the same call as above does not work:

x <- "https://www.pro-football-reference.com/years/2019/draft.htm"
df <- read_html(curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>% 
        html_nodes("table") %>% 
        html_table() %>%
        as.data.frame()

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
arguments imply differing number of rows: 261, 2

Answer 1

There are two tables at the URL provided: the core draft (table 1, id = "drafts") and the supplemental draft (table 2, id = "drafts_supp").
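One way to confirm this without opening the browser inspector is to list the id attribute of every table that rvest finds. The sketch below runs on a small made-up page (the rows and the `page` object are illustrative assumptions, not the real site's data), so it works offline:

```r
library(rvest)

# Toy stand-in for the draft page: two tables carrying the ids named above.
# All rows are invented for illustration.
page <- read_html('<html><body>
  <table id="drafts">
    <tr><th>Pick</th><th>Player</th></tr>
    <tr><td>1</td><td>Player A</td></tr>
    <tr><td>2</td><td>Player B</td></tr>
    <tr><td>3</td><td>Player C</td></tr>
  </table>
  <table id="drafts_supp">
    <tr><th>Rnd</th><th>Player</th><th>Tm</th></tr>
    <tr><td>5</td><td>Player D</td><td>XYZ</td></tr>
    <tr><td>6</td><td>Player E</td><td>XYZ</td></tr>
  </table>
</body></html>')

tables <- html_nodes(page, "table")

# The id attribute of each table on the page
table_ids <- html_attr(tables, "id")

# The number of rows in each parsed table
table_rows <- vapply(html_table(tables), nrow, integer(1))

print(table_ids)
print(table_rows)
```

On the real page, the same two calls applied to the result of read_html(x) should show both ids and the row count of each table.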

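The as.data.frame() call fails because it tries to combine the two tables column-wise, but they have different column names and, as the error message shows, different numbers of rows (261 and 2). By giving html_node() an xpath or a CSS selector, you can tell rvest to read only the specific table you are interested in. You can find the xpath or selector by inspecting that table: right-click > Inspect in Chrome or Firefox. Note that for a selector to match on id you need to use #drafts rather than just drafts, and for xpath you generally have to wrap the expression in single quotes, because the expression itself contains double quotes.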

This works: html_node(xpath = '//*[@id="drafts"]')
This does not, because the inner double quotes end the string early: html_node(xpath = "//*[@id="drafts"]")
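The failure and the fix can be reproduced offline with a small sketch. The page content here is invented, and the toy tables deliberately have 3 and 2 rows: data.frame() recycles a shorter input when the row counts divide evenly, so counts like 3 and 1 would not trigger the error. The last lines also show escaping as an alternative to single quotes.

```r
library(rvest)

# Toy page with two tables of different shapes (invented data).
page <- read_html('<html><body>
  <table id="drafts">
    <tr><th>Pick</th><th>Player</th></tr>
    <tr><td>1</td><td>Player A</td></tr>
    <tr><td>2</td><td>Player B</td></tr>
    <tr><td>3</td><td>Player C</td></tr>
  </table>
  <table id="drafts_supp">
    <tr><th>Rnd</th><th>Player</th><th>Tm</th></tr>
    <tr><td>5</td><td>Player D</td><td>XYZ</td></tr>
    <tr><td>6</td><td>Player E</td><td>XYZ</td></tr>
  </table>
</body></html>')

# Reproduce the problem: as.data.frame() receives a list of BOTH tables
# and tries to bind them side by side, which fails for 3 vs 2 rows.
all_tables <- page %>% html_nodes("table") %>% html_table()
result <- tryCatch(as.data.frame(all_tables), error = function(e) e)
failed <- inherits(result, "error")

# Fix: select one table by id, with a CSS selector or with XPath.
by_css   <- page %>% html_node("#drafts") %>% html_table()
by_xpath <- page %>% html_node(xpath = '//*[@id="drafts"]') %>% html_table()

# Escaping the inner double quotes gives exactly the same XPath string
# as wrapping the expression in single quotes.
xpath_single  <- '//*[@id="drafts"]'
xpath_escaped <- "//*[@id=\"drafts\"]"
same_xpath <- identical(xpath_single, xpath_escaped)
```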

Also note that I think the html_nodes("table") in your example is unnecessary, since html_table() already selects only tables.
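A quick way to check that claim against your installed rvest version is to compare the two calls on a small made-up page (the content below is an assumption for illustration):

```r
library(rvest)

# Toy page with two tiny tables (invented data).
page <- read_html('<html><body>
  <table id="drafts">
    <tr><th>Pick</th></tr><tr><td>1</td></tr><tr><td>2</td></tr>
  </table>
  <table id="drafts_supp">
    <tr><th>Rnd</th></tr><tr><td>5</td></tr>
  </table>
</body></html>')

# With the explicit html_nodes("table") step
via_nodes <- page %>% html_nodes("table") %>% html_table()

# Calling html_table() on the whole document
direct <- page %>% html_table()

# Both should be lists holding the same tables
same_tables <- isTRUE(all.equal(via_nodes, direct))
```

Either way the result is a list with one element per table, which is why passing it straight to as.data.frame() is the step that breaks.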

x <- "https://www.pro-football-reference.com/years/2019/draft.htm"

raw_html <- read_html(x)

# use xpath
raw_html %>% 
  html_node(xpath = '//*[@id="drafts"]') %>%
  html_table()

# use selector
raw_html %>% 
  html_node("#drafts") %>% 
  html_table()
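Since the goal in the question is to map over several years, the selector approach could be wrapped as in the sketch below. `draft_url()` and `core_draft()` are hypothetical helper names, the URL pattern is inferred from the two URLs in the question, and the toy page is invented so the example runs offline. The networked part is left commented out.

```r
library(rvest)

# Hypothetical helper: build the draft URL for a year, following the
# pattern of the 2019 and 2020 URLs in the question.
draft_url <- function(year) {
  sprintf("https://www.pro-football-reference.com/years/%d/draft.htm", year)
}

# Hypothetical helper: keep only the core draft table from a parsed page.
core_draft <- function(page) {
  page %>% html_node("#drafts") %>% html_table()
}

# Offline check on an invented page that also has a supplemental table.
toy_page <- read_html('<html><body>
  <table id="drafts">
    <tr><th>Pick</th><th>Player</th></tr>
    <tr><td>1</td><td>Player A</td></tr>
    <tr><td>2</td><td>Player B</td></tr>
    <tr><td>3</td><td>Player C</td></tr>
  </table>
  <table id="drafts_supp">
    <tr><th>Rnd</th><th>Player</th></tr>
    <tr><td>5</td><td>Player D</td></tr>
    <tr><td>6</td><td>Player E</td></tr>
  </table>
</body></html>')
toy_result <- core_draft(toy_page)

# Real use needs network access, so it is not run here:
# years  <- 2018:2020
# drafts <- lapply(years, function(y) core_draft(read_html(draft_url(y))))
# names(drafts) <- years
```

Keeping the results in a named list, rather than combining them immediately, makes it easier to fix the header row per year before binding the years together.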

r

web-scraping

rvest