Error when using purrr to scrape multiple pages

I am trying to scrape several web pages that are set up in a similar way (for example: https://www.foreign.senate.gov/hearings/120314am). The function I created works for a single url, but I get an error when I try to map it over multiple pages.

Here is a simplified version of the function.

scrape <- function(url){
  url <- paste0("https://www.foreign.senate.gov/hearings/", hearing_name)

  product <- url %>%
    read_html() %>%
    html_nodes("#main_column")

  names <- product %>%
    html_nodes(".fn") %>%
    html_text() %>%
    gsub("\n", "", .) %>%
    gsub("\t", "", .)

  tibble(Witness_Name = names)
}

The error appears when I store the hearing names in an object and try to map over them.

hearing_name <- c("the-ebola-epidemic-the-keys-to-success-for-the-international-response",
                  "120314am")

map_df(hearing_name, scrape)


Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : 
Expecting a single string value: [type=character; extent=2]. 

I have tried using lapply() and restructuring the code into a more minimal approach, but without success. I hope someone can help me!

Inside the function, the global hearing_name is hard-coded instead of the 'url' argument, so paste0() builds a character vector of length 2 (one URL per element of hearing_name) and read_html() fails because it expects a single string:

url <- paste0("https://www.foreign.senate.gov/hearings/", hearing_name)

If we change it to url,

library(rvest)
library(purrr)
library(tibble)

scrape <- function(url){
  url <- paste0("https://www.foreign.senate.gov/hearings/", url)

  product <- url %>%
    read_html() %>%
    html_nodes("#main_column")

  names <- product %>%
    html_nodes(".fn") %>%
    html_text() %>%
    gsub("\n", "", .) %>%
    gsub("\t", "", .)

  tibble(Witness_Name = names)
}

the code works as expected:

out <- map_df(hearing_name, scrape)
dim(out)
#[1] 8 1
out
# A tibble: 8 x 1
#  Witness_Name        
#  <chr>               
#1 Ellen JohnsonSirleaf
#2 PaulFarmer          
#3 AnnePeterson        
#4 PapeGaye            
#5 JavierAlvarez       
#6 DanielRussel        
#7 Richard C.Bush III  
#8 SophieRichardson
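
As a side note, when mapping over many hearing pages it can help to guard against an occasional failed request. Below is a minimal sketch (not part of the original fix) that wraps scrape() with purrr::possibly(), so a page that cannot be read returns an empty tibble instead of aborting the whole map_df() call; the safe_scrape name is just illustrative.

library(rvest)
library(purrr)
library(tibble)

# Hypothetical wrapper: if scrape() errors on a page (e.g. read_html()
# fails), return an empty tibble with the same column instead of stopping.
safe_scrape <- possibly(scrape, otherwise = tibble(Witness_Name = character()))

out <- map_df(hearing_name, safe_scrape)

This keeps the shape of the output predictable even if one of the hearing URLs is temporarily unavailable.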