Error when using purrr to scrape multiple pages
I'm trying to scrape several web pages that are all laid out the same way (for example: https://www.foreign.senate.gov/hearings/120314am). The function I wrote works when given a single url, but I get an error when I try to map it over multiple pages.
Here is a simplified version of the function.
library(rvest)
library(purrr)
library(tibble)

scrape <- function(url){
  url <- paste0("https://www.foreign.senate.gov/hearings/", hearing_name)
  product <- url %>%
    read_html() %>%
    html_nodes("#main_column")
  names <- product %>%
    html_nodes(".fn") %>%
    html_text() %>%
    gsub("\n", "", .) %>%
    gsub("\t", "", .)
  tibble(Witness_Name = names)
}
When I store the urls in an object and try to map over them, I get an error.
hearing_name <- c("the-ebola-epidemic-the-keys-to-success-for-the-international-response",
"120314am")
map_df(hearing_name, scrape)
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Expecting a single string value: [type=character; extent=2].
I've also tried lapply() and restructuring things into a more minimal approach, without success. I'd appreciate any help!
Inside the function, hearing_name is hard-coded instead of the 'url' argument:
url <- paste0("https://www.foreign.senate.gov/hearings/", hearing_name)
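That hard-coded reference is also what produces the error message: paste0() is vectorized, so with the full hearing_name vector it builds a character vector of length 2, while read_html() (via xml2's doc_parse_file()) accepts only a single string, hence "Expecting a single string value: [type=character; extent=2]". A minimal base-R illustration:

```r
hearing_name <- c("the-ebola-epidemic-the-keys-to-success-for-the-international-response",
                  "120314am")

# paste0() recycles over the whole vector, so `url` ends up
# length 2 -- not the single string read_html() expects
url <- paste0("https://www.foreign.senate.gov/hearings/", hearing_name)
length(url)
#> 2
```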
If we change it to url:
scrape <- function(url){
  url <- paste0("https://www.foreign.senate.gov/hearings/", url)
  product <- url %>%
    read_html() %>%
    html_nodes("#main_column")
  names <- product %>%
    html_nodes(".fn") %>%
    html_text() %>%
    gsub("\n", "", .) %>%
    gsub("\t", "", .)
  tibble(Witness_Name = names)
}
the code works as expected:
out <- map_df(hearing_name, scrape)
dim(out)
#[1] 8 1
out
# A tibble: 8 x 1
# Witness_Name
# <chr>
#1 Ellen JohnsonSirleaf
#2 PaulFarmer
#3 AnnePeterson
#4 PapeGaye
#5 JavierAlvarez
#6 DanielRussel
#7 Richard C.Bush III
#8 SophieRichardson
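As a side note, the two chained gsub() calls can be collapsed into a single pass with a character class (assuming only newlines and tabs need stripping, as in the original). A sketch of just that cleanup step:

```r
# sample of the raw html_text() output, padded with layout whitespace
raw <- "\n\t\tEllen Johnson Sirleaf\n"

# one regex with a character class replaces both gsub() calls
clean <- gsub("[\n\t]", "", raw)
clean
#> "Ellen Johnson Sirleaf"
```

In recent purrr versions, map_dfr() is the more explicit spelling of map_df() for row-binding results into one tibble.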