Scrape multiple URLs with rvest

How do I scrape multiple URLs when using read_html from rvest? The goal is to get a single document containing the body text from each URL, on which I would then run various analyses.

I tried concatenating the URLs:

    library(rvest)

    url <- c("https://www.vox.com/", "https://www.cnn.com/")
    page <- read_html(url)   # fails: read_html() expects a single string
    page
    story <- page %>%
      html_nodes("p") %>%
      html_text()

read_html throws the following error:

    Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : 
    Expecting a single string value: [type=character; extent=3].

This is not surprising, since read_html presumably handles only one path at a time. Is there a different function or transformation I can use so that multiple pages can be scraped at once?

You can use map from purrr (or lapply in base R) to loop over each element of url; here is an example (a base R equivalent is sketched after the output below).

    library(rvest)
    library(purrr)

    url <- c("https://www.vox.com/", "https://www.bbc.com/")

    # scrape each URL in turn; each element of the result holds
    # the <p> text from one page
    page <- map(url, ~ read_html(.x) %>% html_nodes("p") %>% html_text())
    str(page)
#List of 2
# $ : chr [1:22] "But he was acquitted on the two most serious charges he faced." "Health experts say it’s time to prepare for worldwide spread on all continents." "Wall Street is waking up to the threat of coronavirus as fears about the disease and its potential global econo"| __truncated__ "Johnson, who died Monday at age 101, did groundbreaking work in helping return astronauts safely to Earth." ...
# $ : chr [1:19] "" "\n                                                            The ex-movie mogul is handcuffed and led from cou"| __truncated__ "" "27°C" ...

The returned object is a list.
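
For reference, here is a base R version of the same loop with lapply, plus one way to meet the stated goal of a single document of body text per URL. This is a sketch under assumptions: the object names page and docs are illustrative, and joining paragraphs with paste(collapse = " ") is one arbitrary choice of separator.

    library(rvest)

    # base R equivalent of the map() call above; also returns a list
    page <- lapply(url, function(u) {
      read_html(u) %>%
        html_nodes("p") %>%
        html_text()
    })

    # collapse each page's paragraphs into one string per URL,
    # giving a character vector with one "document" per site
    docs <- vapply(page, paste, character(1), collapse = " ")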

PS: I changed the second url element, because "https://www.cnn.com/" returned NULL for html_nodes("p") %>% html_text().
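
On that note, if some URLs in a longer list fail to load or return no paragraph text, one defensive pattern is purrr::possibly, which substitutes a fallback value instead of aborting the whole loop. A minimal sketch, assuming NA_character_ is an acceptable placeholder for failed pages (the helper name safe_scrape is made up for illustration):

    library(rvest)
    library(purrr)

    # wrap the scraping step so a broken URL yields NA instead of an error
    safe_scrape <- possibly(
      function(u) read_html(u) %>% html_nodes("p") %>% html_text(),
      otherwise = NA_character_
    )

    page <- map(url, safe_scrape)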