How can I time scraping news stories from a list of URLs with R?

Question

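I am trying to use R to download the text of newspaper articles for text analysis. I have a large list of URLs for individual articles and want to use rvest to extract each article's text and title and convert them into a data frame.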

For example, a subset of my dataset contains articles from The Guardian:

> items$link[1:8]

[1] "https://www.theguardian.com/uk-news/2019/nov/16/concerns-raised-cladding-bolton-student-building-fire"                                   
[2] "https://www.theguardian.com/uk-news/2019/nov/16/top-lawyer-calls-prince-andrew-bbc-interview-catastrophic-error"                         
[3] "https://www.theguardian.com/politics/live/2019/nov/16/general-election-labour-meet-decide-manifesto-clause-v-live-news"                  
[4] "https://www.theguardian.com/politics/2019/nov/16/priti-patel-block-rescue-british-isis-children"                                         
[5] "https://www.theguardian.com/politics/2019/nov/16/police-assessing-claims-that-tories-offered-peerages-to-brexit-party"                   
[6] "https://www.theguardian.com/world/2019/nov/16/paris-police-fire-teargas-on-anniversary-of-gilets-jaunes-protests"                        
[7] "https://www.theguardian.com/us-news/2019/nov/16/trump-personally-kept-pressure-ukraine-impeachment-inquiry-witness-david-holmes-diplomat"
[8] "https://www.theguardian.com/world/2019/nov/16/hong-kong-chinese-troops-deployed-to-help-clear-roadblocks"

My code so far is:

## SETUP ##
rm(list=ls())
library(tidyverse)
library(rvest)
library(stringr)
library(readtext)
library(quanteda)
library(beepr)

setwd("uk")

## Functions ##
parse_texts <- function(nod){
  page <- read_html(as.character(nod)) # download and parse the page once
  body <- page %>%
    html_nodes('.js-article__body > p') %>% # collects all paragraphs in the article
    html_text() %>%
    str_squish()
  one_body <- paste(body, collapse = " ") # puts all of the text together
  data.frame(title = str_squish(page %>%
                                  html_node('.content__headline') %>%
                                  html_text()),
             date_time = str_squish(page %>%
                                      html_node('.content__dateline-wpd--modified') %>%
                                      html_text()),
             text = one_body,
             stringsAsFactors = FALSE)
}

#extract file text
test_df <- lapply(items$link[1:5], parse_texts) %>% bind_rows()

This works in most cases. My problem is that my data contains thousands of URLs. How can I automate a script that works through this list slowly?

Answer 1

Thanks to Dave2e for answering the question.

I added Sys.sleep(2) to the parse_texts function and was able to work through my list of URLs.
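An alternative to editing parse_texts itself is to put the pause in a small wrapper around the loop. The following is only a sketch: the name scrape_slowly and its arguments are hypothetical and not part of rvest or of the original code. It uses base R only, pauses after each request, and skips any URL whose parse raises an error, so a single bad page does not stop a run over thousands of links.

```r
# Hypothetical helper (not from the original post): applies `parse_fn` to each
# URL in turn, pausing `delay` seconds after every request and skipping URLs
# whose parse raises an error.
scrape_slowly <- function(urls, parse_fn, delay = 2) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    result <- tryCatch(
      parse_fn(urls[[i]]),
      error = function(e) {
        message("Skipping ", urls[[i]], ": ", conditionMessage(e))
        NULL # failed URLs are filtered out below
      }
    )
    # Assigning NULL with [[<- would shrink the list, so only store successes
    if (!is.null(result)) results[[i]] <- result
    Sys.sleep(delay) # wait before the next request
  }
  # Drop failed entries and stack the remaining one-row data frames
  do.call(rbind, Filter(Negate(is.null), results))
}

# Usage with the function from the question (assumes `items` and `parse_texts` exist):
# all_df <- scrape_slowly(items$link, parse_texts, delay = 2)
```

Keeping the delay as an argument makes it easy to use a longer pause for the full list and a zero delay when testing on a handful of URLs.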
