How can I time scraping news stories from a list of URLs with R?
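I am trying to use R to download the text of newspaper articles for text analysis. I have a large list of URLs for individual articles and want to use rvest to extract each article's text and title and convert them into a data frame.
For example, a subset of my data contains articles from The Guardian: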
> items$link[1:8]
[1] "https://www.theguardian.com/uk-news/2019/nov/16/concerns-raised-cladding-bolton-student-building-fire"
[2] "https://www.theguardian.com/uk-news/2019/nov/16/top-lawyer-calls-prince-andrew-bbc-interview-catastrophic-error"
[3] "https://www.theguardian.com/politics/live/2019/nov/16/general-election-labour-meet-decide-manifesto-clause-v-live-news"
[4] "https://www.theguardian.com/politics/2019/nov/16/priti-patel-block-rescue-british-isis-children"
[5] "https://www.theguardian.com/politics/2019/nov/16/police-assessing-claims-that-tories-offered-peerages-to-brexit-party"
[6] "https://www.theguardian.com/world/2019/nov/16/paris-police-fire-teargas-on-anniversary-of-gilets-jaunes-protests"
[7] "https://www.theguardian.com/us-news/2019/nov/16/trump-personally-kept-pressure-ukraine-impeachment-inquiry-witness-david-holmes-diplomat"
[8] "https://www.theguardian.com/world/2019/nov/16/hong-kong-chinese-troops-deployed-to-help-clear-roadblocks"
My code so far is:
## SETUP ##
rm(list=ls())
library(tidyverse)
library(rvest)
library(stringr)
library(readtext)
library(quanteda)
library(beepr)
setwd("uk")
## Functions ##
parse_texts <- function(nod){
  page <- read_html(as.character(nod)) # download and parse the page once, not once per field
  body <- page %>%
    html_nodes('.js-article__body > p') %>% # collects all paragraphs in the article
    html_text() %>%
    str_squish()
  one_body <- paste(body, collapse = " ") # puts all of the text together
  data.frame(title = page %>%
               html_node('.content__headline') %>%
               html_text() %>%
               str_squish(),
             date_time = page %>%
               html_node('.content__dateline-wpd--modified') %>%
               html_text() %>%
               str_squish(),
             text = one_body,
             stringsAsFactors = FALSE)
}
#extract file text
test_df <- lapply(items$link[1:5], parse_texts) %>% bind_rows()
This works most of the time. My problem is that I have thousands of URLs in my data. How can I automate a script that works through this list slowly?
Thanks to Dave2e for answering the question.
I added Sys.sleep(2) to the parse_texts function and was able to get through my list of URLs.
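For anyone who would rather keep the pause outside parse_texts, here is a minimal sketch of a wrapper that does the waiting and also skips URLs that fail instead of stopping the whole run. The name scrape_slowly and its arguments are my own and hypothetical, not part of rvest; the 2-second default just mirrors the Sys.sleep(2) above and is an assumption, not a recommended rate for any particular site.

```r
# Hypothetical helper (not part of rvest): applies `parse_fn` to each URL,
# waits `delay` seconds between requests, and skips URLs that raise an error.
scrape_slowly <- function(urls, parse_fn, delay = 2) {
  results <- lapply(seq_along(urls), function(i) {
    if (i > 1) Sys.sleep(delay)  # pause before every request except the first
    tryCatch(
      parse_fn(urls[[i]]),
      error = function(e) {
        message("Skipping ", urls[[i]], ": ", conditionMessage(e))
        NULL  # a failed URL contributes no row
      }
    )
  })
  results <- Filter(Negate(is.null), results)  # drop the failed URLs
  do.call(rbind, results)                      # one data frame, one row per article
}

# Usage with the objects from the question (assumed to exist in the session):
# all_df <- scrape_slowly(items$link, parse_texts, delay = 2)
```

With the wait in the wrapper, parse_texts stays a plain "parse one page" function, and a single dead link only produces a message rather than losing the rows already collected.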