Rvest: Building a queue for several html-links
I am currently web-scraping a news magazine, but unfortunately I have no idea how to build a working queue. So far I can only scrape the content of all articles on one page, but I would like a queue that automatically does the same for the remaining articles.
library(rvest)
library(tidyverse)
library(data.table)
library(plyr)
library(writexl)

# scrape the title and time of every entry on the first results page
map_dfc(.x = c("em.entrylist__title", "time.entrylist__time"),
        .f = function(x) {
          read_html("https://www.sueddeutsche.de/news/page/1?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020") %>%
            html_nodes(x) %>%
            html_text()
        }) %>%
  # add the article links from the same page
  bind_cols(url = read_html("https://www.sueddeutsche.de/news/page/1?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020") %>%
              html_nodes("a.entrylist__link") %>%
              html_attr("href")) %>%
  setNames(nm = c("title", "time", "url")) -> temp

# download the text of the first 50 articles, one list element per article
map_df(.x = temp$url[1:50],
       .f = function(x) {
         tibble(url = x,
                text = read_html(x) %>%
                  html_nodes("#article-app-container > article > div.css-isuemq.e1lg1pmy0 > p:nth-child(n)") %>%
                  html_text() %>%
                  list
         )
       }) %>%
  unnest(text) -> foo
foo

# collapse the paragraphs of each article into one string and join with the metadata
X2 <- ddply(foo, .(url), summarize,
            Xc = paste(text, collapse = ","))
final <- merge(temp, X2, by = "url")
In this case I get 30 pages of articles, but my script only supports scraping one page.
The only thing that changes between the pages is the page number (https://www.sueddeutsche.de/news/**page/1**?search=...).
I would be grateful for a hint on how to include all pages in a queue at once. Thanks a lot :)
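For illustration, the URL pattern above can be expanded into the full list of result pages; a minimal sketch, assuming the 30 pages mentioned above:

library(rvest)

# hypothetical helper vector: one URL per results page, only the page number varies
page_urls <- paste0("https://www.sueddeutsche.de/news/page/", 1:30,
                    "?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020")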
Would a queue in the form of a dataframe work for you?
The following suggestion is a bit more generic, so it works beyond this specific use case. You can add more URLs to scrape at any time, but thanks to dplyr::distinct only new ones are kept.
(I have initialized the queue with the first 5 pages you want to scrape; you can add more right away, or dynamically whenever you find links in the DOM...)
library(dplyr)
library(lubridate)

# queue of page URLs; scraped_time stays NA until a page has been processed
queue <- tibble(
  url = paste0("https://www.sueddeutsche.de/news/page/", 1:5, "?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020"),
  scraped_time = lubridate::NA_POSIXct_
)

results <- list()

# keep going as long as there are rows that have not been scraped yet
while (length(open_rows <- which(is.na(queue$scraped_time))) > 0) {
  i   <- open_rows[1]
  url <- queue$url[i]

  [...]

  results[[url]] <- <YOUR SCRAPING RESULT>
  queue$scraped_time[i] <- lubridate::now()

  # optionally enqueue more URLs; duplicates are dropped by distinct()
  if (<MORE PAGES TO QUEUE>) {
    queue <- queue %>%
      tibble::add_row(url = c('www.spiegel.de', 'www.faz.de')) %>%
      arrange(desc(scraped_time)) %>%
      distinct(url, .keep_all = T)
  }
}
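To connect this back to the question, here is a minimal sketch of how the loop body could be filled in. It reuses the selectors and the URL from the question; the 30-page limit is an assumption taken from the question, and parsing the page number out of the URL is just one possible way to decide when to stop queueing:

library(dplyr)
library(rvest)
library(lubridate)

base   <- "https://www.sueddeutsche.de/news/page/"
params <- "?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020"

queue <- tibble(
  url = paste0(base, 1:5, params),
  scraped_time = lubridate::NA_POSIXct_
)

results <- list()

while (length(open_rows <- which(is.na(queue$scraped_time))) > 0) {
  i   <- open_rows[1]
  url <- queue$url[i]

  # scrape one overview page: title, time and article link of every entry
  page <- read_html(url)
  results[[url]] <- tibble(
    title = page %>% html_nodes("em.entrylist__title") %>% html_text(),
    time  = page %>% html_nodes("time.entrylist__time") %>% html_text(),
    link  = page %>% html_nodes("a.entrylist__link") %>% html_attr("href")
  )

  queue$scraped_time[i] <- lubridate::now()

  # assumption: 30 result pages in total, so enqueue the next page number
  # until that limit; distinct() drops URLs that are already in the queue
  page_no <- as.integer(sub(".*page/(\\d+)\\?.*", "\\1", url))
  if (page_no < 30) {
    queue <- queue %>%
      tibble::add_row(url = paste0(base, page_no + 1, params)) %>%
      arrange(desc(scraped_time)) %>%
      distinct(url, .keep_all = TRUE)
  }
}

# one row per article across all scraped pages
overview <- bind_rows(results)

The article texts could then be collected in a second pass over overview$link, exactly as in the map_df() step from the question.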