Webscraping in R: How to deal with redirects?
Over the past few weeks I have built a web scraper for several Swiss news sites. Most of them work, but one page gives me a lot of redirects and I don't know what to do about it.
Here is my code:
library(rvest)
library(tidyverse)  # attaches dplyr, so a separate library(dplyr) is not needed

api20min <- read_xml('https://api.20min.ch/rss/view/1')

# Pull links, publication dates and titles out of the RSS feed;
# the first two <link>/<title> nodes are channel metadata, not articles
urls_20min <- api20min %>% html_nodes('link') %>% html_text()
urls_20min <- urls_20min[-c(1:2)]
zeit_20min <- api20min %>% html_nodes('pubDate') %>% html_text()
titel_20min <- api20min %>% html_nodes('title') %>% html_text()
titel_20min <- titel_20min[-c(1:2)]
df20min_titel_zeit_link <- data.frame(urls_20min, zeit_20min, titel_20min)

# Fetch each article and collapse its story paragraphs into one string
df20min_text <- do.call(rbind, lapply(urls_20min, function(x) {
  paste0(read_html(x) %>% html_nodes('.story_text p') %>% html_text(), collapse = "\n\n")
}))
df_20min <- data.frame(df20min_titel_zeit_link, df20min_text)
If I run the same code on other pages it works perfectly. In fact, I wrote it last week and it still worked then. But now R tells me:
"Fehler in open.connection(x, "rb") : Maximum (10) redirects followed"
So how can I get around these redirects?
Thanks for your help, you're great!
If you convert the lapply to a loop, you will see that the URL at position 23 of urls_20min is causing the problem. It is a promo link, hence the redirects. You can filter out any URL containing "promo" with grepl and it works fine:
library(rvest)
library(tidyverse)  # attaches dplyr

api20min <- read_xml('https://api.20min.ch/rss/view/1')
urls_20min <- api20min %>% html_nodes('link') %>% html_text()
urls_20min <- urls_20min[-c(1:2)]

# Keep only URLs that do not contain "promo"
no_promo <- !grepl("promo", urls_20min)

zeit_20min <- api20min %>% html_nodes('pubDate') %>% html_text()
titel_20min <- api20min %>% html_nodes('title') %>% html_text()
titel_20min <- titel_20min[-c(1:2)]
df20min_titel_zeit_link <- data.frame(urls_20min, zeit_20min, titel_20min)[no_promo, ]

df20min_text <- do.call(rbind,
                        lapply(urls_20min[no_promo],
                               function(x) {
                                 paste0(read_html(x) %>%
                                          html_nodes('.story_text p') %>%
                                          html_text(),
                                        collapse = "\n\n")
                               }))
df_20min <- data.frame(df20min_titel_zeit_link, df20min_text)
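Filtering on "promo" handles this particular feed, but if the feed later adds other kinds of broken links, a more defensive variant (a sketch, not part of the original answer; scrape_or_na is a hypothetical helper name) wraps each request in tryCatch() so that a failing article yields NA instead of aborting the whole run:

```r
library(rvest)

# Hypothetical helper: fetch one article's text, or NA on any error
# (too many redirects, 404, timeout, ...)
scrape_or_na <- function(x) {
  tryCatch(
    paste0(read_html(x) %>% html_nodes('.story_text p') %>% html_text(),
           collapse = "\n\n"),
    error = function(e) NA_character_
  )
}

# vapply guarantees one character result per URL
df20min_text <- vapply(urls_20min, scrape_or_na, character(1))
```

Rows that failed can then be dropped with is.na(df20min_text) before building df_20min.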
The result is too large to display, but here is its structure:
str(df_20min)
#> 'data.frame': 111 obs. of 4 variables:
#> $ urls_20min : Factor w/ 112 levels "https://beta.20min.ch/story/269082903107?legacy=true",..: 44 78 93 49 81 76 91 70 95 4 ...
#> $ zeit_20min : Factor w/ 111 levels "Fri, 10 Apr 2020 03:00:00 GMT",..: 98 89 61 90 105 99 109 84 82 83 ...
#> $ titel_20min : Factor w/ 112 levels " : Die Bilder des Tages",..: 44 48 55 76 16 20 6 17 112 63 ...
#> $ df20min_text: Factor w/ 87 levels "","\n\n","\n\n(20 Minuten)",..: 62 44 48 71 64 43 83 39 30 1 ...
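A side note on the structure above: every column comes back as a factor because, before R 4.0, data.frame() defaulted to stringsAsFactors = TRUE. If you want plain character columns (usually easier to work with for text data), pass stringsAsFactors = FALSE at each data.frame() call; a minimal sketch against the answer's variables:

```r
# Character columns instead of factors (only needed on R < 4.0,
# where data.frame() defaults to stringsAsFactors = TRUE)
df20min_titel_zeit_link <- data.frame(urls_20min, zeit_20min, titel_20min,
                                      stringsAsFactors = FALSE)[no_promo, ]
df_20min <- data.frame(df20min_titel_zeit_link, df20min_text,
                       stringsAsFactors = FALSE)
```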