How to skip webpages during web scraping with rvest

Question

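I am trying to collect information using the rvest package in R. While collecting data in a for loop, I found that some pages contain no information, which produces an error: Error in open.connection(x, "rb") : HTTP error 404.

Here is my R code. Pages 15138 and 15140 do contain information, but 15139 does not. How can I skip 15139 within the for loop?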

library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(stringi)



source_url <- "https://go2senkyo.com/local/senkyo/"
senkyo <- data.frame()
for (i in 15138:15140) {
  Sys.sleep(0.5)
  target_page <- paste0(source_url, i)
  recall_html <- read_html(target_page, encoding = "UTF-8")

  prefecture <- recall_html %>%
    html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
    html_text()

  city <- recall_html %>%
    html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl", " " ))]') %>%
    html_text()
  city <- trimws(gsub("[\r\n]", "", city))

  senkyo2 <- cbind(prefecture, city)
  senkyo  <- rbind(senkyo, senkyo2)
}

Looking forward to your answers!

Answer 1

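You can handle the exception in a few different ways. I'm a noob when it comes to scraping, but here are some options that fit your situation.

Tailor your loop range

If you know you don't need the value 15139, you can leave it out of the vector you loop over, for example: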

for (i in c(15138,15140)) {

When your loop runs, it will skip 15139 entirely.
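If several IDs are known to be missing, listing the survivors by hand gets tedious. A minimal sketch, assuming the bad IDs are known in advance (the names all_ids and missing_ids are hypothetical), builds the loop range with base R's setdiff():

```r
# Full range of page IDs from the question
all_ids <- 15138:15140

# Hypothetical: IDs already known to return HTTP 404
missing_ids <- c(15139)

# Keep only the IDs that are not in missing_ids
ids <- setdiff(all_ids, missing_ids)

for (i in ids) {
  print(i)  # placeholder for the scraping code
}
```

This only helps when the missing pages are known before the loop starts; it does nothing for pages that fail unexpectedly.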

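Add control flow

This is essentially the same as tailoring the loop range, except that the exception is handled inside the loop itself, for example: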

for (i in 15138:15140) {
  Sys.sleep(0.5)
  # control statement
  if (i == 15139) {
    next # moves to the next iteration of the loop, in this case 15140
  }
  target_page <- paste0(source_url, i) # not run if i == 15139, since the loop skipped to the next iteration
  # ... rest of the loop body as in the question ...
}

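Condition-handling tools

This is where I get out of my depth, and I keep referring back to Advanced R. In essence, you can wrap a function such as try() around the code that might raise an error. This keeps an error from breaking your loop and gives you the flexibility to decide what to do when the code fails in a particular way.

What I usually do is add something like this to the code: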

# wrap the part of your code that can break in try()
recall_html <- try(read_html(target_page, encoding = "UTF-8"))
# the error message is still printed unless you set silent = TRUE, but it no longer stops your code
# you still need control flow to keep the loop from breaking at the next function, however
# inherits() is used because class() can return more than one value
if (inherits(recall_html, "try-error")) {
  next
} else {
  prefecture <- recall_html %>%
    html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
    html_text()
  # ... rest of the loop body as in the question ...
}
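For completeness, the same idea can be written with tryCatch() instead of try(). This is only a sketch, not the approach used above: the helper names read_page_or_null and scrape_ids and the pause argument are hypothetical and not part of rvest. The selectors and the cbind()/rbind() accumulation are copied from the question.

```r
library(rvest)

# Hypothetical helper: returns NULL instead of stopping when a page
# cannot be read (for example on HTTP error 404).
read_page_or_null <- function(url) {
  tryCatch(
    read_html(url, encoding = "UTF-8"),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Hypothetical wrapper around the loop from the question.
scrape_ids <- function(ids, base_url, pause = 0.5) {
  rows <- list()
  for (i in ids) {
    Sys.sleep(pause)
    page <- read_page_or_null(paste0(base_url, i))
    if (is.null(page)) next  # page could not be read, move on

    prefecture <- page %>%
      html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
      html_text()
    city <- page %>%
      html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl", " " ))]') %>%
      html_text()
    city <- trimws(gsub("[\r\n]", "", city))

    # Same cbind() accumulation as in the question
    rows[[length(rows) + 1]] <- cbind(prefecture, city)
  }
  if (length(rows) == 0) return(data.frame())
  as.data.frame(do.call(rbind, rows))
}
```

It could then be called as scrape_ids(15138:15140, "https://go2senkyo.com/local/senkyo/"), which should print a message for 15139 and carry on rather than stop. Collecting the rows in a list and binding them once at the end avoids growing a data frame inside the loop.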
