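How to skip webpages during web scraping in rvest
I am trying to collect information using the rvest package in R.
While collecting data in a for loop, I found that some pages contain no information, which produces this error: Error in open.connection(x, "rb") : HTTP error 404.
Here is my R code. Pages 15138 and 15140 do have information, but 15139 does not. How can I skip 15139 inside the for loop?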
library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(stringi)
source_url <- "https://go2senkyo.com/local/senkyo/"
senkyo <- data.frame()
for (i in 15138:15140) {
  Sys.sleep(0.5)
  target_page <- paste0(source_url, i)
  recall_html <- read_html(target_page, encoding = "UTF-8")
  prefecture <- recall_html %>%
    html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
    html_text()
  city <- recall_html %>%
    html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl", " " ))]') %>%
    html_text()
  city <- trimws(gsub("[\r\n]", "", city))
  senkyo2 <- cbind(prefecture, city)
  senkyo <- rbind(senkyo, senkyo2)
}
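Looking forward to your answers!
You can handle exceptions in a few different ways. I'm a noob when it comes to scraping, but here are some options that fit your situation.
Customizing your loop range
If you know you don't need the value 15139, you can remove it from the vector of values, for example: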
for (i in c(15138,15140)) {
When you run your loop, it will skip 15139 entirely.
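If more than one page is known to be missing, the same idea can be written with setdiff() so the full range stays visible. This is only a sketch: `missing_pages` and `pages` are hypothetical names, and it assumes you already know which page numbers return a 404.

```r
# Hypothetical vector of page numbers already known to be missing
missing_pages <- c(15139)

# Remove the missing pages from the full range before looping
pages <- setdiff(15138:15140, missing_pages)

for (i in pages) {
  # 15139 never appears here, so no request is made for it
  message("Would scrape page ", i)
}
```

This only helps when the bad pages are known in advance; for pages that fail unexpectedly, the error has to be handled inside the loop.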
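Adding control flow
This is essentially the same as customizing the loop range, but it handles the exception inside the loop itself, for example: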
for (i in 15138:15140) {
  Sys.sleep(0.5)
  # control statement
  if (i == 15139) {
    next # moves to next iteration of loop, in this case 15140
  }
  target_page <- paste0(source_url, i) # not run if i == 15139, since loop skipped to next iteration
  # ... rest of the loop body as in the question
}
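Condition handling tools
This is where I'm out of my depth and keep referring back to Advanced R. Essentially, you can wrap functions like try() around the code that might error. This keeps an error from breaking your loop, and gives you the flexibility to decide what to do when your code fails in a particular way.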
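To see what try() hands back without touching the network, here is a small base R sketch. The stop() call merely stands in for a failing read_html(); the variable names `failed` and `ok` are made up for illustration.

```r
# A call that fails: try() returns an object of class "try-error"
# instead of stopping the script (silent = TRUE hides the message)
failed <- try(stop("HTTP error 404."), silent = TRUE)
print(inherits(failed, "try-error"))

# A call that succeeds: try() simply returns the value
ok <- try(sqrt(16), silent = TRUE)
print(inherits(ok, "try-error"))
print(ok)
```

Checking with inherits() rather than comparing class() directly is the safer habit, since an object can carry more than one class.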
What I usually do is add something like this to your code:
# wrap the part of your code that can break in try()
recall_html <- try(read_html(target_page, encoding = "UTF-8"))
# you'll still see your error, but it won't stop your code; set silent = TRUE to hide the message
# you'll need to add control flow to keep your loop from breaking at the next function, however
# inherits() is used because class() can return more than one value for a parsed page
if (inherits(recall_html, "try-error")) {
  next
} else {
  prefecture <- recall_html %>%
    html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
    html_text()
  # ... rest of the loop body as in the question
}
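An alternative to try() is tryCatch(), which lets you return a value of your choosing when an error occurs. The sketch below is one possible shape, not the only one: `read_page_or_null` is a hypothetical helper, and `fake_reader` is a stand-in that pretends page 15139 returns a 404 so the example runs without network access. In real use you would leave `reader` at its default, rvest::read_html.

```r
# Return the parsed page, or NULL if reading it fails (e.g. HTTP 404).
# `reader` is a parameter only so the helper can be tried offline.
read_page_or_null <- function(url, ..., reader = rvest::read_html) {
  tryCatch(
    reader(url, ...),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in reader for demonstration: fails for page 15139 only
fake_reader <- function(url) {
  if (grepl("15139$", url)) stop("HTTP error 404.")
  paste("parsed", url)
}

source_url <- "https://go2senkyo.com/local/senkyo/"
results <- list()
for (i in 15138:15140) {
  page <- read_page_or_null(paste0(source_url, i), reader = fake_reader)
  if (is.null(page)) next # skip pages that could not be read
  results[[as.character(i)]] <- page
}
print(names(results))
```

In the original loop, the call would become read_page_or_null(target_page, encoding = "UTF-8"), followed by the same is.null() check before the html_nodes() steps.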