Using tryCatch and rvest to deal with 404 and other crawling errors
When using rvest to retrieve h1 titles, I sometimes run into a 404 page. This stops the process and returns this error:
Error in open.connection(x, "rb") : HTTP error 404.
See the example below:
Data <- data.frame(Pages = c(
  "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
  "http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
  "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html",
  "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facdddebook.html"))
Code used to retrieve the h1:
library(rvest)
sapply(Data$Pages, function(url){
  url %>%
    as.character() %>%
    read_html() %>%
    html_nodes('h1') %>%
    html_text()
})
Is there a way to include an argument to ignore errors and continue the process?
You can see an explanation of this approach in this question here:
urls <- c(
  "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
  "http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
  "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html",
  "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facdddebook.html")
readUrl <- function(url) {
  out <- tryCatch(
    {
      message("This is the 'try' part")
      url %>% as.character() %>% read_html() %>% html_nodes('h1') %>% html_text()
    },
    error = function(cond) {
      message(paste("URL does not seem to exist:", url))
      message("Here's the original error message:")
      message(cond)
      return(NA)  # return NA so the loop can continue past the bad URL
    }
  )
  return(out)
}
y <- lapply(urls, readUrl)
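The result y is a list: a character vector of h1 text for each page that loaded, and NA for each page that failed. As a minimal sketch (not part of the original post), the failed URLs can be picked out like this:
# Each failed URL produced a single NA; select those entries from the list
failed <- urls[sapply(y, function(x) length(x) == 1 && is.na(x))]
failed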
You are looking for try or tryCatch, which are R's mechanisms for handling errors.
With try, you just wrap the thing that might fail in try(), and it will return the error and keep running:
library(rvest)
sapply(Data$Pages, function(url){
  try(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('h1') %>%
      html_text()
  )
})
# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
# [4] "Error in open.connection(x, \"rb\") : HTTP error 404.\n"
However, while this gets everything, it also inserts bad data into our results. tryCatch lets you configure what happens when an error is raised, by passing it a function to run when that condition arises:
sapply(Data$Pages, function(url){
  tryCatch(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('h1') %>%
      html_text(),
    error = function(e){NA} # a function that returns NA regardless of what it's passed
  )
})
# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
# [4] NA
There we go; much better.
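If a bare NA discards too much information, the error handler can instead return the error message itself via conditionMessage(), so the output records why each page failed. A hedged variant of the same pattern:
sapply(Data$Pages, function(url){
  tryCatch(
    url %>% as.character() %>% read_html() %>% html_nodes('h1') %>% html_text(),
    # record why the page failed instead of returning a bare NA
    error = function(e) paste("failed:", conditionMessage(e))
  )
})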
Update
In the tidyverse, the purrr package offers two functions, safely and possibly, which work like try and tryCatch. They are adverbs, not verbs, meaning they take a function, modify it to handle errors, and return a new function (not a data object) which can then be called. Example:
library(tidyverse)
library(rvest)
df <- Data %>% rowwise() %>%                       # Evaluate each row (URL) separately
  mutate(Pages = as.character(Pages),              # Convert factors to character for read_html
         title = possibly(~.x %>% read_html() %>%  # Try to take a URL, read it,
                            html_nodes('h1') %>%   # select header nodes,
                            html_text(),           # and collect text inside.
                          NA)(Pages))              # If error, return NA. Call modified function on URLs.
df %>% select(title)
## Source: local data frame [4 x 1]
## Groups: <by row>
##
## # A tibble: 4 × 1
## title
## <chr>
## 1 'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages
## 2 OMG, this Japanese Trump Commercial is everything
## 3 Omar Mateen posted to Facebook during Orlando mass shooting
## 4 <NA>
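safely() keeps more information than possibly(): the function it returns always yields a list with result and error components, one of which is NULL. A minimal sketch of the same scrape with safely (variable names here are illustrative):
safe_read <- safely(function(url) {
  url %>% read_html() %>% html_nodes('h1') %>% html_text()
})
results <- map(as.character(Data$Pages), safe_read)
titles  <- map(results, "result")  # NULL where the call failed
errors  <- map(results, "error")   # NULL where it succeeded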