R:手动停止时从循环返回一个值

R : Returning a value from loop when manually stopping it

我正在尝试用 rvest 建立一个数据库。因为我有很多数据要下载,所以我尝试编写几个函数来中断抓取过程并在我离开的地方重新启动它。然而,虽然这些功能或多或少地起作用,但每当我手动中断它们时,我都会丢失输出。有谁知道一个解决方案可以让我在不丢失循环正在构建的数据帧的情况下停止该功能?我很乐意提供任何建议!

我试图从中抓取数据的一些网址:

to_do <- c("https://jobs.51job.com/shenzhen-nsq/116924235.html?s=01&t=0",
          "https://jobs.51job.com/shenzhen-nsq/116923692.html?s=01&t=0",
          "https://jobs.51job.com/shenzhen-nsq/116923628.html?s=01&t=0",
          "https://jobs.51job.com/shenzhen-nsq/116923578.html?s=01&t=0",
          "https://jobs.51job.com/shenzhen-nsq/116920896.html?s=01&t=0")

我创建的函数供下载:

# In order to initiate the dowload
dl_data_start <- function(to_do){
  output <- tibble()
  i = 1
  while (to_do[i] %in% to_do) {
      page <- read_html(to_do[i])
      position <- page %>%
        html_nodes(.,'h1') %>%
        html_text(.)
      resume <- page %>%
        html_nodes(.,'.ltype') %>%
        html_text(.)
      job_offer <- page %>%
        html_nodes(.,'.job_msg') %>%
        html_text(.)
      eps <- page %>%
        html_nodes(.,'.com_msg') %>%
        html_text(.)
      eps_status <- page %>%
        html_nodes(.,'.at:nth-child(1)') %>%
        html_text(.)
      eps_description <- page %>%
        html_nodes(.,'.tmsg') %>%
        html_text(.)
      employees <- page %>%
        html_nodes(.,'.at:nth-child(2)') %>%
        html_text(.)
      category <- page %>%
        html_nodes(.,'.at:nth-child(3)') %>%
        html_text(.)
      salary <- page %>%
        html_nodes(.,'.cn strong') %>%
        html_text(.)
      url <- to_do[i]
      id <- i
      current <- tibble(position,resume,job_offer,eps,eps_description,eps_status,
                        employees,category,salary,url,id)
      output <- bind_rows(output,current)
      print(output[i,])
      i = i + 1
  }
  return(output)
}
# function in order to continue the download where I left it
dl_data_continue <- function(to_do,df,done){
  i = (match(tail(done,n=1),to_do) + 1)
  while (to_do[i] %in% to_do) {
    page <- read_html(to_do[i])
    position <- page %>%
      html_nodes(.,'h1') %>%
      html_text(.)
    resume <- page %>%
      html_nodes(.,'.ltype') %>%
      html_text(.)
    job_offer <- page %>%
      html_nodes(.,'.job_msg') %>%
      html_text(.)
    eps <- page %>%
      html_nodes(.,'.com_msg') %>%
      html_text(.)
    eps_status <- page %>%
      html_nodes(.,'.at:nth-child(1)') %>%
      html_text(.)
    eps_description <- page %>%
      html_nodes(.,'.tmsg') %>%
      html_text(.)
    employees <- page %>%
      html_nodes(.,'.at:nth-child(2)') %>%
      html_text(.)
    category <- page %>%
      html_nodes(.,'.at:nth-child(3)') %>%
      html_text(.)
    salary <- page %>%
      html_nodes(.,'.cn strong') %>%
      html_text(.)
    url <- to_do[i]
    id <- i
    current <- tibble(position,resume,job_offer,eps,eps_description,eps_status,
                      employees,category,salary,url,id)
    df <- bind_rows(df,current)
    print(df[i,])
    i = i + 1
  }
  return(df)
}

我遇到的问题是,每当我中断循环或发生错误时,我都会丢失所有数据。有人能解决这个问题吗?我尝试了几件事,例如 safely 或 tryCatch,但我无法理解这里出了什么问题。 非常感谢。

编辑: 我还尝试使用 tryCatch。使用下面的函数,代码不会在遇到问题时中断(例如 HTTP 404 错误)。但是,当出现错误时,循环将停留在有问题的迭代中,因此我必须使用它。

dl_data_continue_2 <- function(to_do,df,done){
  i = (match(tail(done,n=1),to_do) + 1)
  while (to_do[i] %in% to_do) {
    tryCatch(
      {expr =
        page <- read_html(to_do[i])
      position <- page %>%
        html_nodes(.,'h1') %>%
        html_text(.)
      resume <- page %>%
        html_nodes(.,'.ltype') %>%
        html_text(.)
      job_offer <- page %>%
        html_nodes(.,'.job_msg') %>%
        html_text(.)
      eps <- page %>%
        html_nodes(.,'.com_msg') %>%
        html_text(.)
      eps_status <- page %>%
        html_nodes(.,'.at:nth-child(1)') %>%
        html_text(.)
      eps_description <- page %>%
        html_nodes(.,'.tmsg') %>%
        html_text(.)
      employees <- page %>%
        html_nodes(.,'.at:nth-child(2)') %>%
        html_text(.)
      category <- page %>%
        html_nodes(.,'.at:nth-child(3)') %>%
        html_text(.)
      salary <- page %>%
        html_nodes(.,'.cn strong') %>%
        html_text(.)
      url <- to_do[i]
      id <- i
      current <- tibble(position,resume,job_offer,eps,eps_description,eps_status,
                        employees,category,salary,url,id)
      df <- bind_rows(df,current)
      print(df[i,])
      i = i + 1},
      error = function(e){
        message("* Caught an error on itertion ")
        print(e)
        i = i + 1
      }
    )
  }
  out
}

安全使用,我基本都试过了

library(purrr)
dl_safely <- safely(dl_data_continue)

我在网页抓取中经常遇到这个问题。关键是将中间结果存储在一个环境中,如果您的函数抛出错误,则可以访问它们。明显的地方是全球环境,但这取决于您如何使用您的功能。如果它是包的一部分,那么您不想写入全局工作区。在这种情况下,您可以将 "storage" 环境作为包的一部分。

也许最简洁的方法是在循环完成后删除中间对象,这样它只会在循环抛出错误时可见/可访问。

这里有一个函数演示原理:

write_data_frames <- function(n)
{
  if(!exists("temporary", .GlobalEnv))
  {
    assign("temporary", list(), envir = globalenv())
    i <- 1
  }
  else
  {
    i <- length(.GlobalEnv$temporary) + 1
  }

  while(i <= n)
  {
    # This is the block where you do your web scraping and store the result
    .GlobalEnv$temporary[[i]] <- data.frame(var1 = rnorm(1), var2 = runif(1))

    # We'll create an error when i == 4
    if(i == 4) stop("Something broke!")
    i <- i + 1
  }
  result <- do.call(rbind, temporary)
  rm("temporary", envir = globalenv())
  return(result)
}

现在,如果我要求 3 行,这应该 return 一个不错的数据框:

write_data_frames(3)
#>         var1      var2
#> 1 -1.6428100 0.1976913
#> 2  0.7136643 0.9684348
#> 3 -0.4845004 0.0294557

它没有在我们的全局工作区中留下任何东西:

ls()
#> [1] "write_data_frames"

但是假设我要求十行:在这里,它会在第四个循环中抛出错误:

write_data_frames(10)
#> Error in write_data_frames(10) : Something broke!

然而,这次对象 temporary 对我可用:

ls()
#> [1] "temporary"         "write_data_frames"

temporary
#> [[1]]
#>       var1      var2
#> 1 -1.46648 0.1748874
#> 
#> [[2]]
#>          var1      var2
#> 1 -0.03855686 0.5772731
#> 
#> [[3]]
#>        var1      var2
#> 1 0.8228591 0.4115181
#> 
#> [[4]]
#>        var1      var2
#> 1 0.9183934 0.2732575

更妙的是,我的功能设计为从中断的地方继续,所以如果我再次这样做

write_data_frames(10)
#>           var1      var2
#> 1  -1.46647987 0.1748874
#> 2  -0.03855686 0.5772731
#> 3   0.82285907 0.4115181
#> 4   0.91839339 0.2732575
#> 5   0.54850658 0.9946303
#> 6  -1.39917426 0.9948544
#> 7   0.39525152 0.9234611
#> 8  -1.05899076 0.6226182
#> 9  -2.03137464 0.1218762
#> 10  0.24880216 0.6631982

函数从位置5重新开始,没有任何修改。现在,当我们检查我们的全局工作区时,什么都没有了:

ls()
#> [1] "write_data_frames"