将嵌套 for 循环转换为函数

Question

我写了一个嵌套的 for 循环，想把它转换成一个函数。这是我目前拥有的代码：

## Sleeper function
testit <- function(x)
{
  p1 <- proc.time()
  Sys.sleep(x)
  proc.time() - p1 # The cpu usage should be negligible
}

## Set up null objects to fill in loop
episodes = NULL
l = NULL

## Scrape pages and append together
for (season in c(27:38)){
  for (episode in c(1:13)){
    tryCatch({
      if (season == 34 & episode == 3){
        l = 'l'
      }
      new_episode <- read_html(paste0('http://www.chakoteya.net/DoctorWho/', season, '-', episode, '.htm', l)) %>% 
      html_nodes("p") %>%
      html_text() %>% 
      tibble(value = .)
      episodes <- episodes %>% bind_rows(new_episode)
      cat(paste('\rseason = ', season, '; episode = ', episode))
      testit(2)
    }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
  }
}

如您所见，它在第 27-38 季和每个季的第 1-13 集中循环播放。有时一季只有 12 集，因此 tryCatch()。而且，由于我不明白的原因，有时 URL 需要一个 .html 后缀（如果季节 >= 34 和剧集 >= 3），有时它需要一个 .htm 后缀，因此if (season == 34 & episode == 3) 语句。

我想将它转换成一个函数，可能使用 apply() 或 map()，但我的函数技能仍然很初级，我正在努力。

作为最终输出，调用一个名为 doctor_who() 的函数会很棒：

episodes <- doctor_who(season = c(27:38), episode = c(1:13))

Answer 1

是这样的吗？

library(dplyr)
library(xml2)

testit <- function(x) {
  p1 <- proc.time()
  Sys.sleep(x)
  proc.time() - p1 # The cpu usage should be negligible
}

doctor_who <- function(season = 27:38, episode = 1:13){
  ## Set up null objects to fill in loop
  episodes = NULL
  l = NULL
  
  res <- vector("list", length = length(season)* length(episodes))
  inx <- 0L
  ## Scrape pages and append together
  for (s in season){
    for (e in episode){
      tryCatch({
        if (s == 34 && e == 3){
          l = 'l'
        }
        URL <- paste0('http://www.chakoteya.net/DoctorWho/', s, '-', e, '.htm', l)
        new_episode <- read_html(URL, encoding = "shift_jis") %>% 
          html_nodes("p") %>%
          html_text() %>% 
          tibble(value = .)
        episodes <- episodes %>% bind_rows(new_episode)
        message(paste('season = ', s, '; episode = ', e))
        testit(2)
        inx <- inx + 1L
        res[[inx]] <- episodes
      }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
    }
  }
  res
}

episodes <- doctor_who(season = 27:38, episode = 1:13)

# how many were read
length(episodes)
#[1] 137

# how many were expected
length(27:38)*length(1:13)
#[1] 156

Answer 2

我会把它分成两个函数-

第一个函数从一个文件中读取数据，第二个函数创建每个季节和剧集的所有 url 并使用 map_df 一个一个地传递它们。

在第一个函数中，我们首先尝试从 .htm 读取数据，如果失败则从 .html.

读取数据

read_text <- function(url) {
  val <- tryCatch({
    read_html(url) %>%
      html_nodes("p") %>%
      html_text() %>% 
      tibble(value = .)
  }, error = function(e) return(NA))
  
  if(NROW(val) == 1) {
    url <- paste0(url, 'l')
    val <- tryCatch({
      read_html(url) %>%
        html_nodes("p") %>%
        html_text() %>% 
        tibble(value = .)
    }, error = function(e) return(NA))
  }

  if(NROW(val) > 1) val else NULL
}


doctor_who <- function(season, episode) {
  all_urls <- sprintf('http://www.chakoteya.net/DoctorWho/%s.htm', 
                      c(t(outer(season, episode, paste, sep = '-'))))
  purrr::map_df(all_urls, read_text, .id = 'id')
}

res <- doctor_who(season = 1:5, episode = 1:5)

我添加了一个 id 列来区分每一集。

#   id    value                                                                      
#   <chr> <chr>                                                                      
# 1 1     " \r\nOriginal Airdate: 23 Nov, 1963"                                      
# 2 1     " [Coal Hill School corridor] "                                            
# 3 1     " (The bell is ringing for end of classes.)\r\nGIRL: Night, Miss Wright. \…
# 4 1     " [Laboratory] "                                                           
# 5 1     " (A man is tidying up after the class) \r\nIAN: Oh? Not gone yet? \r\nBAR…
# 6 1     " [Classroom] "                                                            
# 7 1     " (Susan is listening to guitar rock music on her\r\ntransistor radio. I'm…
# 8 1     " [Totter's Lane] "                                                        
# 9 1     " (Ian and Barbara are parked up.) \r\nBARBARA: Over there. \r\nIAN: We're…
#10 1     " [Memory - classroom] "                                                   
# … with 4,446 more rows

将嵌套 for 循环转换为函数

Convert nested for loop to a function

for-loop

r

function

web-scraping

rvest