尝试使用 foreach 和 %dopar% 进行网络抓取时出现错误消息

Error message trying to web scrape using foreach and %dopar%

我正在尝试从一个包含多个页面的网页中抓取数据(代码旨在抓取所有页面,这些页面通过网站上的"下一页(Next)"按钮分隔)。当我运行以下代码时,收到错误:"Error in summary.connection(connection): invalid connection"。

###(b)

#the data that I selected for this problem is IMDB rankings of 
#actors by current popularity (i.e. how many hits they are getting on 
#their page right now). I scraped the data for the actors' names, the 
#movie title that IMDB identifies them with, and their rank

# Parallel scrape of IMDB's "most popular people" ranking pages.
# FIX 1: %dopar% was used without loading foreach or registering a parallel
#        backend — that is what triggers the reported
#        "Error in summary.connection(connection): invalid connection".
# FIX 2: gsub("\.", ...) is a parse error in R ("\." is an unrecognized
#        escape inside a double-quoted string); the literal dot must be
#        matched with "\\." or removed with fixed = TRUE.
library(foreach)
library(doParallel)

baseUrl <- "https://www.imdb.com/search/name/?gender=male,female&ref_=,%20desc&start="
startTime.5 <- Sys.time()

# Register a cluster backend so %dopar% has workers to dispatch to,
# leaving one core free for the OS/master process.
cl <- parallel::makeCluster(max(1L, parallel::detectCores() - 1L))
registerDoParallel(cl)

# .packages loads rvest on each worker once, replacing the per-iteration
# library(rvest) call of the original version.
dat2 <- foreach(i = 0:122366, .combine = rbind, .packages = "rvest") %dopar% {

  # Each results page shows 50 entries; the start= parameter is 1-based.
  url <- paste0(baseUrl, i * 50 + 1)
  sourceCode <- read_html(url)  # read source of current URL

  # scrape actor/actress name:
  # names sit in <a> tags inside elements with class "lister-item-header"
  actorNodes <- html_nodes(sourceCode, ".lister-item-header")
  actorAreas <- html_nodes(actorNodes, "a")
  actor <- html_text(actorAreas)
  # clean up the name by removing embedded newlines
  actor <- gsub("\n", "", actor)

  # scrape movie name:
  # the "known for" title sits in <a> tags inside ".text-muted.text-small"
  movieNodes <- html_nodes(sourceCode, ".text-muted.text-small")
  movieAreas <- html_nodes(movieNodes, "a")
  movies <- html_text(movieAreas)

  # scrape actor/actress popularity rank (rendered like "1."):
  rankNodes <- html_nodes(sourceCode, ".lister-item-header")
  rankAreas <- html_nodes(rankNodes, ".lister-item-index.unbold.text-primary")
  rank <- html_text(rankAreas)
  # remove the literal trailing period and make the rank numeric
  # (fixed = TRUE treats "." as a plain character, not a regex wildcard)
  rank <- gsub(".", "", rank, fixed = TRUE)
  rank <- as.numeric(rank)

  # one data.frame per page; foreach rbinds them via .combine
  data.frame(actor = actor, movie = movies, rank = rank)

}

# Shut the worker processes down once the scrape has finished.
stopCluster(cl)

startTime.5 <- Sys.time() - startTime.5 # how long did it take to scrape the desired info?
startTime.5
# let's see if it worked:
View(dat2)
dim(dat2)

有什么问题吗?

将您的代码放进一个 lapply 循环、只对前几页运行,对我来说是有效的:

# Sequential version: scrape the first six result pages with lapply and
# stack the per-page data.frames with do.call(rbind, ...).
# FIX: gsub("\.", ...) is a parse error in R ("\." is an unrecognized
#      escape in a double-quoted string); use "\\." to match a literal dot.
library(rvest)
baseUrl <- "https://www.imdb.com/search/name/?gender=male,female&ref_=,%20desc&start="

result <- do.call(rbind, lapply(0:5, function(i) {

  # Each results page shows 50 entries; the start= parameter is 1-based.
  url <- paste0(baseUrl, i * 50 + 1)
  sourceCode <- read_html(url) # read source of current URL

  # scrape actor/actress name:
  # names sit in <a> tags inside elements with class "lister-item-header"
  actorNodes <- html_nodes(sourceCode, ".lister-item-header")
  actorAreas <- html_nodes(actorNodes, "a")
  actor <- html_text(actorAreas)
  # clean up the name by removing embedded newlines
  actor <- gsub("\n", "", actor)

  # scrape movie name:
  # the "known for" title sits in <a> tags inside ".text-muted.text-small"
  movieNodes <- html_nodes(sourceCode, ".text-muted.text-small")
  movieAreas <- html_nodes(movieNodes, "a")
  movies <- html_text(movieAreas)

  # scrape actor/actress popularity rank (rendered like "1."):
  rankNodes <- html_nodes(sourceCode, ".lister-item-header")
  rankAreas <- html_nodes(rankNodes, ".lister-item-index.unbold.text-primary")
  rank <- html_text(rankAreas)
  # remove the literal trailing period and make the rank numeric
  rank <- gsub("\\.", "", rank)
  rank <- as.numeric(rank)

  # one data.frame per page; do.call(rbind, ...) stacks them afterwards
  data.frame(actor = actor, movie = movies, rank = rank)

}))

result