Web scraping with rvest: replace missing values of html_nodes with NA

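I am scraping this page to get, for each person listed, (1) the name, (2) the role or editorial title, and (3) the institutional affiliation.

The problem is that some people have no institutional affiliation. I would like to replace these missing values with NA, but none of my attempts have worked.

Many thanks for your help! Here is my code so far: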

    journal_url <- "https://www.journals.elsevier.com/academic-pediatrics/editorial-board"
    webpage <- xml2::read_html(journal_url)
    webpage <- rvest::html_nodes(webpage, "div.publication-editors")

    editorsnodes <- rvest::html_children(webpage)

    titlesnodesnum <- which(rvest::html_attr(editorsnodes, "class") == "publication-editor-type")
    titles <- editorsnodes[titlesnodesnum]
    titles <- rvest::html_text(titles)
    titles <- trimws(titles)
    titlesnodesnum <- c(titlesnodesnum, length(editorsnodes)+1) #identify the last record

    editors <- lapply(2:length(titlesnodesnum), function(n){
      start <- titlesnodesnum[n-1]+1  #starting node in subcategory
      end <- titlesnodesnum[n]-1      #ending node in subcategory
      names <- editorsnodes[start:end]
      names <- rvest::html_nodes(names, "div.publication-editor-name")
      names <- rvest::html_text(names)
      names <- trimws(names)          #last value is returned to lapply
    })

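My main attempt was to insert a for loop in the editors <- lapply([...]) part, along the lines of if(length(names) == 0) names <- NA, but it had no effect.

P.S. My data structure may look complicated, but I need to keep the nested list structure (for background, see my previous post, which is also where most of this code comes from).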

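The subroutine below extracts the affiliations of the people listed on the web page, where there are any. For those without an affiliation, the code inserts NA. One problem with your code is that names never selects the "span.publication-editor-affiliation" node. I also use is_empty() to detect when no affiliation is listed.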

    affiliations <- lapply(2:length(titlesnodesnum), function(n){
      start <- titlesnodesnum[n-1]+1  #starting node in subcategory
      end <- titlesnodesnum[n]-1      #ending node in subcategory
      affiliations <- editorsnodes[start:end]
      affiliations <- rvest::html_nodes(affiliations, "span.publication-editor-affiliation")
      affiliations <- rvest::html_text(affiliations)
      #no affiliation found in this subcategory: return NA instead of character(0)
      if (purrr::is_empty(affiliations)) affiliations <- NA_character_
      trimws(affiliations)
    })
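As far as I can tell, the is_empty() check only fires when a whole subcategory has no affiliations, because html_nodes() drops every person without a match instead of leaving a gap. A minimal sketch of a per-person alternative follows: html_node() (singular) returns one result per input node, and html_text() turns a missing node into NA. The HTML here is toy markup modeled on the class names in this thread, and the names Alice Example, Bob Sample, and Example University are made up; the real page may be structured differently.

```r
# Toy markup (assumption): two editors, the second has no affiliation.
html <- '
<div class="publication-editors">
  <div class="publication-editor">
    <div class="publication-editor-name">Alice Example</div>
    <span class="publication-editor-affiliation">Example University</span>
  </div>
  <div class="publication-editor">
    <div class="publication-editor-name">Bob Sample</div>
  </div>
</div>'

page <- xml2::read_html(html)
people <- rvest::html_nodes(page, "div.publication-editor")

# html_nodes() collects all matches and drops people without one,
# so the result is shorter than `people`.
flat <- rvest::html_text(
  rvest::html_nodes(people, "span.publication-editor-affiliation")
)

# html_node() keeps one slot per person; a missing match becomes NA.
person_names <- trimws(rvest::html_text(
  rvest::html_node(people, "div.publication-editor-name")
))
person_affil <- trimws(rvest::html_text(
  rvest::html_node(people, "span.publication-editor-affiliation")
))

result <- data.frame(
  name = person_names,
  affiliation = person_affil,
  stringsAsFactors = FALSE
)
print(result)
```

To keep the nested list, the same two html_node() calls could presumably be applied to editorsnodes[start:end] inside the existing lapply(). In rvest 1.0 and later the same functions are also available as html_element() and html_elements().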

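I found a solution. I used toString to turn the xml nodeset into a single string, extracted every <div class="publication-editor"> block, and checked whether each one contains a <span class="publication-editor-affiliation">. When it does not, the combination of lapply and str_extract yields NA.

Here is the code, for the record.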

    affiliations <- lapply(2:length(titlesnodesnum), function(n){
      start <- titlesnodesnum[n-1]+1  #starting node in subcategory
      end <- titlesnodesnum[n]-1      #ending node in subcategory
      affiliations <- toString(editorsnodes[start:end])
      #one string per editor block; backslashes must be doubled inside an R string
      affiliations <- stringr::str_extract_all(affiliations, "(?<=<div class=\"publication-editor\")[\\S\\s]*?(?=<div class=\"clearfix\">)")
      #str_extract returns NA for every block without an affiliation span
      affiliations <- lapply(affiliations, function(x) stringr::str_extract(x, "(?<=<span class=\"publication-editor-affiliation\" itemprop=\"affiliation\">).*?(?=</span>)"))
    })
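The reason this yields NA is that stringr::str_extract() is vectorized and returns NA for any input string with no match. A small sketch on two made-up editor blocks (the markup and the name Example University are assumptions for illustration, not taken from the real page):

```r
# Two hypothetical editor blocks: the first has an affiliation span, the second does not.
blocks <- c(
  "<div class=\"publication-editor\"><span class=\"publication-editor-affiliation\" itemprop=\"affiliation\">Example University</span></div>",
  "<div class=\"publication-editor\"><div class=\"publication-editor-name\">Bob Sample</div></div>"
)

# Same affiliation pattern as in the code above.
pattern <- "(?<=<span class=\"publication-editor-affiliation\" itemprop=\"affiliation\">).*?(?=</span>)"

# One result per input string; no match gives NA rather than being dropped.
affil <- stringr::str_extract(blocks, pattern)
print(affil)
```

Because the result always has the same length as the input, each affiliation stays aligned with the person it belongs to.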