Web scraping with rvest: replace missing values of html_nodes with NA
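I am scraping this page to get, for each person listed, (1) the name, (2) the role/editorial title, and (3) the institutional affiliation.
The problem is that some people have no institutional affiliation. I would like to replace these missing values with NA, but none of my attempts have worked.
Thank you very much for your help! Here is my code so far: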
journal_url <- "https://www.journals.elsevier.com/academic-pediatrics/editorial-board"
webpage <- xml2::read_html(journal_url)
webpage <- rvest::html_nodes(webpage, "div.publication-editors")
editorsnodes <- rvest::html_children(webpage)
titlesnodesnum <- which(rvest::html_attr(editorsnodes, "class") == "publication-editor-type")
titles <- editorsnodes[titlesnodesnum]
titles <- rvest::html_text(titles)
titles <- trimws(titles)
titlesnodesnum <- c(titlesnodesnum, length(editorsnodes)+1) #identify the last record
editors <- lapply(2:length(titlesnodesnum), function(n){
  start <- titlesnodesnum[n-1]+1 #starting node in subcategory
  end <- titlesnodesnum[n]-1 #ending node in subcategory
  names <- editorsnodes[start:end]
  names <- rvest::html_nodes(names, "div.publication-editor-name")
  names <- rvest::html_text(names)
  names <- trimws(names)
})
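My main attempt was to insert a for loop into the editors <- lapply([...]) part, with something like if(length(names) == 0) names <- NA, but it had no effect.
P.S. My data structure may look complicated, but I need to keep the nested list structure for this purpose (for context, see my previous post, which is also where most of this code comes from).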
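The subroutine below extracts the affiliations (if any) of the people listed on the web page. For those without an affiliation, the code inserts an NA. One of the problems with your code is that names never grabs the node "span.publication-editor-affiliation". I also use is_empty() to detect whether no affiliation is listed.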
affiliations <- lapply(2:length(titlesnodesnum), function(n){
  start <- titlesnodesnum[n-1]+1 #starting node in subcategory
  end <- titlesnodesnum[n]-1 #ending node in subcategory
  affiliations <- editorsnodes[start:end]
  affiliations <- rvest::html_nodes(affiliations, "span.publication-editor-affiliation")
  affiliations <- rvest::html_text(affiliations)
  #NA is inserted only when the whole subcategory has no affiliation at all
  if (purrr::is_empty(affiliations)) affiliations <- NA
  trimws(affiliations)
})
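The subroutine above tests for emptiness once per subcategory, so a single person without an affiliation inside an otherwise populated group still drops out of the result. As a hedged sketch of a per-person variant: html_node() (singular) returns one result per input node and a missing node where nothing matches, which html_text() turns into NA. The HTML below is made up to imitate the class names used in the question; the structure of the real page is an assumption.

```r
# Made-up HTML imitating the class names from the question (assumed structure)
toy <- xml2::read_html('
<div class="publication-editors">
  <div class="publication-editor">
    <div class="publication-editor-name">Editor A</div>
    <span class="publication-editor-affiliation">University X</span>
  </div>
  <div class="publication-editor">
    <div class="publication-editor-name">Editor B</div>
  </div>
  <div class="publication-editor">
    <div class="publication-editor-name">Editor C</div>
    <span class="publication-editor-affiliation">Hospital Y</span>
  </div>
</div>')

# One node per person
people <- rvest::html_nodes(toy, "div.publication-editor")

# html_nodes() flattens all matches, so the person without an affiliation vanishes
flat <- rvest::html_text(
  rvest::html_nodes(people, "span.publication-editor-affiliation")
)

# html_node() keeps one slot per person and yields NA where nothing matches
per_person <- rvest::html_text(
  rvest::html_node(people, "span.publication-editor-affiliation")
)

length(flat)   # shorter than the number of people
per_person     # same length as people, with NA for Editor B
```

Applied inside the lapply() from the question, the same call on editorsnodes[start:end] would keep names and affiliations aligned, provided each person really sits in its own div.publication-editor node.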
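I found a solution. I used toString to turn the xml nodeset into a single string, extracted every <div class="publication-editor"> block, and checked whether each one contains a <span class="publication-editor-affiliation">; when one does not, the combination of lapply and str_extract yields NA.
Here is the code, for the record.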
affiliations <- lapply(2:length(titlesnodesnum), function(n){
  start <- titlesnodesnum[n-1]+1 #starting node in subcategory
  end <- titlesnodesnum[n]-1 #ending node in subcategory
  affiliations <- toString(editorsnodes[start:end])
  #backslashes must be doubled inside an R string, hence [\\S\\s]
  affiliations <- stringr::str_extract_all(affiliations, "(?<=<div class=\"publication-editor\")[\\S\\s]*?(?=<div class=\"clearfix\">)")
  affiliations <- lapply(affiliations, function(x) stringr::str_extract(x, "(?<=<span class=\"publication-editor-affiliation\" itemprop=\"affiliation\">).*?(?=</span>)"))
})
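The NA in this approach comes from str_extract() itself: it is vectorised over its input and returns NA for every string that has no match. A minimal sketch of that behaviour, using two made-up HTML snippets (one per person) rather than anything taken from the real page:

```r
# Made-up per-person HTML strings; the second person has no affiliation span
blocks <- c(
  "<div class=\"publication-editor-name\">Editor A</div><span class=\"publication-editor-affiliation\" itemprop=\"affiliation\">University X</span>",
  "<div class=\"publication-editor-name\">Editor B</div>"
)

# Same lookbehind/lookahead pattern as in the answer above
pattern <- "(?<=<span class=\"publication-editor-affiliation\" itemprop=\"affiliation\">).*?(?=</span>)"

# One result per input string, NA where the span is absent
extracted <- stringr::str_extract(blocks, pattern)
extracted
```

The regex only works as long as the attributes appear in exactly this order and spelling, so it is more brittle than selecting nodes with a CSS selector.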