R parses incomplete text from webpages (HTML)

Question

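I am trying to parse plain text from multiple scientific articles for subsequent text analysis. So far I have used an R script by Tony Breyal based on the packages RCurl and XML. This works fine for all targeted journals, except for those published by http://www.sciencedirect.com. When I try to parse articles from SD (and this is consistent for all the journals I tested and need to access from SD), the text object in R stores only the first part of the whole document. Unfortunately, I am not very familiar with HTML, but I think the problem lies in the SD HTML code, since the script works in all other cases. I am aware that some journals are not open access, but I have access rights, and the problem also occurs with open-access articles (see the example). This is the code from GitHub: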

htmlToText <- function(input, ...) {
  ###--- PACKAGES ---###
  require(RCurl)
  require(XML)

  ###--- LOCAL FUNCTIONS ---###
  # Determine how to grab html for a single input element
  evaluate_input <- function(input) {
    # if input is a .html file
    if(file.exists(input)) {
      char.vec <- readLines(input, warn = FALSE)
      return(paste(char.vec, collapse = ""))
    }

    # if input is html text
    if(grepl("</html>", input, fixed = TRUE)) return(input)

    # if input is a URL, probably should use a regex here instead?
    if(!grepl(" ", input)) {
      # download SSL certificate in case of https problem
      if(!file.exists("cacert.perm")) download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.perm")
      return(getURL(input, followlocation = TRUE, cainfo = "cacert.perm"))
    }

    # return NULL if none of the conditions above apply
    return(NULL)
  }

  # convert HTML to plain text
  convert_html_to_text <- function(html) {
    doc <- htmlParse(html, asText = TRUE)
    text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
    return(text)
  }

  # format text vector into one character string
  collapse_text <- function(txt) {
    return(paste(txt, collapse = " "))
  }

  ###--- MAIN ---###
  # STEP 1: Evaluate input
  html.list <- lapply(input, evaluate_input)

  # STEP 2: Extract text from HTML
  text.list <- lapply(html.list, convert_html_to_text)

  # STEP 3: Return text
  text.vector <- sapply(text.list, collapse_text)
  return(text.vector)
}
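
The XPath expression in convert_html_to_text() selects every text node except those nested inside script, style, noscript, or form elements. A minimal sketch on a made-up page shows the effect; the HTML string below is hypothetical and not taken from ScienceDirect:

```r
library(XML)

# Hypothetical sample page, built without newlines so that no
# whitespace-only text nodes appear between the elements
html <- paste0(
  "<html><head><style>p { color: red; }</style></head><body>",
  "<p>Visible paragraph.</p>",
  "<script>var hidden = 1;</script>",
  "<form><label>Search</label></form>",
  "</body></html>"
)

doc <- htmlParse(html, asText = TRUE)

# Same filter as in convert_html_to_text()
text <- xpathSApply(
  doc,
  "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]",
  xmlValue
)
text
```

A filter like this can only select text that is present in the HTML string handed to htmlParse(); anything a page loads later through JavaScript never reaches the parser.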

This is my code and the example article:

target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319"
temp.text <- htmlToText(target)

The unformatted text stops somewhere in the Methods section:

DNA was extracted using the MasterPure™ Yeast DNA Purification Kit (Epicentre, Madison, Wisconsin, USA) following the manufacturer's instructions.

Any suggestions or ideas?

P.S. I also tried html_text from rvest, with the same result.
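
One way to narrow this down is to check whether the later sections are present in the raw HTML string at all, before any parsing. If a phrase from a late section is missing from the raw string, the server delivered only part of the article in its initial response and the parser is not at fault. A minimal base R sketch follows; `find_markers` and the sample HTML are hypothetical and stand in for a real download:

```r
# Hypothetical helper: report which marker phrases occur in a raw HTML string
find_markers <- function(html, markers) {
  found <- vapply(
    markers,
    function(m) grepl(m, html, fixed = TRUE),  # literal match, no regex
    logical(1)
  )
  setNames(found, markers)
}

# Stand-in for a partially delivered page (made-up content)
raw_html <- "<html><body><h2>Abstract</h2><p>...</p><h2>Methods</h2><p>...</p></body></html>"

find_markers(raw_html, c("Abstract", "Methods", "Results", "References"))
```

For a real page, raw_html would be the string returned by getURL(target, followlocation = TRUE), as in the script above.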

Answer 1

You can pretty much use your existing code and just add ?np=y to the end of the URL, but this is a bit more compact:

library(rvest)
library(stringi)

target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319?np=y"

pg <- read_html(target)
pg %>%
  html_nodes(xpath=".//div[@id='centerContent']//child::node()/text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]") %>%
  html_text() %>%   # turn the selected text nodes into a character vector
  stri_trim() %>%
  paste0(collapse=" ") %>%
  write(file="output.txt")

Some of the output (the total for that article is >80K):

 Fungal Ecology Volume 22 , August 2016, Pages 61–72        175394|| Species richness 
 influences wine ecosystem function through a dominant species Primrose J. Boynton a , , , 
 Duncan Greig a , b a  Max Planck Institute for Evolutionary Biology, Plön, 24306, Germany 
 b  The Galton Laboratory, Department of Genetics, Evolution, and Environment, University 
 College London, London, WC1E 6BT, UK Received 9 November 2015, Revised 27 March 2016, 
 Accepted 15 April 2016, Available online 1 June 2016 Corresponding editor: Marie Louise
 Davey Abstract Increased species richness does not always cause increased ecosystem function. 
 Instead, richness can influence individual species with positive or negative ecosystem effects. 
 We investigated richness and function in fermenting wine, and found that richness indirectly 
 affects ecosystem function by altering the ecological dominance of Saccharomyces cerevisiae . 
 While S. cerevisiae generally dominates fermentations, it cannot dominate extremely species-rich 
 communities, probably because antagonistic species prevent it from growing. It is also diluted 
 from species-poor communities,
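
The first option mentioned above, keeping the existing htmlToText() and changing only the URL, could look like the sketch below. `add_query_param` is a hypothetical helper, not part of RCurl, XML, or rvest, and it assumes the URL has no "#fragment" part:

```r
# Hypothetical helper: append a query parameter to a URL.
# Uses "&" when the URL already has a query string, "?" otherwise.
add_query_param <- function(url, param) {
  sep <- ifelse(grepl("?", url, fixed = TRUE), "&", "?")
  paste0(url, sep, param)
}

target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319"
target_np <- add_query_param(target, "np=y")
target_np

# With htmlToText() from the question defined in the session:
# temp.text <- htmlToText(target_np)
```

Since the helper is vectorized over `url`, a character vector of article URLs can be passed in one call and the result handed to htmlToText(), which already loops over its input with lapply().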
