How to select part of a text in R

Question

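I have an HTML file that contains 5 different articles. I would like to extract each of these articles separately in R and run some analysis on each one. Each article starts with <doc> and ends with </doc>, and each also has a document number. Example: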

<doc>
<docno> NA123455-0001 </docno>
<docid> 1 </docid>
<p>
NASA one-year astronaut Scott Kelly speaks after coming home to Houston on  
March 3, 2016. Behind Kelly, 
from left to right: U.S. Second Lady Jill Biden; Kelly's identical in      
brother, Mark; 
John Holdren, Assistant to the President for Science and ...
</p>
</doc>
<doc>
<docno> KA25637-1215 </docno>
<docid> 65 </docid>
<date>
<p>
February 1, 2014, Sunday 
</p>
</date>
<section>
<p>
WASHINGTON -- Former Republican presidential nominee Mitt Romney 
is charging into the increasingly divisive 2016 GOP 
White House sweepstakes Thursday with a harsh takedown of front-runner 
Donald Trump, calling him a "phony" and exhorting fellow 
</p>
</type>
</doc>
<doc>
<docno> JN1234567-1225 </docno>
<docid> 67 </docid>
<date>
<p>
March 5, 2003
</p>
</date>
<section>
<p>
SEOUL—New U.S.-led efforts to cut funding for North Korea's nuclearweapons
program through targeted 
sanctions risk faltering because of Pyongyang's willingness to divert all
available resources to its 
military, even at the risk of economic collapse ...
</p>
</doc>

I have read the URL with the readLines() function and combined all the lines into a single string using

articles <- paste(articles, collapse = " ")

I would like to select the first article (<doc>..</doc>) and assign it to article1, the second article to article2, and so on.

Could you please advise how to construct a function that selects each of these articles?

Answer 1

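You can use strsplit, which splits a string on whatever text or regular expression you give it. It returns a list with one element per input string, and each element is a character vector of the pieces found between the split points, which you can then subset into separate variables as needed. (You could also use other regex functions, if you prefer.)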

splitArticles <- strsplit(articles, '<doc>')

You will still need to strip out the </doc> tags (plus a lot of other cruft, if you only want the text), but it's a start.
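As a rough sketch of that cleanup step, the base R code below splits on <doc>, drops the empty leading piece, and strips the remaining tags. The short articles string is made-up stand-in data, and text_only is a name introduced here for illustration only.

```r
# Stand-in for the collapsed file contents (hypothetical sample data)
articles <- paste(
  "<doc> <docno> A-1 </docno> <docid> 1 </docid> <p> First text </p> </doc>",
  "<doc> <docno> B-2 </docno> <docid> 65 </docid> <p> Second text </p> </doc>"
)

# strsplit returns a list with one element per input string; take the first
pieces <- strsplit(articles, "<doc>", fixed = TRUE)[[1]]

# Drop the empty piece that precedes the first <doc>
pieces <- pieces[nzchar(trimws(pieces))]

# Cut everything from the closing </doc> tag onwards
pieces <- sub("</doc>.*", "", pieces)

# Replace the remaining tags with spaces, then collapse runs of whitespace
text_only <- trimws(gsub("\\s+", " ", gsub("<[^>]+>", " ", pieces)))

article1 <- text_only[1]
article2 <- text_only[2]
```

Note that the docno and docid values are still part of each string after the tags are removed; if only the body text is wanted, those elements would have to be cut out before the tags are stripped.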

A more typical way to do the same thing is to use a package for HTML scraping/parsing. With the rvest package, you would need something like

library(rvest)
read_html(articles) %>% html_nodes('doc') %>% html_text()

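This will give you a character vector with the contents of each <doc> tag. It may need more cleaning, especially if there is stray whitespace. Choosing the selector for html_nodes carefully may help you avoid some of that; it looks like if you use p instead of doc, you are more likely to get just the text.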
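One possible way to combine the two selectors is to select the doc nodes first and then pull the p text out of each one, so the paragraphs stay grouped by article. This is only a sketch: the articles string is made-up sample data, the names docs, ids and texts are illustrative, and it assumes the parser keeps the non-standard doc and docid tags as ordinary elements.

```r
library(rvest)

# Hypothetical stand-in for the collapsed file contents
articles <- paste(
  "<doc> <docid> 1 </docid> <p> First text </p> </doc>",
  "<doc> <docid> 65 </docid> <p> Second text </p> </doc>"
)

# One node per <doc> element
docs <- html_nodes(read_html(articles), "doc")

# The docid of each article, with surrounding whitespace trimmed
ids <- html_text(html_node(docs, "docid"), trim = TRUE)

# For each article, paste together the text of its <p> elements
texts <- vapply(
  docs,
  function(d) paste(html_text(html_nodes(d, "p"), trim = TRUE), collapse = " "),
  character(1)
)
names(texts) <- ids
```

With the articles named by docid, a single one can be picked out with, for example, texts[["65"]].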

Answer 2

The simplest solution is to use strsplit:

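# s is the single string holding the whole file (called articles in the question)
art_list <- unlist(strsplit(s, "<doc>"))
# drop empty or whitespace-only pieces, such as the one before the first <doc>
art_list <- art_list[trimws(art_list) != ""]
# pull the number out of each <docid> tag
ids <- trimws(gsub(".*<docid>|</docid>.*", "", art_list))
# create one variable per article, named article_<docid>
for (i in seq_along(art_list)) {
  assign(paste("article", ids[i], sep = "_"), gsub("</doc>.*", "", art_list[i]))
}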
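If separate variables are not strictly required, a named list may be easier to loop over in the later analysis. The following is a minimal sketch of that alternative, not part of the answer above: the sample string s is made up, and blocks, article_list and first_article are names chosen here for illustration.

```r
# Hypothetical stand-in for the collapsed file contents
s <- paste(
  "<doc> <docid> 1 </docid> <p> First text </p> </doc>",
  "<doc> <docid> 65 </docid> <p> Second text </p> </doc>"
)

# Extract every <doc>...</doc> block; .*? is a non-greedy match
blocks <- regmatches(s, gregexpr("<doc>.*?</doc>", s, perl = TRUE))[[1]]

# Use the trimmed docid of each block as part of its name
ids <- trimws(sub(".*<docid>(.*?)</docid>.*", "\\1", blocks, perl = TRUE))
article_list <- setNames(as.list(blocks), paste0("article_", ids))

# Individual articles can then be accessed by name
first_article <- article_list[["article_1"]]
```

The whole collection can then be processed in one call such as lapply(article_list, nchar), with any analysis function in place of nchar.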


r

text-mining