R Read & Parse HTML to List

Question

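I have been trying to read and parse some HTML to get a list of conditions for animals at an animal shelter. I'm sure my inexperience with HTML parsing isn't helping, but I don't seem to be getting anywhere fast.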

Here is a snippet of the HTML:

<select multiple="true" name="asilomarCondition" id="asilomarCondition">

    <option value="101">
        Behavior- Aggression, Confrontational-Toward People (mild)
        -
        TM</option>
....
</select>

There is only one <select...> tag; the rest are all <option value=x>.

I have been using the XML library. I can remove the newlines and tabs, but I have had no success removing the tags:

conditions.html <- paste(readLines("Data/evalconditions.txt"), collapse="\n")
conditions.text <- gsub('[\t\n]',"",conditions.html)

As an end result, I would like a list of all the conditions, which I can process further for later use as factor names:

Behavior- Aggression, Confrontational-Toward People (mild)-TM
Behavior- Aggression, Confrontational-Toward People (moderate/severe)-UU
...

I'm not sure whether I need the XML library (or another library) or whether gsub patterns would be sufficient (either way, I need to work out how to use it).

Answer 1

Here is a start using the rvest package:

library(rvest)
#read the html page
page<-read_html("test.html")
#get the text from the "option" nodes and then trim the whitespace
nodes<-trimws(html_text(html_nodes(page, "option")))

#nodes will need additional clean up to remove the excessive spaces 
#and newline characters
nodes<-gsub("\n", "", nodes)
nodes<-gsub("  ", "", nodes)

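The vector nodes should be the result you requested. This example is based on the limited sample provided above, so the actual page may give unexpected results.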
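For a self-contained variant that can be run without the file, here is a minimal sketch that parses an inline copy of the snippet from the question. The second option (including its value "102") is an assumption reconstructed from the desired output, and the names html, raw, conditions and condition_factor are only illustrative. Instead of removing pairs of spaces, it drops each line break together with whatever whitespace surrounds it, which should also cope with tabs or odd indentation:

```r
library(rvest)

# Hypothetical inline sample mirroring the snippet in the question;
# the second <option> and its value "102" are assumed for illustration.
html <- '<select multiple="true" name="asilomarCondition" id="asilomarCondition">
    <option value="101">
        Behavior- Aggression, Confrontational-Toward People (mild)
        -
        TM</option>
    <option value="102">
        Behavior- Aggression, Confrontational-Toward People (moderate/severe)
        -
        UU</option>
</select>'

# read_html() accepts a literal HTML string as well as a file path or URL
page <- read_html(html)

# Extract the text of every <option> node
raw <- html_text(html_nodes(page, "option"))

# Trim both ends, then remove each line break along with the
# whitespace (spaces or tabs) immediately around it
conditions <- gsub("[[:space:]]*\n[[:space:]]*", "", trimws(raw))

# Optional: convert the cleaned labels to a factor for later use
condition_factor <- factor(conditions)
```

With this sample, conditions should hold the two strings in the form listed in the question, and levels(condition_factor) gives the factor names. On the real page it is still worth inspecting the result, since the markup may differ from the snippet.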

r

html-parsing