How to parse an html file with a nested structure?
Using R and the XML package, I have been trying to extract addresses from html files that have a structure similar to the following:
<!DOCTYPE html>
<body>
<div class='entry'>
<span class='name'>Marcus Smith</span>
<span class='town'>New York</span>
<span class='phone'>123456789</span>
</div>
<div class='entry'>
<span class='name'>Henry Higgins</span>
<span class='town'>London</span>
</div>
<div class='entry'>
<span class='name'>Paul Miller</span>
<span class='town'>Boston</span>
<span class='phone'>987654321</span>
</div>
</body>
</html>
I first do the following:
library(XML)
html <- htmlTreeParse("test.html", useInternalNodes = TRUE)
root <- xmlRoot(html)
Now I can get all the names with this:
xpathSApply(root, "//span[@class='name']", xmlValue)
## [1] "Marcus Smith" "Henry Higgins" "Paul Miller"
The problem is that some elements are not present in every address. In the example, this is the phone number:
xpathSApply(root, "//span[@class='phone']", xmlValue)
## [1] "123456789" "987654321"
If I do it this way, I cannot assign the phone numbers to the right person. So I tried extracting each whole address book entry first, as follows:
divs <- getNodeSet(root, "//div[@class='entry']")
divs[[1]]
## <div class="entry">
## <span class="name">Marcus Smith</span>
## <span class="town">New York</span>
## <span class="phone">123456789</span>
## </div>
From the output I thought I had reached my goal and could get, for example, the name corresponding to the first entry as follows:
xpathSApply(divs[[1]], "//span[@class='name']", xmlValue)
## [1] "Marcus Smith" "Henry Higgins" "Paul Miller"
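But even though the output of divs[[1]] shows only the data for Marcus Smith, I get all three names back.
Why is that? What do I have to do to extract the address data in such a way that I know which values of name, town and phone belong together?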
Maybe the xpath expression is the problem, and "//" always goes to the root element?
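That guess matches how XPath works: an expression that starts with // is evaluated from the document root no matter which node it is handed, while .// starts from the context node. A minimal sketch, assuming the XML package and a cut-down inline copy of the sample (txt, doc, all_names and first_name are names made up for this illustration):

```r
library(XML)

# Cut-down copy of the sample, inlined so the sketch is self-contained
txt <- "<html><body>
<div class='entry'><span class='name'>Marcus Smith</span><span class='town'>New York</span><span class='phone'>123456789</span></div>
<div class='entry'><span class='name'>Henry Higgins</span><span class='town'>London</span></div>
</body></html>"

doc  <- htmlParse(txt, asText = TRUE)
divs <- getNodeSet(doc, "//div[@class='entry']")

# Leading "//": searched from the document root, so every name comes back
all_names  <- xpathSApply(divs[[1]], "//span[@class='name']", xmlValue)

# Leading ".//": searched below divs[[1]] only
first_name <- xpathSApply(divs[[1]], ".//span[@class='name']", xmlValue)
```

With the relative form, an entry that lacks a phone span simply yields an empty result for that entry, which can then be turned into NA.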
The following code works with the test data:
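# Build a one-row data frame from a single <div class='entry'> node.
# The XPath has no leading "//", so it is evaluated relative to x.
one.entry <- function(x) {
  name  <- getNodeSet(x, "span[@class='name']")
  phone <- getNodeSet(x, "span[@class='phone']")
  town  <- getNodeSet(x, "span[@class='town']")
  # Use NA when a field is missing (or not unique) in this entry
  name  <- if (length(name) == 1) xmlValue(name[[1]]) else NA
  phone <- if (length(phone) == 1) xmlValue(phone[[1]]) else NA
  town  <- if (length(town) == 1) xmlValue(town[[1]]) else NA
  return(data.frame(name = name, phone = phone, town = town, stringsAsFactors = FALSE))
}
# Apply to every entry and stack the rows
do.call(rbind, lapply(divs, one.entry))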
If the number of items per entry is unknown, you can use dplyr::bind_rows or data.table::rbindlist in combination with rvest, as follows:
library(rvest)
library(dplyr)
# Little helper function to extract all children and set names
extract_info <- function(node) {
  child <- html_children(node)
  as.list(setNames(child %>% html_text(), child %>% html_attr("class")))
}
# txt is not defined in this snippet; it is assumed to hold the HTML above as a string
doc <- read_html(txt)
doc %>% html_nodes(".entry") %>% lapply(extract_info) %>% bind_rows
This gives you:
           name     town     phone
          (chr)    (chr)     (chr)
1  Marcus Smith New York 123456789
2 Henry Higgins   London        NA
3   Paul Miller   Boston 987654321
Alternatively, use rbindlist(fill=TRUE) instead of bind_rows, which gives you a data.table, or use purrr's map_df(as.list) in its place.
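For the rbindlist(fill=TRUE) variant, here is a small sketch, assuming the data.table package is installed. The entries list is hand-written stand-in data in the shape extract_info returns, not scraped output, and res is just an illustrative name:

```r
library(data.table)

# Stand-in for lapply(nodes, extract_info): one named list per entry,
# the second one without a phone element
entries <- list(
  list(name = "Marcus Smith", town = "New York", phone = "123456789"),
  list(name = "Henry Higgins", town = "London")
)

# fill = TRUE matches columns by name and pads the missing ones with NA
res <- rbindlist(entries, fill = TRUE)
```

Because the columns are matched by name, the order of the spans inside each entry should not matter either.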
purrr makes rvest more useful by iterating over the nested nodes and reshaping the resulting list into a data.frame:
library(rvest)
library(purrr)
# html is assumed to be the HTML text or a path to the file,
# not the parsed XML object from the question
html %>% read_html() %>%
  # select all entry divs
  html_nodes('div.entry') %>%
  # for each entry div, select all spans, keeping results in a list element
  map(html_nodes, css = 'span') %>%
  # for each list element, set the name of the text to the class attribute
  map(~setNames(html_text(.x), html_attr(.x, 'class'))) %>%
  # convert named vectors to list elements; convert list to a data.frame
  map_df(as.list) %>%
  # convert character vectors to appropriate types
  # (the original used dmap(), which has since moved from purrr to purrrlyr;
  # type.convert() on the whole data frame is used here instead)
  type.convert(as.is = TRUE)
## # A tibble: 3 x 3
## name town phone
## <chr> <chr> <int>
## 1 Marcus Smith New York 123456789
## 2 Henry Higgins London NA
## 3 Paul Miller Boston 987654321
Of course, you can replace all of the purrr with base R, though that will take more steps.
An ugly base R + rvest solution (but I cheated and used pipes to avoid hellish nested parentheses or temporary assignments) to show just how ++gd @alistaire's solution is:
library(rvest)
library(magrittr)
read_html("<!DOCTYPE html>
<body>
<div class='entry'>
<span class='name'>Marcus Smith</span>
<span class='town'>New York</span>
<span class='phone'>123456789</span>
</div>
<div class='entry'>
<span class='name'>Henry Higgins</span>
<span class='town'>London</span>
</div>
<div class='entry'>
<span class='name'>Paul Miller</span>
<span class='town'>Boston</span>
<span class='phone'>987654321</span>
</div>
</body>
</html>") -> pg
pg %>%
  html_nodes('div.entry') %>%
  lapply(html_nodes, css='span') %>%
  lapply(function(x) {
    setNames(html_text(x), html_attr(x, 'class')) %>%
      as.list() %>%
      as.data.frame(stringsAsFactors=FALSE)
  }) %>%
  lapply(., unlist) %>%
  lapply("[", unique(unlist(c(sapply(., names))))) %>%
  do.call(rbind, .) %>%
  as.data.frame(stringsAsFactors=FALSE)