How to return NA when nothing is found in an xpath?
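This question is hard to state in the abstract, but an example makes it easy to understand.
I am using R to parse HTML.
Below, I have an HTML string named html, and I try to extract all the values in //span[@class="number"] and all the values in //span[@class="surface"]: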
html <- '<div class="line">
<span class="number">Number 1</span>
<span class="surface">Surface 1</span>
</div>
<div class="line">
<span class="surface">Surface 2</span>
</div>'
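library(XML)
# parse the HTML string into an internal document
page = htmlTreeParse(html, useInternalNodes = TRUE, encoding = "UTF-8")
# each XPath runs over the whole document, so a missing span leaves no placeholder
number = unlist(xpathApply(page, '//span[@class="number"]', xmlValue))
surface = unlist(xpathApply(page, '//span[@class="surface"]', xmlValue))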
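The output of number is:
[1] "Number 1"
The output of surface is:
[1] "Surface 1" "Surface 2"
Then, when I try to cbind the two vectors, I can't, because they have different lengths.
So my question is: what can I do to get this output for number:
[1] "Number 1" NA
so that I can then combine number and surface?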
library( 'XML' ) # load library
doc = htmlParse( html ) # parse html
# define xpath expression. div contains class = line, within which span has classes number and surface
xpexpr <- '//div[ @class = "line" ]'
a1 <- lapply( getNodeSet( doc, xpexpr ), function( x ) { # loop through nodeset
y <- xmlSApply( x, xmlValue, trim = TRUE ) # get xmlvalue
names(y) <- xmlApply( x, xmlAttrs ) # get xmlattributes and assign it as names to y
y # return y
} )
Loop over a1, extract the number and surface values, and set the names accordingly. Then column-bind the number and surface values:
nm <- c( 'number', 'surface' )
do.call( 'cbind', lapply( a1, function( x ) setNames( x[ nm ], nm ) ) )
#         [,1]        [,2]
# number  "Number 1"  NA
# surface "Surface 1" "Surface 2"
Data:
html <- '<div class="line">
<span class="number">Number 1</span>
<span class="surface">Surface 1</span>
</div>
<div class="line">
<span class="surface">Surface 2</span>
</div>'
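The same per-div idea can also be written so that each missing span becomes NA directly, which is what the question asks for. This is only a minimal sketch: the helper name first_or_na is hypothetical (it is not part of the XML package), and it assumes each div holds at most one span of a given class.

```r
library(XML)

html <- '<div class="line">
<span class="number">Number 1</span>
<span class="surface">Surface 1</span>
</div>
<div class="line">
<span class="surface">Surface 2</span>
</div>'

doc <- htmlParse(html, asText = TRUE)
# one node per record, so both vectors end up with one entry per div
lines <- getNodeSet(doc, '//div[@class="line"]')

# hypothetical helper: evaluate a relative XPath inside one node and
# fall back to NA when nothing matches
first_or_na <- function(node, xpath) {
  val <- xpathSApply(node, xpath, xmlValue)
  if (length(val) == 0) NA_character_ else val[[1]]
}

number  <- vapply(lines, first_or_na, character(1), xpath = './span[@class="number"]')
surface <- vapply(lines, first_or_na, character(1), xpath = './span[@class="surface"]')

cbind(number, surface)
```

The leading ./ in each expression matters here: it keeps the search inside the current div instead of scanning the whole document again.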
It is easier to select the enclosing tag of each record (the div here) and look for each tag inside it. With rvest and purrr, I find this simpler:
library(rvest)
library(purrr)
html %>% read_html() %>%
    html_nodes('.line') %>%
    map_df(~list(number = .x %>% html_node('.number') %>% html_text(),
                 surface = .x %>% html_node('.surface') %>% html_text()))
#> # A tibble: 2 × 2
#>     number   surface
#>      <chr>     <chr>
#> 1 Number 1 Surface 1
#> 2     <NA> Surface 2
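Since the question is phrased in terms of XPath, here is a sketch of the same rvest approach using the xpath argument instead of CSS selectors, and without purrr. It relies on html_node() returning one result per input node, with a missing node where nothing matches, which html_text() turns into NA. The variable names lines and result are just illustrative.

```r
library(rvest)

html <- '<div class="line">
<span class="number">Number 1</span>
<span class="surface">Surface 1</span>
</div>
<div class="line">
<span class="surface">Surface 2</span>
</div>'

# one node per record
lines <- html_nodes(read_html(html), xpath = '//div[@class="line"]')

# html_node() keeps the result aligned with `lines`, one entry per div
result <- data.frame(
  number  = html_text(html_node(lines, xpath = './/span[@class="number"]')),
  surface = html_text(html_node(lines, xpath = './/span[@class="surface"]')),
  stringsAsFactors = FALSE
)

result
```

Using html_nodes() (plural) for the inner lookup would drop the missing entries again, so the singular html_node() is the important choice.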