How do I find all "non-parent" nodes in XPath without converting to string?

A while ago I answered my own question, "How do i find all nodes without children (starting from non-root node!) in xpath/R?", after some experimenting.

But sometimes I run into exceptions:

library(magrittr)
library(xml2)
library(rvest)  # provides html_nodes()
url <- "https://kcsouthern.silkroad.com/epostings/index.cfm?fuseaction=app.jobsearch"
node <- url %>% 
   read_html %>% 
   html_nodes(xpath = "/html/body/div[1]/div/div[2]/div[3]/table/tr[2]")

This way I do not get all the nodes without children:

> node %>% html_nodes(xpath = "*//*[not(descendant::*)]")
{xml_nodeset (1)}
[1] <a id="jobTitle_220359" href="index.cfm?fuseaction=app.jobinfo&amp;jobid=…

But after converting to a string and re-reading the XML, I do:

> node %>% 
     toString %>% 
     read_html %>% 
     html_nodes(xpath = "*//*[not(descendant::*)]")
{xml_nodeset (3)}
[1] <td align="center" class="cssSearchResultsBody">220359-021</td>
[2] <a id="jobTitle_220359" href="index.cfm?fuseaction=app.jobinfo&amp;jobid...
[3] <td align="left" class="cssSearchResultsBody">Kansas City, Missouri, United States</td>
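A minimal sketch of why the re-read step can change the result, assuming the difference comes from the context node the relative XPath is evaluated against. The HTML snippet and the `row` class are hypothetical stand-ins for the real page, and only xml2 is used, on the assumption that `html_nodes(xpath = ...)` evaluates XPath the same way as `xml_find_all()`. Evaluated on the tr node, the leading `*` matches the td children; after re-parsing, the same expression starts at the top of a new document, so the leading `*` matches a wrapper element instead and every leaf below it is found.

```r
library(xml2)

# Hypothetical stand-in for one result row of the page.
html <- paste0(
  '<table><tr class="row">',
  '<td>220359-021</td>',
  '<td><a href="#">SAP HR/Payroll Specialist</a></td>',
  '<td>Kansas City</td>',
  '</tr></table>'
)
tr <- xml_find_first(read_html(html), "//tr")

# Context node is the tr: "*" is a td, so only leaves nested inside a td match.
direct <- xml_find_all(tr, "*//*[not(descendant::*)]")

# Context is now the top of the re-parsed document, not the tr.
reparsed_doc <- read_html(as.character(tr))
reparsed <- xml_find_all(reparsed_doc, "*//*[not(descendant::*)]")

xml_name(direct)
xml_name(reparsed)
```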

Edit: further analysis following E. Wiest's answer:

Using the XML package:

> url %>% 
+   GET %>% 
+   content(as = "text") %>% 
+   XML::htmlParse() %>% 
+   XML::xpathSApply(path = "(//tr[@class='cssSearchResultsHighlight'])[1]//*[not(.//*)]")
[[1]]
<td align="center" class="cssSearchResultsBody">220359-021</td> 

[[2]]
<a id="jobTitle_220359" href="....">SAP HR/Payroll Specialist</a> 

[[3]]
<td align="left" class="cssSearchResultsBody">Kansas City, Missouri, United States</td> 

Now the xml2/rvest equivalent (which also seems to work):

> url %>% 
+   read_html %>% 
+   html_nodes(xpath = "//tr[@class='cssSearchResultsHighlight'][1]//*[not(.//*)]")
{xml_nodeset (3)}
[1] <td align="center" class="cssSearchResultsBody">220359-021</td>
[2] <a id="jobTitle_220359" href="index.cfm?fuseaction=app.jobinfo&amp;jobid=220359&amp...
[3] <td align="left" class="cssSearchResultsBody">Kansas City, Missouri, United States</td>
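Note that `(//tr[...])[1]` and `//tr[...][1]` are not strictly the same expression, even though they agree here. A small sketch on a hypothetical two-table document (the `hl` class and cell texts are made up for illustration): the parenthesised form takes the first match in the whole document, while the unparenthesised form takes the first matching tr under each parent.

```r
library(xml2)

# Hypothetical page with two tables, each holding two highlighted rows.
html <- paste0(
  "<table><tr class='hl'><td>a1</td></tr><tr class='hl'><td>a2</td></tr></table>",
  "<table><tr class='hl'><td>b1</td></tr><tr class='hl'><td>b2</td></tr></table>"
)
doc <- read_html(html)

# First matching tr among the children of each parent (one per table).
per_parent <- xml_find_all(doc, "//tr[@class='hl'][1]")

# First matching tr in the whole document.
first_only <- xml_find_all(doc, "(//tr[@class='hl'])[1]")

xml_text(per_parent)
xml_text(first_only)
```

With a single results table, as on the page above, both forms select the same row.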

So the problem seems to be starting the search from a non-root node?

> url %>% 
+   read_html %>% 
+   html_nodes(xpath = "//tr[@class='cssSearchResultsHighlight'][1]") %>% 
+   html_nodes(xpath = "*[not(.//*)]")
{xml_nodeset (2)}
[1] <td align="center" class="cssSearchResultsBody">220359-021</td>
[2] <td align="left" class="cssSearchResultsBody">Kansas City, Missouri, United States</td>

I guess the problem is related to xml2 or rvest. Using xpathSApply directly on the parsed object (htmlParse), I get 3 results. XPath:

(//tr[@class='cssSearchResultsHighlight'])[1]//*[not(.//*)]

R code:

library(httr)
library(XML)
page <- GET("https://kcsouthern.silkroad.com/epostings/index.cfm?fuseaction=app.jobsearch")
parsed <- htmlParse(content(page, as = "text"))
xpathSApply(parsed, "(//tr[@class='cssSearchResultsHighlight'])[1]//*[not(.//*)]")

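Edit 2: In fact, there is no problem at all. What rvest returns is fine: it outputs exactly what each XPath expression asks for. If we isolate the first tr element, we have: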

<tr class="cssSearchResultsHighlight">
<td align="center" class="cssSearchResultsBody">220359-021</td>
<td align="left" class="cssSearchResultsBody"><a id="jobTitle_220359" href="index.cfm?fuseaction=app.jobinfo&amp;jobid=220359&amp;source=ONLINE&amp;JobOwner=992452&amp;company_id=16021&amp;version=1&amp;byBusinessUnit=&amp;bycountry=&amp;bystate=&amp;byRegion=&amp;bylocation=&amp;keywords=&amp;byCat=&amp;proximityCountry=&amp;postalCode=&amp;radiusDistance=&amp;isKilometers=&amp;tosearch=no&amp;city=" class="cssSearchResultsBody">SAP HR/Payroll Specialist</a></td>
<td align="left" class="cssSearchResultsBody">Kansas City, Missouri, United States</td> 
</tr>

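The following code returns 1 result from this tr (the a element). It looks for an element that is a descendant of another element (itself a child of the tr) and that has no child elements: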

url %>%
  read_html %>%
  html_nodes(xpath = "//tr[@class='cssSearchResultsHighlight'][1]") %>%
  html_nodes(xpath = "*//*[not(.//*)]")

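The following code returns 2 results from this tr (the first and third td elements). It looks for an element that is a direct child of the tr and has no child elements: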

url %>%
  read_html %>%
  html_nodes(xpath = "//tr[@class='cssSearchResultsHighlight'][1]") %>%
  html_nodes(xpath = "*[not(.//*)]")

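The following code returns 3 results from this tr (the first and third td elements and the a element). Starting from the tr, it looks anywhere below it for elements with no child elements: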

url %>%
  read_html %>%
  html_nodes(xpath = "//tr[@class='cssSearchResultsHighlight'][1]") %>%
  html_nodes(xpath = ".//*[not(.//*)]")

Code n°3 is probably what you are looking for.

Side note: don't forget to fix your first XPath expression: /html/body/div[1]/div/div[2]/div[3]/table/tr[2] should be /html/body/div[1]/div/div[2]/div[3]/table//tr[2]
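The three variants can also be compared offline. The sketch below is only an illustration: it inlines a shortened copy of the tr shown earlier (the href is replaced by a placeholder `#`), uses xml2's `xml_find_all()` in place of `html_nodes()`, and defines a hypothetical helper `leaf_names()` that is not part of either package.

```r
library(xml2)

# Shortened copy of the isolated row; the href value is a placeholder.
html <- '<table>
<tr class="cssSearchResultsHighlight">
<td align="center" class="cssSearchResultsBody">220359-021</td>
<td align="left" class="cssSearchResultsBody"><a id="jobTitle_220359" href="#" class="cssSearchResultsBody">SAP HR/Payroll Specialist</a></td>
<td align="left" class="cssSearchResultsBody">Kansas City, Missouri, United States</td>
</tr>
</table>'
tr <- xml_find_first(read_html(html), "//tr[@class='cssSearchResultsHighlight']")

# Hypothetical helper: element names matched by an XPath evaluated on the tr.
leaf_names <- function(xpath) xml_name(xml_find_all(tr, xpath))

grandchildren <- leaf_names("*//*[not(.//*)]")  # leaves nested inside a child of tr
children      <- leaf_names("*[not(.//*)]")     # leaves that are direct children of tr
anywhere      <- leaf_names(".//*[not(.//*)]")  # leaves at any depth below tr

grandchildren
children
anywhere
```

The leading `.` in the third expression is what anchors the search at the current node and lets it descend to any depth.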