How do I find all "non-parent" nodes in XPath without converting to a string?
A while ago I answered my own question, "How do i find all nodes without children (starting from non-root node!) in xpath/R?", after some experimenting.
But every now and then I run into exceptions:
library(magrittr)
library(xml2)
library(rvest)  # provides html_nodes()
url <- "https://kcsouthern.silkroad.com/epostings/index.cfm?fuseaction=app.jobsearch"
node <- url %>%
read_html %>%
html_nodes(xpath = "/html/body/div[1]/div/div[2]/div[3]/table/tr[2]")
This way I do not find all the nodes without children:
> node %>% html_nodes(xpath = "*//*[not(descendant::*)]")
{xml_nodeset (1)}
[1] <a id="jobTitle_220359" href="index.cfm?fuseaction=app.jobinfo&jobid=…
But after converting to a string and re-reading the XML, I do:
> node %>%
toString %>%
read_html %>%
html_nodes(xpath = "*//*[not(descendant::*)]")
{xml_nodeset (3)}
[1] <td align="center" class="cssSearchResultsBody">220359-021</td>
[2] <a id="jobTitle_220359" href="index.cfm?fuseaction=app.jobinfo&jobid...
[3] <td align="left" class="cssSearchResultsBody">Kansas City, Missouri, United States</td>
Edit: further analysis following E. Wiest's answer:
Using the XML package:
> url %>%
+ GET %>%
+ content(as = "text") %>%
+ XML::htmlParse() %>%
+ XML::xpathSApply(path = "(//tr[@class='cssSearchResultsHighlight'])[1]//*[not(.//*)]")
[[1]]
<td align="center" class="cssSearchResultsBody">220359-021</td>
[[2]]
<a id="jobTitle_220359" href="....">SAP HR/Payroll Specialist</a>
[[3]]
<td align="left" class="cssSearchResultsBody">Kansas City, Missouri, United States</td>
Now the xml2/rvest equivalent (this also seems to work):
> url %>%
+ read_html %>%
+ html_nodes(xpath = "//tr[@class='cssSearchResultsHighlight'][1]//*[not(.//*)]")
{xml_nodeset (3)}
[1] <td align="center" class="cssSearchResultsBody">220359-021</td>
[2] <a id="jobTitle_220359" href="index.cfm?fuseaction=app.jobinfo&jobid=220359&...
[3] <td align="left" class="cssSearchResultsBody">Kansas City, Missouri, United States</td>
So the problem seems to be starting the search from a non-root node?
> url %>%
+ read_html %>%
+ html_nodes(xpath = "//tr[@class='cssSearchResultsHighlight'][1]") %>%
+ html_nodes(xpath = "*[not(.//*)]")
{xml_nodeset (2)}
[1] <td align="center" class="cssSearchResultsBody">220359-021</td>
[2] <td align="left" class="cssSearchResultsBody">Kansas City, Missouri, United States</td>
I guess the problem is related to xml2 or rvest. Using xpathSApply directly on the parsed object (htmlParse), I get 3 results. XPath:
(//tr[@class='cssSearchResultsHighlight'])[1]//*[not(.//*)]
R code:
library(httr)
library(XML)
# Fetch the page and parse the returned HTML text
page <- GET("https://kcsouthern.silkroad.com/epostings/index.cfm?fuseaction=app.jobsearch")
parsed <- htmlParse(content(page, as = "text"))
# Childless elements anywhere below the first highlighted row
xpathSApply(parsed, "(//tr[@class='cssSearchResultsHighlight'])[1]//*[not(.//*)]")
Edit 2: In fact, there is no problem at all. What rvest returns is fine: it outputs exactly what the XPath expression means. If we isolate the first tr element, we have:
<tr class="cssSearchResultsHighlight">
<td align="center" class="cssSearchResultsBody">220359-021</td>
<td align="left" class="cssSearchResultsBody"><a id="jobTitle_220359" href="index.cfm?fuseaction=app.jobinfo&jobid=220359&source=ONLINE&JobOwner=992452&company_id=16021&version=1&byBusinessUnit=&bycountry=&bystate=&byRegion=&bylocation=&keywords=&byCat=&proximityCountry=&postalCode=&radiusDistance=&isKilometers=&tosearch=no&city=" class="cssSearchResultsBody">SAP HR/Payroll Specialist</a></td>
<td align="left" class="cssSearchResultsBody">Kansas City, Missouri, United States</td>
</tr>
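The following code returns 1 result from this tr (the a element). It looks for an element that is a descendant of another element (itself a child of the tr) and that has no child elements: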
url %>%
read_html %>%
html_nodes(xpath = "//tr[@class='cssSearchResultsHighlight'][1]") %>%
html_nodes(xpath = "*//*[not(.//*)]")
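The following code returns 2 results from this tr (the first and third td elements). It looks for an element that is a direct child of the tr and has no child elements: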
url %>%
read_html %>%
html_nodes(xpath = "//tr[@class='cssSearchResultsHighlight'][1]") %>%
html_nodes(xpath = "*[not(.//*)]")
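The following code returns 3 results from this tr (the first and third td elements and the a element). Starting from the tr, it looks anywhere below it for elements without child elements: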
url %>%
read_html %>%
html_nodes(xpath = "//tr[@class='cssSearchResultsHighlight'][1]") %>%
html_nodes(xpath = ".//*[not(.//*)]")
Code n°3 is probably what you are looking for.
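To make the difference between the three expressions easy to reproduce offline, here is a minimal sketch that runs them against a simplified copy of the row. The inline HTML (with a shortened href) and the variable names are assumptions made for illustration; it uses xml2's xml_find_all() directly rather than rvest.

```r
library(xml2)

# Hypothetical, simplified copy of the isolated row (href shortened)
html <- paste0(
  "<table><tr class='cssSearchResultsHighlight'>",
  "<td>220359-021</td>",
  "<td><a href='#'>SAP HR/Payroll Specialist</a></td>",
  "<td>Kansas City, Missouri, United States</td>",
  "</tr></table>"
)
row <- xml_find_first(read_html(html), "//tr[@class='cssSearchResultsHighlight']")

# 1) childless elements located below a child of the row
below_children <- xml_find_all(row, "*//*[not(.//*)]")
# 2) childless elements that are direct children of the row
direct_children <- xml_find_all(row, "*[not(.//*)]")
# 3) childless elements anywhere below the row
all_leaves <- xml_find_all(row, ".//*[not(.//*)]")

xml_name(below_children)   # "a"
xml_name(direct_children)  # "td" "td"
xml_name(all_leaves)       # "td" "a" "td"
```

The only thing that changes between the three calls is the first step of the path: `*` moves to the row's children before searching, while `.` keeps the row itself as the starting point.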
Side note: don't forget to fix your first XPath expression: /html/body/div[1]/div/div[2]/div[3]/table/tr[2]
should be /html/body/div[1]/div/div[2]/div[3]/table//tr[2]
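As for why the string round trip in the question found three nodes: after re-reading, the search no longer starts at the row but at the top of a freshly parsed document, so the row's cells sit deep enough for `*//*` to reach them. The sketch below illustrates this on a simplified row; the inline HTML and variable names are assumptions, and as.character() stands in for the toString call in the question.

```r
library(xml2)

# Hypothetical, simplified row (href shortened)
html <- paste0(
  "<table><tr class='cssSearchResultsHighlight'>",
  "<td>220359-021</td>",
  "<td><a href='#'>SAP HR/Payroll Specialist</a></td>",
  "<td>Kansas City, Missouri, United States</td>",
  "</tr></table>"
)
row <- xml_find_first(read_html(html), "//tr")

# Searching from the row: only the a element lies below a child of the row
from_row <- xml_find_all(row, "*//*[not(descendant::*)]")

# Serialise the row and parse it again: the context is now the new document
reparsed <- read_html(as.character(row))
from_reparsed <- xml_find_all(reparsed, "*//*[not(descendant::*)]")

length(from_row)
length(from_reparsed)
```

Using `.//*[not(descendant::*)]` on the original row gives the same three elements without the detour through a string.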