How to get only the text of an element which contains other elements with XPath?

Question

I'm parsing a document using Nokogiri and XPath. I'm interested in the content of a list structured as follows:

<ul>
  <li>
    <div>
      <!-- Some data I'm not interested in -->
    </div>
    <span>
      <a href="some_url">A name I already got easily</a>
      <br>
      Some text I need to get but just can't
    </span>
  </li>
  <li>
    <div>
      <!-- Some data I'm not interested in again -->
    </div>
    <span>
      <a href="some_other_url">Another name I already got easily</a>
      <br>
      Some other text I need to get but just can't
    </span>
  </li>
  .
  .
  .
</ul>

I'm using:

politicians = Array.new
rows = doc.xpath('//ul/li')
rows.each do |row|
  politician = OpenStruct.new
  politician.name = row.at_xpath('span/a/text()').to_s.strip.upcase
  politician.url = row.at_xpath('span/a/@href').to_s.strip
  politician.party = row.at_xpath('span').to_s.strip
  politicians.push(politician)
end

This works for politician.name and politician.url, but when it comes to politician.party, which is the text after the <br> tag, I can't isolate the text. Using

row.at_xpath('span').to_s.strip

gives me the entire content of the <span> tag, including the other HTML elements.

Any suggestions on how to get this text?

Answer 1

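span/text() returns empty because the first text node within <span> is the whitespace (a newline and spaces) between the span's opening tag and the <a/> element. Try the following XPath instead: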

span/text()[normalize-space()]

This XPath should return the non-empty text nodes that are direct children of <span>.

Answer 2

I'd do it like this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<span>
  <a href="some_other_url">Another name I already got easily</a>
  <br>
  Some other text I need to get but just can't
</span>
EOT

doc.at('span br').next.text # => "\n  Some other text I need to get but just can't\n"

or

doc.at('//span/br').next.text # => "\n  Some other text I need to get but just can't\n"

Cleaning up the resulting string is easy:

"\n  Some other text I need to get but just can't\n".strip # => "Some other text I need to get but just can't"

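The problem with your code is that you're not drilling far enough into the DOM to reach what you want, and you're calling the wrong method: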

doc.at_xpath('//span').to_s # => "<span>\n  <a href=\"some_other_url\">Another name I already got easily</a>\n  <br>\n  Some other text I need to get but just can't\n</span>"

to_s is the same as to_html and returns the node as its original markup. Using text strips the tags, which gets you closer, but, again, you're standing too far back:

doc.at_xpath('//span').text # => "\n  Another name I already got easily\n  \n  Some other text I need to get but just can't\n"

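Because <br> isn't a container, you can't get its text, but you can still use it to navigate, then get the next node, which is the text node, and retrieve it: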

doc.at('span br').next.class # => Nokogiri::XML::Text

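When parsing XML/HTML, it's very important to point at the actual node you want and then use the appropriate method. If you don't, you'll be forced to jump through hoops trying to get the data you actually want.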

Pulling it all together, I'd do something like:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<span>
  <a href="some_other_url">Another name I already got easily</a>
  <br>
  Some other text I need to get but just can't
</span>
EOT

data = doc.search('span').map{ |span|
  name = span.at('a').text
  url = span.at('a')['href']
  party = span.at('br').next.text.strip

  {
    name: name,
    url: url,
    party: party
  }
}
# => [{:name=>"Another name I already got easily", :url=>"some_other_url", :party=>"Some other text I need to get but just can't"}]

You can fold/spindle/mutilate that as you please.

Finally, don't do search('//path/to/some/node/text()').text. You're wasting keystrokes and CPU:

doc = Nokogiri::HTML(<<EOT)
<p>
  Some other text I need to get but just can't
</p>
EOT

doc.at('//p')        # => #<Nokogiri::XML::Element:0x3fed0841edf0 name="p" children=[#<Nokogiri::XML::Text:0x3fed0841e918 "\n  Some other text I need to get but just can't\n">]>
doc.at('//p/text()') # => #<Nokogiri::XML::Text:0x3fed0841e918 "\n  Some other text I need to get but just can't\n">

text() returns a text node, but it doesn't return the text itself.

As a result you're forced to do:

doc.at('//p/text()').text # => "\n  Some other text I need to get but just can't\n"

Instead, point at what you want and tell Nokogiri to get it:

doc.at('//p').text  # => "\n  Some other text I need to get but just can't\n"

XPath can point to the node, but that doesn't help when what we want is the text, so simplify the selector.
