How to get only the text of an element which contains other elements with XPath?
I'm parsing a document with Nokogiri and XPath. I'm interested in the content of a list structured as follows:
<ul>
<li>
<div>
<!-- Some data I'm not interested in -->
</div>
<span>
<a href="some_url">A name I already got easily</a>
<br>
Some text I need to get but just can't
</span>
</li>
<li>
<div>
<!-- Some data I'm not interested in again -->
</div>
<span>
<a href="some_other_url">Another name I already got easily</a>
<br>
Some other text I need to get but just can't
</span>
</li>
.
.
.
</ul>
I am using:
politicians = Array.new
rows = doc.xpath('//ul/li')
rows.each do |row|
politician = OpenStruct.new
politician.name = row.at_xpath('span/a/text()').to_s.strip.upcase
politician.url = row.at_xpath('span/a/@href').to_s.strip
politician.party = row.at_xpath('span').to_s.strip
politicians.push(politician)
end
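This works fine for politician.name and politician.url, but when it comes to politician.party, which is the text after the <br> tag, I can't isolate the text. Using row.at_xpath('span').to_s.strip gives me all the content of the <span> tag, including the other HTML elements.
Any suggestions on how to get this text?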
span/text() returns empty because the first text node in <span> is the whitespace (newlines and spaces) between the span's opening tag and the <a> element. Try the following XPath instead:
span/text()[normalize-space()]
This XPath should return the non-empty text nodes that are direct children of <span>.
I'd do it like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<span>
<a href="some_other_url">Another name I already got easily</a>
<br>
Some other text I need to get but just can't
</span>
EOT
doc.at('span br').next.text # => "\n Some other text I need to get but just can't\n"
or
doc.at('//span/br').next.text # => "\n Some other text I need to get but just can't\n"
Cleaning up the resulting string is easy:
"\n Some other text I need to get but just can't\n".strip # => "Some other text I need to get but just can't"
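The problem with your code is that you aren't drilling down far enough into the DOM to get at what you want, and you're calling the wrong method on what you do find: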
doc.at_xpath('//span').to_s # => "<span>\n <a href=\"some_other_url\">Another name I already got easily</a>\n <br>\n Some other text I need to get but just can't\n</span>"
to_s is the same as to_html: it returns the node as its original markup. Using text strips the tags, which gets you closer, but again you're standing too far back:
doc.at_xpath('//span').text # => "\n Another name I already got easily\n \n Some other text I need to get but just can't\n"
Because <br> isn't a container you can't get its text, but you can still use it to navigate: take its next node, which is the text node, and retrieve that:
doc.at('span br').next.class # => Nokogiri::XML::Text
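When parsing XML/HTML it's very important to point at the actual node you want and then use the appropriate method. If you don't, you'll be forced to jump through hoops trying to get at the data you actually want.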
Putting it all together, I'd do something like:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<span>
<a href="some_other_url">Another name I already got easily</a>
<br>
Some other text I need to get but just can't
</span>
EOT
data = doc.search('span').map{ |span|
name = span.at('a').text
url = span.at('a')['href']
party = span.at('br').next.text.strip
{
name: name,
url: url,
party: party
}
}
# => [{:name=>"Another name I already got easily", :url=>"some_other_url", :party=>"Some other text I need to get but just can't"}]
Which you can fold/spindle/mutilate to your heart's content.
Finally, don't do search('//path/to/some/node/text()').text. You're wasting keystrokes and CPU:
doc = Nokogiri::HTML(<<EOT)
<p>
Some other text I need to get but just can't
</p>
EOT
doc.at('//p') # => #<Nokogiri::XML::Element:0x3fed0841edf0 name="p" children=[#<Nokogiri::XML::Text:0x3fed0841e918 "\n Some other text I need to get but just can't\n">]>
doc.at('//p/text()') # => #<Nokogiri::XML::Text:0x3fed0841e918 "\n Some other text I need to get but just can't\n">
text() returns a text node, but it doesn't return the text.
As a result you're forced to do:
doc.at('//p/text()').text # => "\n Some other text I need to get but just can't\n"
Instead, point at the thing you want and tell Nokogiri to get it:
doc.at('//p').text # => "\n Some other text I need to get but just can't\n"
XPath can point to the node, but that doesn't help when what we want is the text, so simplify the selector.