How to get all the text excluding text with specific tags with Nokogiri?

Question

I have the following XML:

<w:body>
  <w:p w14:paraId="15812FB6" w14:textId="27A946A1" w:rsidR="001665B3" w:rsidRDefault="00771852">
    <w:r>
      <w:t xml:space="preserve">I am writing this </w:t>
    </w:r>
    <w:ins w:author="Mitchell Gould" w:date="2016-10-04T17:24:00Z" w:id="0">
      <w:r w:rsidR="00A1573E">
        <w:t>text to look</w:t>
      </w:r>
    </w:ins>
    <w:del w:author="Mitchell Gould" w:date="2016-10-04T17:24:00Z" w:id="1">
      <w:r w:rsidDel="00A1573E">
        <w:delText>to test</w:delText>
      </w:r>
    </w:del>
...

I know I can get all of the text with:

only_text_array = @file.search('//text()')

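However, what I actually want is two sets of text:

One containing all the text except the text inside <w:del>...</w:del> elements.
Another containing all the text except the text inside <w:ins>...</w:ins> elements.

How can I do this?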

Answer 1

You can try the following XPath:

//text()[not(ancestor::w:del or ancestor::w:ins)]


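This XPath returns all text nodes that have no w:del or w:ins ancestor. Because it excludes both at once, the two separate sets from the question need one condition each: //text()[not(ancestor::w:del)] for the first set and //text()[not(ancestor::w:ins)] for the second. The w prefix must be bound to a namespace when the query is evaluated.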

Answer 2

I'd do it like this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p class="ignore">foobar</p>
    <p>Keep this</p>
    <p class="ignore2">foobar2</p>
  </body>
</html>
EOT

text1, text2 = %w[.ignore .ignore2].map do |s|
  tmp_doc = doc.dup
  tmp_doc.search(s).remove
  tmp_doc.text.strip
end

text1 # => "Keep this\n    foobar2"
text2 # => "foobar\n    Keep this"

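It iterates over a list of selectors for the unwanted content, dups the document, removes the unwanted nodes from the copy, and returns the document's text after a light cleanup with strip.

dup performs a deep copy by default, so removing nodes does not affect doc.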
