How to get an attribute of the children of a Nokogiri nodeset

Question

I am trying to use Nokogiri to get the w:rsidR value from every w:ins and w:del item in the following XML:

<w:document mc:Ignorable="w14 w15 wp14" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" xmlns:mv="urn:schemas-microsoft-com:mac:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
    <w:body>
        <w:p w14:paraId="56037BEC" w14:textId="1188FA30" w:rsidR="001665B3" w:rsidRDefault="008B4AC6">
            <w:r>
                <w:t xml:space="preserve">This is the story of a man who </w:t>
            </w:r>
            <w:ins w:author="Mitchell Gould" w:date="2016-09-28T09:15:00Z" w:id="0">
                <w:r w:rsidR="003566BF">
                    <w:t>went</w:t>
                </w:r>
            </w:ins>
            <w:del w:author="Mitchell Gould" w:date="2016-09-28T09:15:00Z" w:id="1">
                <w:r w:rsidDel="003566BF">
                    <w:delText>goes</w:delText>
                </w:r>
            </w:del>
            <w:r>
                <w:t xml:space="preserve">to the store to </w:t>
            </w:r>
            <w:ins w:author="Mitchell Gould" w:date="2016-09-28T09:15:00Z" w:id="2">
                <w:r w:rsidR="003566BF">
                    <w:t>purchase</w:t>
                </w:r>
            </w:ins>
            ...
        </w:p>
    </w:body>
</w:document>

I unzip the docx file with RubyZip as follows:

require 'zip'       # rubyzip gem
require 'nokogiri'

zip = Zip::File.open("test.docx")
doc = zip.find_entry("word/document.xml")
file = Nokogiri::XML.parse(doc.get_input_stream)

This is what I have so far:

file.xpath('//w:ins').each do |n|
  puts n.children
  puts n.children.attr('w:rsidR')
end

which produces:

<w:r w:rsidR="003566BF">
  <w:t>went</w:t>
</w:r>

<w:r w:rsidR="003566BF">
  <w:t>purchase</w:t>
</w:r>

<w:r w:rsidR="008C3761">
  <w:t>replace</w:t>
</w:r>

<w:r w:rsidR="009D3E86">
  <w:t>place</w:t>
</w:r>

<w:r w:rsidR="00F633DF">
  <w:t xml:space="preserve">was </w:t>
</w:r>

<w:r w:rsidR="00D46E57">
  <w:t>was</w:t>
</w:r>

<w:r w:rsidR="00F56399">
  <w:t xml:space="preserve"> sat</w:t>
</w:r>

I can't seem to access w:rsidR correctly. How can I do this? I have only just started using Nokogiri and am having trouble with it.

Answer 1

You can select attribute values in XPath with @:

file.xpath('//w:ins/w:r/@w:rsidR|//w:del/w:r/@w:rsidDel').each do |id|
  puts id
end

The w:r elements inside w:del elements have no w:rsidR attribute, only a w:rsidDel attribute.

Answer 2

As @yenshirak said, inside a w:del tag there is only w:rsidDel.

So I think you can do this:

file.xpath('//w:ins//@w:rsidR|//w:del//@w:rsidDel').map(&:value)

to get an array of their values.

To print them instead, put puts in front and drop the map; to_s is called on each attribute node, which yields its value.

puts file.xpath('//w:ins//@w:rsidR|//w:del//@w:rsidDel')

ruby

nokogiri