How to get the raw text from an lxml element with Python

Question

I want to get the following inline text strings from the root element.

from lxml import etree

root = etree.fromstring(
'''<p>
    text-first
    <span>
        Child 1
    </span>
    text-middle
    <span>
        Child 2
    </span>
    text-last
</p>''')

root.text only returns "text-first" (including the surrounding newlines). Here is what I tried with XPath:

>>> build_text_list = etree.XPath("//text()")

>>> texts = build_text_list(root)
>>>
>>> texts
['\n    text-first\n    ', '\n        Child 1\n    ', '\n    text-middle\n    ', '\n        Child 2\n    ', '\n    text-last\n']
>>>
>>> for t in texts:
...     print(t)
...     print(t.__dict__)
...

    text-first

{'_parent': <Element p at 0x10140f638>, 'is_attribute': False, 'attrname': None, 'is_text': True, 'is_tail': False}

        Child 1

{'_parent': <Element span at 0x10140be18>, 'is_attribute': False, 'attrname': None, 'is_text': True, 'is_tail': False}

    text-middle

{'_parent': <Element span at 0x10140be18>, 'is_attribute': False, 'attrname': None, 'is_text': False, 'is_tail': True}

        Child 2

{'_parent': <Element span at 0x10140be60>, 'is_attribute': False, 'attrname': None, 'is_text': True, 'is_tail': False}

    text-last

{'_parent': <Element span at 0x10140be60>, 'is_attribute': False, 'attrname': None, 'is_text': False, 'is_tail': True}
>>>
>>> root.xpath("./p/following-sibling::text()")  # trying the following-sibling axis
[]

So how can I get the text-first/middle/last parts out of this?

Answer 1

My bad, XPath saved me in the end:

>>> root.xpath('child::text()')
['\n    text-first\n    ', '\n    text-middle\n    ', '\n    text-last\n']
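
For anyone who wants to paste and run this, here is a minimal self-contained sketch of the same query in Python 3. The strip() cleanup and the variable names raw and direct_text are my own additions, not part of the original answer.

```python
from lxml import etree

root = etree.fromstring('''<p>
    text-first
    <span>
        Child 1
    </span>
    text-middle
    <span>
        Child 2
    </span>
    text-last
</p>''')

# child::text() selects only the text nodes that are direct children of <p>
raw = root.xpath('child::text()')

# strip() drops the indentation and newlines around each piece
direct_text = [t.strip() for t in raw]
print(direct_text)  # ['text-first', 'text-middle', 'text-last']
```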

Answer 2

print(root.xpath('normalize-space(//*)'))
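
A note on what this one actually returns, as far as I can tell: normalize-space() takes the string value of the first node matched by //* (here the p element itself), which is the concatenation of all descendant text, and then collapses the whitespace. So the result is a single string that also contains the text of the span children. A small sketch on the sample document (the variable name result is just for illustration):

```python
from lxml import etree

root = etree.fromstring('''<p>
    text-first
    <span>
        Child 1
    </span>
    text-middle
    <span>
        Child 2
    </span>
    text-last
</p>''')

# String value of <p> (all descendant text), with whitespace collapsed
result = root.xpath('normalize-space(//*)')
print(result)  # text-first Child 1 text-middle Child 2 text-last
```

That is handy if you want all of the text in one line, but it does not separate the text of p from the text of its children.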

Answer 3

etree is perfectly capable of doing this on its own:

from lxml import etree

root: etree._Element = etree.fromstring(
'''<p>
    text-first
    <span>
        Child 1
    </span>
    text-middle
    <span>
        Child 2
    </span>
    text-last
</p>''')

print(
    root.text,
    root[0].tail,
    root[1].tail,
)

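Every element behaves like a list of its child elements, so the indexes here refer to the two <span> elements. The tail attribute of any element contains the text that immediately follows that element.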

It will of course include the line breaks, so you may want to strip() the results: root.text.strip()
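
If you do not know in advance how many children there are, the same idea generalises to a loop over the children. A minimal sketch, where direct_text is a hypothetical helper name of mine (not part of lxml), and skipping whitespace-only pieces is my own assumption about what is wanted:

```python
from lxml import etree


def direct_text(element):
    """Collect the text that sits directly inside `element`.

    Uses element.text for the text before the first child and
    child.tail for the text following each child.
    """
    parts = [element.text]
    parts.extend(child.tail for child in element)
    # text/tail can be None; also drop pieces that are only whitespace
    return [p.strip() for p in parts if p and p.strip()]


root = etree.fromstring('''<p>
    text-first
    <span>
        Child 1
    </span>
    text-middle
    <span>
        Child 2
    </span>
    text-last
</p>''')

result = direct_text(root)
print(result)  # ['text-first', 'text-middle', 'text-last']
```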

Answer 4

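Your initial guess, //text(), means: select all text nodes, no matter where they are in the document. What you actually want to select are text nodes that are direct children of p, or, put differently, text nodes that are not children of span.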
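
To make that distinction concrete, here is a small sketch comparing the expressions on a compact version of the document (I removed the whitespace so the results are easier to read; the variable names are only for illustration). The third expression is one way to phrase "not a child of span" in XPath 1.0:

```python
from lxml import etree

root = etree.fromstring('<p>a<span>b</span>c<span>d</span>e</p>')

# Every text node, at any depth
everything = root.xpath('//text()')

# Only text nodes that are direct children of the root <p>
children_of_p = root.xpath('/p/text()')

# Every text node whose parent is not a <span>
not_in_span = root.xpath('//text()[not(parent::span)]')

print(everything)     # ['a', 'b', 'c', 'd', 'e']
print(children_of_p)  # ['a', 'c', 'e']
print(not_in_span)    # ['a', 'c', 'e']
```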

Given the input document you show, the most accurate answer is /p/text():

>>> from lxml import etree
>>> root = etree.fromstring(
'''<p>
    text-first
    <span>
        Child 1
    </span>
    text-middle
    <span>
        Child 2
    </span>
    text-last
</p>''')

>>> etree.XPath("/p/text()")(root)
['\n    text-first\n    ', '\n    text-middle\n    ', '\n    text-last\n']

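Your own solution, child::text(), means: select text nodes if they are children of the current context node. It works because, in this case, the XPath expression is evaluated with the root element p as the context. That is also why plain text() works, since child is the default axis and text() is just shorthand for child::text():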

>>> etree.XPath("text()")(root)
['\n    text-first\n    ', '\n    text-middle\n    ', '\n    text-last\n']
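
The role of the context node is easiest to see by evaluating the same relative expression from a different element. A short sketch on a compact version of the document (variable names are mine, for illustration only):

```python
from lxml import etree

root = etree.fromstring('<p>a<span>b</span>c</p>')
span = root[0]

# Relative expression: the result depends on the context node
from_root = root.xpath('text()')
from_span = span.xpath('text()')

# Absolute expression: starts at the document root, whatever the context is
absolute_from_span = span.xpath('/p/text()')

print(from_root)           # ['a', 'c']
print(from_span)           # ['b']
print(absolute_from_span)  # ['a', 'c']
```

So text() and child::text() are only correct as long as you call them on p, while /p/text() keeps working when the call is made from some other element of the same document.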

python

xml

lxml