xpath <p> inside <h3> empty

Question

I have started using xpath in python3 and ran into this behavior, which looks wrong to me. Why does it match the span text, but not the p text inside the h3?

>>> from lxml import etree

>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
[]

>>> result = "<h3><span>Hallo</span></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
['Hallo']

Thanks a lot!

Answer 1

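Your first XPath correctly returns nothing, because the <h3> in the resulting tree does not contain any text nodes. You can use the tostring() method to see the actual content of the tree: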

>>> from lxml import etree
>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> etree.tostring(tree)
b'<html><body><h3/><p>Hallo</p></body></html>'

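The parser probably does this (closes the h3 early, leaving it empty, and makes the p its sibling) because it considers a paragraph inside a heading tag invalid, while a span inside a heading is valid. See: Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?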

To keep the p element inside the h3, you can try a different parser, for example BeautifulSoup's parser:

>>> from lxml import etree
>>> from lxml.html import soupparser
>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = soupparser.fromstring(result)
>>> etree.tostring(tree)
b'<html><h3><p>Hallo</p></h3></html>'
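If the markup happens to be well-formed XML, as this small snippet is, another option may be lxml's XML parser (etree.fromstring) instead of etree.HTML. The XML parser applies no HTML nesting rules, so the p should stay inside the h3. This is only a minimal sketch under that assumption; real-world HTML is often not well-formed XML, and then this parser raises an error. The variable names markup, root and texts are chosen for illustration.

```python
from lxml import etree

markup = "<h3><p>Hallo</p></h3>"

# etree.fromstring uses the XML parser, which does not know HTML's
# content rules and therefore keeps <p> nested inside <h3>.
# Assumption: the input is well-formed XML; otherwise this raises
# etree.XMLSyntaxError.
root = etree.fromstring(markup)

# The same XPath as in the question now finds the text node.
texts = root.xpath('//h3//text()')
print(texts)
```

Here root is the h3 element itself, so there is no html/body wrapper as with etree.HTML.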

python

xpath

lxml

python-3.x