python lxml - selecting xpath without double slash

Question

The documentation about xpath states that if there is no slash in the xpath, the expression will select elements wherever they are.

But trying this in Python with lxml.html does not work:

import requests
import lxml.html

s = requests.session()
page = s.get('http://lxml.de/')
html = lxml.html.fromstring(page.text)  # html is the root <html> element
p = html.xpath('p')

Here p is an empty list.

I need to use p = html.xpath('//p') instead.

Does anyone know why?

Answer 1

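Most likely the root of the page is <html>, not <p>. A path with no leading slash, such as p, is evaluated relative to the context node, which here is the root <html> element, so it selects only <p> elements that are direct children of <html>. There are none, hence the empty list.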
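The context-node behaviour can be checked offline with a small sketch. The SAMPLE markup below is a hypothetical stand-in, not the real content of the lxml page, and the variable names are illustrative only:

```python
import lxml.html

# Hypothetical stand-in document; the real page's markup will differ.
SAMPLE = """
<html>
  <head><title>demo</title></head>
  <body>
    <div><p>first paragraph</p></div>
    <p>second paragraph</p>
  </body>
</html>
"""

root = lxml.html.fromstring(SAMPLE)

# For a full document, fromstring() returns the <html> element.
root_tag = root.tag

# 'p' selects only <p> elements that are direct children of <html>.
direct_children = root.xpath('p')

# 'body/p' walks down one level at a time, starting from <html>.
body_children = [e.text for e in root.xpath('body/p')]

# '//p' searches the whole document, at any depth.
all_paragraphs = [e.text for e in root.xpath('//p')]

print(root_tag, direct_children, body_children, all_paragraphs)
```

Under these assumptions, the relative path p finds nothing because <html> has only <head> and <body> as children, while body/p and //p both find paragraphs.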

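Either use the double slash, //p, to retrieve all <p> elements, or walk down to a specific <p> with an absolute path. The following demonstrates this with the content of the first paragraph: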

p = html.xpath('/html/body/div/p')

print(p[0].text)
# lxml is the most feature-rich
# and easy-to-use library
# for processing XML and HTML
# in the Python language.

On this page, //p returns the same first paragraph:

p = html.xpath('//p')

print(p[0].text)    
# lxml is the most feature-rich
# and easy-to-use library
# for processing XML and HTML
# in the Python language.

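To select <p> without any leading slash, first obtain its parent element with a slash-based xpath, then run the relative expression on that element: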

div = html.xpath('/html/body/div')[0]
p = div.xpath('p')

print(p[0].text)
# lxml is the most feature-rich
# and easy-to-use library
# for processing XML and HTML
# in the Python language.
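A related point when calling xpath on a sub-element: a leading // still searches the whole document, while .//p stays below the element. A minimal sketch, again using hypothetical markup and illustrative names rather than the real page:

```python
import lxml.html

# Hypothetical markup with two sibling <div> blocks.
SAMPLE = """
<html><body>
  <div id="a"><div><p>inside a</p></div></div>
  <div id="b"><p>inside b</p></div>
</body></html>
"""

root = lxml.html.fromstring(SAMPLE)
div_a = root.xpath('//div[@id="a"]')[0]

# 'p': direct children of div_a only (the <p> is one level deeper).
children = [e.text for e in div_a.xpath('p')]

# './/p': any <p> below div_a, at any depth.
descendants = [e.text for e in div_a.xpath('.//p')]

# '//p': starts from the document root, even when called on div_a.
whole_doc = [e.text for e in div_a.xpath('//p')]

print(children, descendants, whole_doc)
```

So once you hold the element you care about, .//p is usually the expression to reach for if the paragraphs may be nested more than one level down.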

