python lxml - selecting xpath without double slash
Selecting elements with an XPath that has no leading double slash does not work when I try it with lxml.html in Python:
import requests
import lxml.html
s = requests.session()
page = s.get('http://lxml.de/')
html = lxml.html.fromstring(page.text)
p = html.xpath('p')
Here p is an empty list. I need to use p = html.xpath('//p') instead.
Does anyone know why?
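The element returned by fromstring is the document root, which is <html>, not <p>. A relative expression such as 'p' selects only <p> elements that are direct children of the context node, and <html> has none (its children are <head> and <body>), so the result is empty.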
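Either use the double slash //p to retrieve all <p> elements in the document, or walk down with an absolute path to a specific <p>. The following demonstrates this with the content of the first paragraph: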
p = html.xpath('/html/body/div/p')
print(p[0].text)
# lxml is the most feature-rich
# and easy-to-use library
# for processing XML and HTML
# in the Python language.
For the first paragraph, this gives the same result as:
p = html.xpath('//p')
print(p[0].text)
# lxml is the most feature-rich
# and easy-to-use library
# for processing XML and HTML
# in the Python language.
To select <p> without a leading slash, first select its parent with a slash-rooted path, then run the relative xpath from that parent:
div = html.xpath('/html/body/div')[0]
p = div.xpath('p')
print(p[0].text)
# lxml is the most feature-rich
# and easy-to-use library
# for processing XML and HTML
# in the Python language.
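The context-node rule described above can also be checked offline. The sketch below swaps in the standard library's xml.etree.ElementTree for lxml so that it runs without network access or third-party packages; the DOC string is a made-up stand-in for the fetched page, and findall('.//p') plays the role that '//p' plays in the lxml examples. It is an illustration of the rule under those assumptions, not a reproduction of the lxml.de page.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the downloaded page (well-formed so ElementTree parses it).
DOC = """
<html>
  <head><title>demo</title></head>
  <body>
    <div>
      <p>first paragraph</p>
      <p>second paragraph</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(DOC)        # the root element is <html>, not <p>

direct = root.findall('p')       # <p> elements that are direct children of <html>
anywhere = root.findall('.//p')  # <p> elements at any depth below <html>
div = root.find('body/div')      # walk down from the root to the parent <div>
from_div = div.findall('p')      # the relative path now starts at <div>

print(root.tag, len(direct), len(anywhere), [p.text for p in from_div])
```

The bare 'p' path only returns something once the element it is called on actually has <p> children, which is why the lookup from div succeeds while the one from the root comes back empty.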