xpath on etree element yielding unexpected result

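I am running xpath on an XML feed to select the elements with the "item" tag. From the resulting list, I take the first result and run xpath on it to select the "title" tag. However, when I select "title", I get a title from a part of the XML that is not inside an "item" element. Since I am running the xpath on the "item" result, this behavior is unexpected. Can anyone tell me what is going on?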

Please see the code using xpath below.

from urllib.request import urlopen
from lxml import etree
url = 'https://www.sec.gov/Archives/edgar/monthly/xbrlrss-2018-02.xml'
data = urlopen(url)
xml = data.read()
parser = etree.XMLParser(remove_blank_text=True, huge_tree=True)
root = etree.XML(xml, parser=parser)
items = root.xpath("//item")
first_item = items[0]
title = first_item.xpath("//title")[0].text
print(title)
# prints: All XBRL Data Submitted to the SEC for 2018-02

I expected the first item to be as follows:

<item>
<title>DST SYSTEMS INC (0000714603) (Filer)</title>
<link>http://www.sec.gov/Archives/edgar/data/714603/000071460318000013/0000714603-18-000013-index.htm</link>
<guid>http://www.sec.gov/Archives/edgar/data/714603/000071460318000013/0000714603-18-000013-xbrl.zip</guid>
<enclosure url="http://www.sec.gov/Archives/edgar/data/714603/000071460318000013/0000714603-18-000013-xbrl.zip" length="470442" type="application/zip" />
<description>10-K</description>
<pubDate>Wed, 28 Feb 2018 17:29:39 EST</pubDate>
<edgar:xbrlFiling xmlns:edgar="http://www.sec.gov/Archives/edgar"></item>

Instead, when I do title = first_item.xpath("//title")[0].text, the title I get is 'All XBRL Data Submitted to the SEC for 2018-02'.

That title comes from:

<channel>
<title>All XBRL Data Submitted to the SEC for 2018-02</title>
<link>http://www.sec.gov/spotlight/xbrl/filings-and-feeds.shtml</link>
<atom:link xmlns:atom="http://www.w3.org/2005/Atom" href="http://www.sec.gov/Archives/edgar/monthly/xbrlrss-2018-02.xml" rel="self" type="application/rss+xml" />
<description>This is a list all of the filings containing XBRL for 2018-02</description>
<language>en-us</language>
<pubDate>Wed, 28 Feb 2018 00:00:00 EST</pubDate>
<lastBuildDate>Wed, 28 Feb 2018 00:00:00 EST</lastBuildDate>

But I ran the xpath on the item, which really did come from xpath("//item"). I am not sure why I am not getting the expected result of 'DST SYSTEMS INC (0000714603) (Filer)'.

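Instead of:

title = first_item.xpath("//title")[0].text

use:

title = first_item.xpath("title")[0].text

The difference is the "//" in front of "title".

The reason is that "//title" is an absolute path: it selects all title elements no matter where they are in the document, even when the expression is evaluated on first_item. Using just "title" is a relative path that selects only the child elements of first_item named "title". To match title elements at any depth below first_item, use ".//title".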
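
To make the difference concrete, here is a minimal sketch that runs the three expressions against a small in-memory document. The SAMPLE feed and its titles are made up for illustration and only mimic the channel/item layout of the SEC file; nothing is downloaded.

```python
from lxml import etree

# Hypothetical, trimmed-down feed standing in for the real SEC RSS file.
SAMPLE = b"""<rss>
  <channel>
    <title>Channel title</title>
    <item>
      <title>First item title</title>
    </item>
    <item>
      <title>Second item title</title>
    </item>
  </channel>
</rss>"""

root = etree.XML(SAMPLE)
first_item = root.xpath("//item")[0]

# "//title" is absolute: the search starts at the document root,
# so the context node (first_item) does not restrict the result.
absolute = [t.text for t in first_item.xpath("//title")]

# "title" is relative: only direct children of first_item are matched.
child = [t.text for t in first_item.xpath("title")]

# ".//title" is also relative: descendants of first_item at any depth.
descendant = [t.text for t in first_item.xpath(".//title")]

print(absolute)    # all three titles, channel title first
print(child)       # only the first item's title
print(descendant)  # only the first item's title
```

In the original feed the channel title comes first in document order, which is why indexing "//title" with [0] returned 'All XBRL Data Submitted to the SEC for 2018-02'.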