Why does this parser not find the contents of the XML tag that uses a namespace prefix?
I have this XML, taken from this link:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:nyt="http://www.nytimes.com/namespaces/rss/2.0" version="2.0">
<channel>
<item>
<title>‘This Did Not Go Well’: Inside PG&amp;E’s Blackout Control Room</title>
<dc:creator>Ivan Penn</dc:creator>
<pubDate>Sat, 12 Oct 2019 17:03:11 +0000</pubDate>
</item>
</channel>
</rss>
When I try to parse it with lxml, following the documentation for xpath and XML namespaces, the parser finds the titles (which don't use a namespace) but not the authors/creators, which do:
from lxml import html
xml = """
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:nyt="http://www.nytimes.com/namespaces/rss/2.0" version="2.0">
<channel>
<item>
<title>‘This Did Not Go Well’: Inside PG&amp;E’s Blackout Control Room</title>
<dc:creator>Ivan Penn</dc:creator>
<pubDate>Sat, 12 Oct 2019 17:03:11 +0000</pubDate>
</item>
</channel>
</rss>
"""
rss = html.fromstring(xml)
items = rss.xpath("//item")
for item in items:
    title = item.xpath("title")[0].text_content().strip()
    print(title)
    ns = {"dc" : "http://purl.org/dc/elements/1.1/"}
    authors = item.xpath("dc:creator", namespaces = ns)
    print(authors)
This code prints:
‘This Did Not Go Well’: Inside PG&E’s Blackout Control Room
[]
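Because it correctly finds the contents of the title tags, I think it is finding the individual <item> tags. Is there something wrong with how I'm passing the namespaces to xpath?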
EDIT: The result is the same whether or not I use the trailing slash, i.e.
ns = {"dc" : "http://purl.org/dc/elements/1.1/"}
ns = {"dc" : "http://purl.org/dc/elements/1.1"}
The HTML parser ignores namespaces. This is the last sentence of the Running HTML doctests section of the lxml documentation:
The HTML parser notably ignores namespaces and some other XMLisms.
Another part of the documentation says:
Also note that the HTML parser is meant to parse HTML documents. For XHTML documents, use the XML parser, which is namespace aware.
It works if you change
authors = item.xpath("dc:creator", namespaces = ns)
to
authors = item.xpath("creator")
But since RSS isn't HTML, consider using the XML parser (from lxml import etree).
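For illustration, here is a minimal sketch of what the XML-parser route could look like. It is not taken from the question: it assumes a trimmed copy of the feed (only the dc namespace declared on the root), with the ampersand escaped as &amp; and the XML declaration at the very start of the input, since the XML parser is strict about both. The input is passed as bytes because lxml refuses str input that carries an encoding declaration. The `results` list is just a hypothetical helper for collecting the output.

```python
from lxml import etree

# Trimmed copy of the feed (assumption: only the dc namespace is kept).
# Encoded to bytes because etree.fromstring() rejects str input that
# contains an encoding declaration.
xml = """<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<item>
<title>‘This Did Not Go Well’: Inside PG&amp;E’s Blackout Control Room</title>
<dc:creator>Ivan Penn</dc:creator>
<pubDate>Sat, 12 Oct 2019 17:03:11 +0000</pubDate>
</item>
</channel>
</rss>
""".encode("utf-8")

rss = etree.fromstring(xml)

# The URI must match the one declared in the document exactly,
# including the trailing slash.
ns = {"dc": "http://purl.org/dc/elements/1.1/"}

results = []  # hypothetical helper list, one (title, authors) pair per item
for item in rss.xpath("//item"):
    title = item.findtext("title").strip()
    # The XML parser is namespace aware, so the prefixed lookup matches.
    authors = [el.text for el in item.xpath("dc:creator", namespaces=ns)]
    results.append((title, authors))
    print(title, authors)
```

With the XML parser the prefix in the XPath expression is resolved through the `namespaces` mapping, so the trailing slash in the URI does matter here: the variant without it would not match the namespace declared in the feed.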