Why does this parser not find the contents of the XML tag that uses a namespace prefix?

Question

I have this XML code, from this link:

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:nyt="http://www.nytimes.com/namespaces/rss/2.0" version="2.0">
  <channel>
    <item>
      <title>‘This Did Not Go Well’: Inside PG&amp;E’s Blackout Control Room</title>
      <dc:creator>Ivan Penn</dc:creator>
      <pubDate>Sat, 12 Oct 2019 17:03:11 +0000</pubDate>
    </item>
  </channel>
</rss>

When I try to parse it with lxml, following the documentation for xpath and XML namespaces, the parser finds the title (which does not use a namespace) but not the authors/creators. Here is the code:

from lxml import html

xml = """
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:nyt="http://www.nytimes.com/namespaces/rss/2.0" version="2.0">
  <channel>
    <item>
      <title>‘This Did Not Go Well’: Inside PG&amp;E’s Blackout Control Room</title>
      <dc:creator>Ivan Penn</dc:creator>
      <pubDate>Sat, 12 Oct 2019 17:03:11 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


rss = html.fromstring(xml)
items = rss.xpath("//item")
for item in items:
    title = item.xpath("title")[0].text_content().strip()
    print(title)

    ns = {"dc" : "http://purl.org/dc/elements/1.1/"}
    authors = item.xpath("dc:creator", namespaces = ns)
    print(authors)

This code prints:

‘This Did Not Go Well’: Inside PG&E’s Blackout Control Room
[]

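Since it correctly finds the content of the title tag, I assume it is finding the individual <item> elements. Is there something wrong with how I am passing the namespace to xpath?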

Edit: the result is the same whether or not I use a trailing slash, i.e.

ns = {"dc" : "http://purl.org/dc/elements/1.1/"}
ns = {"dc" : "http://purl.org/dc/elements/1.1"}

Answer 1

The HTML parser ignores namespaces. This is the last sentence of the Running HTML doctests section of the lxml documentation:

The HTML parser notably ignores namespaces and some other XMLisms.

Another part of the documentation says:

Also note that the HTML parser is meant to parse HTML documents. For XHTML documents, use the XML parser, which is namespace aware.

You can make your code work by changing

authors = item.xpath("dc:creator", namespaces = ns)

to

authors = item.xpath("creator")

But since RSS is not HTML, consider using the XML parser instead (from lxml import etree).
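A minimal sketch of the XML-parser route, assuming the same feed as in the question. The XML is trimmed here to declare only the dc namespace, and the names results, wrong_ns and no_match are illustrative, not part of the original code. The string is encoded to bytes and starts directly with the XML declaration, because the XML parser is stricter about both than the HTML parser.

```python
from lxml import etree

# Encode to bytes and keep the XML declaration as the very first thing
# in the document; the XML parser is strict about both.
xml = """<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <item>
      <title>‘This Did Not Go Well’: Inside PG&amp;E’s Blackout Control Room</title>
      <dc:creator>Ivan Penn</dc:creator>
      <pubDate>Sat, 12 Oct 2019 17:03:11 +0000</pubDate>
    </item>
  </channel>
</rss>
""".encode("utf-8")

# The prefix used in the XPath is arbitrary; the URI is what must match.
ns = {"dc": "http://purl.org/dc/elements/1.1/"}

rss = etree.fromstring(xml)  # namespace-aware XML parser
results = []
for item in rss.xpath("//item"):
    # <title> is in no namespace, so a plain path works.
    title = item.findtext("title").strip()
    # <dc:creator> is namespaced, so the prefix mapping is required.
    authors = item.xpath("dc:creator/text()", namespaces=ns)
    results.append((title, authors))
    print(title)
    print(authors)

# With the XML parser the URI must match exactly, trailing slash included.
wrong_ns = {"dc": "http://purl.org/dc/elements/1.1"}
no_match = rss.xpath("//dc:creator", namespaces=wrong_ns)
print(no_match)
```

With the XML parser the original dc:creator expression should return the creator, and dropping the trailing slash from the URI should yield an empty list, since namespace URIs are compared as plain strings.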
