Error with Python and XML

Question

I get an error when I try to read a value from XML. The message is "Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."

Here is my code:

import requests
import lxml.etree
from requests.auth import HTTPBasicAuth

r= requests.get("https://somelinkhere/folder/?parameter=abc", auth=HTTPBasicAuth('username', 'password'))
print r.text

root = lxml.etree.fromstring(r.text)
textelem = root.find("opensearch:totalResults")
print textelem.text

I get this error:

Traceback (most recent call last):
  File "tickets2.py", line 8, in <module>
    root = lxml.etree.fromstring(r.text)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:82934)
  File "src/lxml/parser.pxi", line 1814, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:124471)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

This is what the XML looks like. I am trying to grab the value in the last line.

<feed xmlns="http://www.w3.org/2005/Atom" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:apple-wallpapers="http://www.apple.com/ilife/wallpapers" xmlns:g-custom="http://base.google.com/cns/1.0" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:cc="http://web.resource.org/cc/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:g-core="http://base.google.com/ns/1.0">
  <title>Feed from some link here</title>
  <link rel="self" href="https://somelinkhere/folder/?parameter=abc" />
  <link rel="first" href="https://somelinkhere/folder/?parameter=abc" />
  <id>https://somelinkhere/folder/?parameter=abc</id>
  <updated>2018-03-06T17:48:09Z</updated>
  <dc:creator>company.com</dc:creator>
  <dc:date>2018-03-06T17:48:09Z</dc:date>
  <opensearch:totalResults>4</opensearch:totalResults>

I have tried various changes based on links such as https://twigstechtips.blogspot.com/2013/06/python-lxml-strings-with-encoding.html and http://makble.com/how-to-parse-xml-with-python-and-lxml, but I still run into the same error.

Answer 1

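Use r.content instead of r.text. r.text guesses the text encoding and decodes the body to a Unicode string, whereas r.content gives you the response body as bytes. (See http://docs.python-requests.org/en/latest/user/quickstart/#response-content.)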

You could also use r.raw. See "parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)" for more details.
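The difference between the two inputs can be sketched without a network call. This is only an illustration under assumptions: xml_bytes is a hypothetical stand-in for what r.content would return, and it assumes the real response starts with an XML declaration (the error message implies one, although the excerpt in the question does not show it).

```python
import lxml.etree

# Hypothetical stand-in for r.content: raw bytes, including an
# XML declaration that names the encoding (assumed, see above).
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<feed xmlns="http://www.w3.org/2005/Atom">'
    b'<title>Feed from some link here</title>'
    b'</feed>'
)

# Roughly what r.text holds: the same payload decoded to a str.
xml_text = xml_bytes.decode("utf-8")

# Parsing the decoded string is the call that fails in the question.
# The error is caught here so the sketch keeps running either way.
try:
    lxml.etree.fromstring(xml_text)
    text_error = None
except ValueError as exc:
    text_error = str(exc)
print(text_error)

# Parsing the bytes lets the parser read the declaration itself.
root = lxml.etree.fromstring(xml_bytes)
print(root.tag)
```

The root tag comes back in Clark notation because the feed declares a default namespace, which leads into the next point.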

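Once that is fixed, you will run into a namespace issue. The element you are looking for (opensearch:totalResults) has the prefix opensearch, which is bound to the URI http://a9.com/-/spec/opensearch/1.1/.

You can find the element by combining the namespace URI and the local name (Clark notation):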

{http://a9.com/-/spec/opensearch/1.1/}totalResults

See http://lxml.de/tutorial.html#namespaces for more details.

Here is an example with both changes applied:

os = "{http://a9.com/-/spec/opensearch/1.1/}"

root = lxml.etree.fromstring(r.content)
textelem = root.find(os + "totalResults")
print textelem.text
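As an alternative to spelling out Clark notation, find() also accepts a prefix-to-URI mapping through its namespaces argument. The sketch below compares the two lookups on a trimmed copy of the feed from the question. The xml_bytes literal and the names OPENSEARCH_URI and ns are illustrative, not part of the original code, and the print calls use the function form.

```python
import lxml.etree

# Trimmed, hypothetical copy of the feed, closed so that it parses.
xml_bytes = (
    b'<feed xmlns="http://www.w3.org/2005/Atom" '
    b'xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">'
    b'<title>Feed from some link here</title>'
    b'<opensearch:totalResults>4</opensearch:totalResults>'
    b'</feed>'
)
root = lxml.etree.fromstring(xml_bytes)

OPENSEARCH_URI = "http://a9.com/-/spec/opensearch/1.1/"

# Lookup 1: Clark notation, "{namespace-uri}localname".
clark_elem = root.find("{%s}totalResults" % OPENSEARCH_URI)

# Lookup 2: keep the prefixed name and pass a prefix map.
# The prefix in the map only has to match the one used in the path.
ns = {"opensearch": OPENSEARCH_URI}
mapped_elem = root.find("opensearch:totalResults", namespaces=ns)

print(clark_elem.text)   # 4
print(mapped_elem.text)  # 4
```

Both lookups return the same element. The prefix map tends to read better when several elements from the same namespace are queried.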
