ElementTree.iterparse with streamed and cached request throws ParseError

I have a Flask app that retrieves an XML document from a URL and processes it. I'm using requests_cache with a Redis backend to avoid extra requests, and ElementTree.iterparse to iterate over the streamed content. Here's a sample of my code (the result is the same on both the dev server and the interactive interpreter):

>>> import requests, requests_cache
>>> import xml.etree.ElementTree as ET
>>> requests_cache.install_cache('test', backend='redis', expire_after=300)
>>> url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'
>>> response = requests.get(url, stream=True)
>>> for event, node in ET.iterparse(response.raw):
...     print(node.tag)

Running the above code for the first time throws a ParseError:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1301, in __next__
    self._root = self._parser._close_and_return_root()
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1236, in _close_and_return_root
    root = self._parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

However, running the exact same code again before the cache expires actually prints the expected result! Why does the XML parsing fail only on the first run, and how can I fix it?


Edit: In case it helps, I've noticed that running the same code without the cache results in a different ParseError:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1289, in __next__
    for event in self._parser.read_events():
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1272, in read_events
    raise event
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1230, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

I can tell you why both cases fail. The latter (no cache) fails because the data is gzipped when you read raw on the first request, as opposed to the second, cached read, where you get back already-decompressed data:

If you print the lines:

for line in response.raw:
    print(line)

you see:

�=V���H�������mqn˫+i�������UȣT����F,�-§�ߓ+���G�o~�����7�C�M{�3D����೺C����ݣ�i�����SD�݌.N�&�HF�I�֎�9���J�ķ����s�*H�@$p�o���Ĕ�Y��v�����8}I,��`�cy�����gE�� �!��B�  &|(^���jo�?�^,���H���^~p��a���׫��j�

����a۱Yk<qba�RN6�����l�/�W����{/��߸�G

X�LxH��哫 .���g(�MQ ����Y�q��:&��>s�M�d4�v|��ܓ��k��A17�
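You can also confirm the compression from the response headers rather than eyeballing bytes; a quick check (no cache involved) that should report gzip if the server compressed the body the way the bytes above suggest:

import requests

url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'
response = requests.get(url, stream=True)
# should print 'gzip' if the body is compressed
print(response.headers.get('Content-Encoding'))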

If you then decompress those raw lines:

import zlib

def decomp(raw):
    # 16 + MAX_WBITS tells zlib to expect gzip framing (header and checksum)
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    for line in raw:
        yield decompressor.decompress(line)

for line in decomp(response.raw):
    print(line)

you can see the decompression succeeds:

<?xml version="1.0" encoding="UTF-8"?>
<myanimelist><myinfo><user_id>4731313</user_id><user_name>Doomcat55</user_name><user_watching>3</user_watching><user_completed>120</user_completed><user_onhold>8</user_onhold><user_dropped>41</user_dropped><user_plantowatch>2</user_plantowatch><user_days_spent_watching>27.83</user_days_spent_watching></myinfo><anime><series_animedb_id>64</series_animedb_id><series_title>Rozen Maiden</series_title><series_synonyms>; Rozen Maiden</series_synonyms><series_type>1</series_type><series_episodes>12</series_episodes><series_status>2</series_status><series_start>2004-10-08</series_start><series_end>2004-12-24</series_end><series_image>http://cdn.myanimelist.net/images/anime/2/15728.jpg</series_image>
..................................
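Once you have the decompressed bytes, you can hand them straight to ElementTree. A minimal sketch, assuming no cache is installed so raw still has data, reading the whole stream in one go instead of line by line:

import zlib
import requests
import xml.etree.ElementTree as ET

url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'
response = requests.get(url, stream=True)

# 16 + MAX_WBITS tells zlib to expect gzip framing
xml_bytes = zlib.decompress(response.raw.read(), zlib.MAX_WBITS | 16)
root = ET.fromstring(xml_bytes)
print(root.tag)  # myanimelist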

After the response has been cached, if we read a few bytes:

response.raw.read(39)

you can see we get the decompressed data back:

<?xml version="1.0" encoding="UTF-8"?>
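To confirm that second read really is served from the cache, requests_cache marks each response with a from_cache flag; a quick check, assuming the same install_cache call as in the question:

import requests, requests_cache

requests_cache.install_cache('test', backend='redis', expire_after=300)

url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'
requests.get(url)               # first call hits the network and fills the cache
response = requests.get(url)    # second call should be served from redis
print(response.from_cache)      # True on a cache hit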

Forgetting the cache and passing response.raw straight to iterparse gives:

    raise e
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

because it cannot handle the gzipped data.
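If you skip the cache, one fix is to let urllib3 do the gunzipping for you: requests exposes the underlying urllib3 response as response.raw, and setting its decode_content flag makes read() return decompressed bytes, which is what iterparse calls under the hood. A sketch, assuming no cache is installed:

import requests
import xml.etree.ElementTree as ET

url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'
response = requests.get(url, stream=True)
response.raw.decode_content = True  # urllib3 decompresses transparently on read()

for event, node in ET.iterparse(response.raw):
    print(node.tag)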

Also, on the first run with the cache installed, using:

for line in response.raw:
    print(line)

gives me:

    ValueError: I/O operation on closed file.

That's because the cache has already consumed the data, so there is effectively nothing left to read; I'm not sure you can really use raw together with the cache at all, since the data gets consumed and the underlying file handle is closed.
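If you want to keep the cache and stay with ElementTree, the simplest route I can see is to avoid raw entirely and parse response.content, which requests has already decompressed and which the cache stores in full; wrapping it in BytesIO keeps the iterparse interface. A sketch under those assumptions:

import io
import requests, requests_cache
import xml.etree.ElementTree as ET

requests_cache.install_cache('test', backend='redis', expire_after=300)

url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'
response = requests.get(url)  # works the same on cache misses and hits

for event, node in ET.iterparse(io.BytesIO(response.content)):
    print(node.tag)

You do give up streaming here, since content is read fully into memory first.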

Alternatively, you can get it to work with lxml's fromstringlist:

import requests, requests_cache
import lxml.etree as et

requests_cache.install_cache()

def lazy(resp):
    # iter_content yields chunks that requests has already decompressed,
    # whether they come from the network or from the cache
    for line in resp.iter_content():
        yield line

url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'

response = requests.get(url, stream=True)

# fromstringlist feeds the chunks to the parser and returns the root element
for node in et.fromstringlist(lazy(response)):
    print(node)
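This works because iter_content hands lxml chunks that requests has already gunzipped, and it behaves the same whether the response comes from the network or from redis. Keep in mind that fromstringlist still builds the full tree before returning the root, so it is not incremental in the way iterparse is.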