How to read data from a zip file on a website without downloading the zip file locally

Question

I am using the following code:

import zipfile
import urllib

link = "http://www.dummypage.com/dummyfile.zip"
file_handle = urllib.urlopen(link)
zip_file_object = zipfile.ZipFile(file_handle, 'r')

When I run it, I get the following error. Please help.

Traceback (most recent call last):
  File "fcc.py", line 34, in <module>
    zip_file_object = zipfile.ZipFile(file_handle)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 807, in _RealGetContents
    endrec = _EndRecData(fp)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 208, in _EndRecData
    fpin.seek(0, 2)
AttributeError: addinfourl instance has no attribute 'seek'

Answer 1

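zipfile.ZipFile needs a file-like object that supports seek(), and the object returned by urllib.urlopen does not, which is what the traceback is telling you. You need a stream interface over data held in memory. For text data the usual choice is StringIO; for binary data, such as a zip archive, the right choice is io.BytesIO.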

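import io
import urllib
import zipfile

link = "http://www.dummypage.com/dummyfile.zip"
# Read the whole response into memory and wrap it in a seekable buffer.
file_handle = io.BytesIO(urllib.urlopen(link).read())
# ZipFile can now seek to the end of the buffer to find the directory.
zip_file_object = zipfile.ZipFile(file_handle, 'r')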

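The point is that the file is still downloaded in full, but it is held in memory rather than saved to disk, so you do not have to manage it yourself.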
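For anyone on Python 3, where urllib.urlopen has moved to urllib.request.urlopen, the same idea might look roughly like the sketch below. The helper names zip_from_bytes and open_remote_zip and the 30-second timeout are my own choices for illustration, not part of any library.

```python
import io
import zipfile
from urllib.request import urlopen


def zip_from_bytes(data):
    """Wrap raw zip bytes in a seekable in-memory buffer and open it."""
    return zipfile.ZipFile(io.BytesIO(data), "r")


def open_remote_zip(url, timeout=30):
    """Fetch the whole archive into memory and open it as a ZipFile.

    Hypothetical helper: the entire response is read before the archive
    is opened, so memory use grows with the size of the zip file.
    """
    with urlopen(url, timeout=timeout) as response:
        return zip_from_bytes(response.read())
```

With the placeholder URL from the question, open_remote_zip("http://www.dummypage.com/dummyfile.zip") would return a ZipFile whose namelist() and read(name) methods work as usual. For very large archives this approach may not be suitable, since everything sits in memory at once.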

Answer 2

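Can you use an external tool? @ruario's answer to "Bash - how to unzip a piped zip file (from "wget -qO-")" is very interesting. Basically, zip stores its directory at the end of the file, and zip tools tend to need the whole file before they can reach that directory. However, zip also includes inline headers in front of each member, and some tools can work from those instead. If you don't mind calling out to bsdtar (or another tool), you can do this: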

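import urllib
import shutil
import subprocess as subp

url_handle = urllib.urlopen("test.zip")
# bsdtar reads the archive from stdin ("-f-") and extracts it ("-x")
# into the current working directory.
proc = subp.Popen(['bsdtar', '-xf-'], stdin=subp.PIPE)
# Stream the response into bsdtar without buffering the whole file.
shutil.copyfileobj(url_handle, proc.stdin)
# Closing stdin signals end of input so bsdtar can finish.
proc.stdin.close()
proc.wait()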
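To make the layout point above concrete, here is a small Python 3 sketch that builds a tiny archive in memory and reports where the inline (local) headers and the end-of-file directory sit. The signature byte strings come from the zip format; the helper names build_sample_zip and signature_offsets and the sample member names are made up for this illustration.

```python
import io
import zipfile

LOCAL_HEADER_SIG = b"PK\x03\x04"        # precedes each member's data
CENTRAL_DIR_SIG = b"PK\x01\x02"         # one directory entry per member
END_OF_CENTRAL_DIR_SIG = b"PK\x05\x06"  # final record of the archive


def build_sample_zip():
    """Create a small uncompressed archive entirely in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("a.txt", "first member")
        zf.writestr("b.txt", "second member")
    return buf.getvalue()


def signature_offsets(data):
    """Return byte offsets of the key zip structures in the archive."""
    return {
        "first_local_header": data.find(LOCAL_HEADER_SIG),
        "first_central_entry": data.find(CENTRAL_DIR_SIG),
        "end_of_central_dir": data.rfind(END_OF_CENTRAL_DIR_SIG),
    }


data = build_sample_zip()
for label, offset in signature_offsets(data).items():
    print(label, offset, "of", len(data))
```

The first local header sits at the very start, which is why a streaming tool can begin extracting immediately, while the directory records only appear after all member data, which is why zipfile calls seek() to jump to the end.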

python

python-2.7

python-zipfile