'utf-8' codec can't decode byte 0x80

Question

I am trying to download the BVLC trained models, but I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 110: invalid start byte

I think it is caused by the following function (complete code):

  # Closure-d function for checking SHA1.
  def model_checks_out(filename=model_filename, sha1=frontmatter['sha1']):
      with open(filename, 'r') as f:
          return hashlib.sha1(f.read()).hexdigest() == sha1

Any idea how to fix this?

Answer 1

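You did not open the file in binary mode, so f.read() tries to read it as UTF-8-encoded text, which fails because the contents are not valid UTF-8. But since we are hashing bytes, not strings, it does not matter what the encoding is, or even whether the file is text at all: just open it in binary mode and read it as bytes.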

>>> with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
Traceback (most recent call last):
  File "<ipython-input-3-fdba09d5390b>", line 1, in <module>
    with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
  File "/home/dsm/sys/pys/Python-3.5.1-bin/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 10: invalid start byte

but

>>> with open("test.h5.bz2","rb") as f: print(hashlib.sha1(f.read()).hexdigest())
21bd89480061c80f347e34594e71c6943ca11325
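To reproduce the difference without the original model file, here is a minimal sketch. The payload bytes and the temporary file are made up for illustration; they are not the BVLC model, just something that starts with the same 0x80 byte.

```python
import hashlib
import os
import tempfile

# Hypothetical stand-in for a downloaded model file: a few bytes that
# are not valid UTF-8 (0x80 can never start a UTF-8 sequence).
payload = b"\x80\x02binary model data"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Text mode: read() has to decode the bytes and fails on 0x80.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode: read() returns bytes, which sha1() accepts directly.
    with open(path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
finally:
    os.remove(path)

print(text_mode_failed)
print(digest == hashlib.sha1(payload).hexdigest())
```

The encoding is passed explicitly here only so the failure does not depend on the platform's default; in the question the default happened to be UTF-8.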

Answer 2

You are opening a file that is not UTF-8 encoded, while your system's default encoding is set to UTF-8.

Since you are calculating a SHA1 hash, you should read the data as binary instead. The hashlib functions require you to pass in bytes:

with open(filename, 'rb') as f:
    return hashlib.sha1(f.read()).hexdigest() == sha1

Note the addition of b in the file mode.

See the open() documentation:

mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r' which means open for reading in text mode. [...] In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)

and from the hashlib module documentation:

You can now feed this object with bytes-like objects (normally bytes) using the update() method.
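Since update() can be called repeatedly, a large model file does not have to be read into memory in one go. A rough sketch of that idea follows; sha1_of_file and its chunk_size parameter are names invented here for illustration and are not part of the original download script.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA1 of the file at path, reading it in chunks."""
    h = hashlib.sha1()
    # Binary mode, so read() yields bytes and no decoding is attempted.
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The check from the question would then become something like sha1_of_file(filename) == sha1, assuming the expected digest is a lowercase hex string.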

Answer 3

I don't know why, since there is no hint in the docs or the source code, but using the b char (I guess for binary) works perfectly (tf-version: 1.1.0):

image_data = tf.gfile.FastGFile(filename, 'rb').read()

For more information, check out: gfile

Tags: python, utf-8, caffe