Unable to reproduce UnicodeDecodeError on Windows with sample data

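I'm working through some code in a book. One example shows how a UnicodeDecodeError is thrown when we try to read or write binary data to a file without specifying an encoding or a binary read/write mode. But I can't reproduce the error shown in the book. Why is the file written as UTF-8 but, as far as I can tell, read as cp1252?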

Note: I'm using the same Python version as the book.

# try to write binary data in write mode
with open('data.bin', 'w') as f:
    f.write(b'\xf1\xf2\xf3\xf4\xf5')
# TypeError: write() argument must be str, not bytes


# write a file with binary data in write-binary mode
with open('data.bin', 'wb') as f:
    f.write(b'\xf1\xf2\xf3\xf4\xf5')
# written file has UTF-8 encoding


# preferred encoding in my system is different
import locale
print(locale.getpreferredencoding())
# cp1252


# read file with binary data and is in UTF-8 encoding
# should fail acc to book, but doesn't
with open('data.bin', 'r') as f:
    print(f.read())
# ñòóôõ
# it's as if encoding is 'cp1252' by default
# EXPECTED:
# Traceback ...
#     UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in 
#         ➥position 0: invalid continuation byte


# book specifies the encoding to get same result as above 
with open('data.bin', 'r', encoding='cp1252') as f:
    data = f.read()

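The file is not written as UTF-8. Binary mode writes the bytes exactly as given, with no encoding involved. What differs is the read: open() without an explicit encoding keyword argument opens the file in the system's default encoding, so on your machine the file gets read as CP1252.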

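Apparently, the book assumes you are on a system where the default encoding is UTF-8, which is true on every remotely sane modern system that isn't Windows (sorry for the repetition). It would be surprising if the book itself didn't actually explain this.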
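To check this without depending on the system default, you can pass the encoding explicitly both ways. The following is a minimal sketch, not the book's code; the temporary directory and the variable names are assumptions made for illustration.

```python
import os
import tempfile

raw = b'\xf1\xf2\xf3\xf4\xf5'

# Binary mode writes the bytes as-is; no encoding is applied.
path = os.path.join(tempfile.mkdtemp(), 'data.bin')
with open(path, 'wb') as f:
    f.write(raw)

# Read as CP1252: each of these five bytes maps to one character.
with open(path, 'r', encoding='cp1252') as f:
    as_cp1252 = f.read()

# Read as UTF-8: the same bytes do not form a valid sequence.
utf8_error = None
try:
    with open(path, 'r', encoding='utf-8') as f:
        f.read()
except UnicodeDecodeError as exc:
    utf8_error = exc

print(ascii(as_cp1252))            # escaped form, safe on any console
print(type(utf8_error).__name__)
```

Passing encoding='utf-8' explicitly should reproduce the error the book shows, regardless of what locale.getpreferredencoding() reports.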

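Your understanding of UTF-8 is apparently incomplete. \xf1\xf2 cannot possibly be valid UTF-8: 0xf1 announces a four-byte sequence, and 0xf2 is not a continuation byte. You can examine the actual UTF-8 encoding of the corresponding code points, for example with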

>>> '\u00f1\u00f2'.encode('utf-8')
b'\xc3\xb1\xc3\xb2'
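The same comparison can be made for all five characters from the question. This is a small sketch with made-up variable names; it only shows that the bytes in data.bin match the CP1252 encoding of the text, not the UTF-8 one.

```python
# The five characters the question ends up printing, written as escapes.
text = '\u00f1\u00f2\u00f3\u00f4\u00f5'

utf8_bytes = text.encode('utf-8')      # two bytes per character here
cp1252_bytes = text.encode('cp1252')   # one byte per character

print(utf8_bytes)
print(cp1252_bytes)
print(len(utf8_bytes), len(cp1252_bytes))
```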

Now is probably a good time to read up on encodings. The Stack Overflow character-encoding tag info page has a brief primer and links to more resources, including Joel Spolsky's standard piece The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). For Python, probably also read Ned Batchelder's Pragmatic Unicode. Both are short enough to read together in one evening, or you could set aside 45 minutes right now, perhaps with some extra time for experimenting as you read.