Why is the en-dash written as '\xe2\x80\x93' in Python?

Question

Specifically, what does each escape in \xe2\x80\x93 do, and why does it need 3 escapes? Trying to decode one by itself leads to an 'unexpected end of data' error.

>>> print(b'\xe2\x80\x93'.decode('utf-8'))
–
>>> print(b'\xe2'.decode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data

Answer 1

You have UTF-8 bytes. UTF-8 is a codec, a standard for representing text as computer-readable data. The U+2013 EN DASH codepoint encodes to those 3 bytes when encoded with that codec.
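A minimal sketch of that round trip, using only the built-in str.encode and bytes.decode methods (the variable names are just for illustration):

```python
# U+2013 EN DASH, written as an escape so the source file stays pure ASCII.
dash = "\u2013"

# Encoding the single codepoint with UTF-8 produces three bytes.
encoded = dash.encode("utf-8")
print(encoded)       # b'\xe2\x80\x93'
print(len(encoded))  # 3

# Decoding all three bytes together gives the original character back.
print(encoded.decode("utf-8") == dash)  # True
```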

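Trying to decode just one such byte as UTF-8 doesn't work because in the UTF-8 standard that one byte doesn't, on its own, carry meaning. In the UTF-8 encoding scheme, a \xe2 byte is the first byte for all codepoints between U+2000 and U+2FFF in the Unicode standard (each of which is encoded with an additional 2 bytes); that's 4096 codepoints in all.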
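One way to check the claim about the \xe2 byte is to encode every codepoint in that range and collect the first byte of each result. This is only a quick sanity check, and the names are hypothetical:

```python
# Every codepoint from U+2000 through U+2FFF (range end is exclusive).
codepoints = range(0x2000, 0x3000)
count = len(codepoints)

# Collect the first byte and the encoded length for each codepoint.
lead_bytes = {chr(cp).encode("utf-8")[0] for cp in codepoints}
lengths = {len(chr(cp).encode("utf-8")) for cp in codepoints}

print(count)                            # 4096
print({hex(b) for b in lead_bytes})     # {'0xe2'}
print(lengths)                          # {3}

# The neighbours just outside the range start with a different byte.
print(hex("\u1fff".encode("utf-8")[0]))  # 0xe1
print(hex("\u3000".encode("utf-8")[0]))  # 0xe3
```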

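Python represents the value of a bytes object in a way that lets you reproduce that value by copying it back into a Python script or terminal. Anything that isn't printable ASCII is represented by a \xhh hex escape. The two characters after \x form the hexadecimal value of the byte, an integer between 0 and 255.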
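As a small illustration, the sketch below mixes printable ASCII with the three en-dash bytes and shows that the repr can be parsed back into the same value. ast.literal_eval is used here simply as a safe stand-in for pasting the text into a script:

```python
import ast

# Printable ASCII ("A" and "Z") surrounds the three UTF-8 bytes of the en dash.
data = b"A\xe2\x80\x93Z"

# repr() shows printable ASCII literally and everything else as \xhh escapes.
text = repr(data)
print(text)  # b'A\xe2\x80\x93Z'

# Parsing that text as a Python literal reproduces the original bytes.
round_tripped = ast.literal_eval(text)
print(round_tripped == data)  # True
```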

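Hexadecimal is a very helpful way to represent bytes because each hex digit, in the range 0 - F, stands for exactly 4 bits, so the two groups of 4 bits in a byte can be written with one character each.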
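To make the digit-per-4-bits correspondence concrete, here is a short sketch that splits the byte E2 into its two halves (the names high and low are illustrative):

```python
byte = 0xE2

# Upper four bits and lower four bits of the byte.
high = byte >> 4
low = byte & 0x0F

print(f"{byte:08b}")            # 11100010
print(f"{high:04b} {low:04b}")  # 1110 0010
print(f"{high:X}{low:X}")       # E2
```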

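\xe2\x80\x93 then means there are three bytes, with the hex values E2, 80 and 93, or 226, 128 and 147 in decimal, respectively. The UTF-8 standard tells a decoder to take the last 4 bits of the first byte, and the last 6 bits of the second and third bytes (the remaining bits signal what type of byte you are dealing with, which is used for error handling). Those 4 + 6 + 6 == 16 bits then encode the hex value 2013 (0010 000000 010011 in binary).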
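The bit arithmetic described above can be sketched by hand. This is not how you would normally decode text (bytes.decode does that), and it handles only the 3-byte case:

```python
# Unpacking a bytes object yields its byte values as integers.
b1, b2, b3 = b"\xe2\x80\x93"
print(b1, b2, b3)  # 226 128 147

# Marker bits: 1110xxxx starts a 3-byte sequence, 10xxxxxx is a continuation.
assert b1 >> 4 == 0b1110
assert b2 >> 6 == 0b10 and b3 >> 6 == 0b10

# Keep the last 4 bits of the first byte and the last 6 bits of the others.
codepoint = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

print(f"{codepoint:016b}")          # 0010000000010011
print(hex(codepoint))               # 0x2013
print(chr(codepoint) == "\u2013")   # True
```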

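You probably want to read up on the difference between codecs (encodings) and Unicode; UTF-8 is a codec that can handle all of the Unicode standard, but the two are not the same thing. See: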

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
Python Unicode HOWTO

