Char and bytes in Python

Question

While reading this tutorial, I came across the following distinction between the __unicode__ and __str__ methods:

Due to this difference, there’s yet another dunder method in the mix for controlling string conversion in Python 2: __unicode__. In Python 2, __str__ returns bytes, whereas __unicode__ returns characters.

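How exactly are "character" and "byte" defined here? For example, in C a char is one byte, so isn't a char the same thing as a byte? Or does this refer to Unicode characters, which may take up multiple bytes? For example, take the following: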

Ω (omega symbol)
03 A9 or u'\u03a9'

In Python, is this one character (Ω) and two bytes, or two characters (03 A9) and two bytes? Or am I confusing char with character?

Answer 1

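In Python, u'\u03a9' is a string consisting of the single Unicode character Ω (U+03A9). The internal representation of that string is an implementation detail, so it makes no sense to ask about the bytes involved.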

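One source of ambiguity is a string like 'é', which can be either the single character U+00E9 or the two-character string U+0065 U+0301 (the letter e followed by a combining acute accent).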

>>> len(u'\u00e9'); print(u'\u00e9')
1
é
>>> len(u'\u0065\u0301'); print(u'\u0065\u0301')
2
é
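
The two spellings of 'é' above are distinct strings even though they render the same. As a minimal sketch in Python 3 (the variable names are illustrative only), the standard-library unicodedata module can convert between the composed and decomposed forms:

```python
import unicodedata

composed = "\u00e9"          # one code point: U+00E9
decomposed = "\u0065\u0301"  # two code points: 'e' + combining acute accent

# The strings look identical when printed but compare unequal.
print(len(composed), len(decomposed))  # 1 2
print(composed == decomposed)          # False

# NFC composes where possible, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, len(nfc))       # True 1
print(nfd == decomposed, len(nfd))     # True 2
```

Normalizing both sides to the same form before comparing is one way to treat the two spellings as equal.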

The two-byte sequence '\xce\xa9', however, can be interpreted as the UTF-8 encoding of U+03A9.

>>> u'\u03a9'.encode('utf-8')
'\xce\xa9'

>>> '\xce\xa9'.decode('utf-8')
u'\u03a9'

In Python 3, this would be (UTF-8 being the default encoding scheme):

>>> '\u03a9'.encode()
b'\xce\xa9'
>>> b'\xce\xa9'.decode()
'Ω'
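
To make the character/byte distinction concrete, here is a small Python 3 sketch (names are illustrative): iterating over a str yields characters, while iterating over the encoded bytes object yields integers in the range 0 to 255.

```python
text = "\u03a9"             # one character: Ω
data = text.encode("utf-8")  # its UTF-8 encoding: two bytes

chars = list(text)           # elements are one-character strings
byte_values = list(data)     # elements are ints

print(type(text).__name__, len(text), chars)        # str 1 ['Ω']
print(type(data).__name__, len(data), byte_values)  # bytes 2 [206, 169]
```

So the answer to "one character or two?" depends on which object is being measured: the str has length 1, and its UTF-8 encoding has length 2.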

Other byte sequences can be decoded to U+03A9 as well:

>>> b'\xff\xfe\xa9\x03'.decode('utf16')
'Ω'
>>> b'\xff\xfe\x00\x00\xa9\x03\x00\x00'.decode('utf32')
'Ω'
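
As a rough illustration in Python 3 that the byte count belongs to the encoding rather than to the character, the sketch below encodes the same one-character string with several standard codecs and records the length of each result. The choice of codecs is just for demonstration; 'utf-16' and 'utf-32' prepend a byte order mark, while the '-le' variants do not.

```python
omega = "\u03a9"  # a single character

encodings = ["utf-8", "utf-16-le", "utf-16", "utf-32-le", "utf-32"]

# Number of bytes produced by each encoding of the same character.
sizes = {name: len(omega.encode(name)) for name in encodings}
for name, size in sizes.items():
    print(name, size)

# Every encoding round-trips back to the same one-character string.
round_trips = all(omega.encode(name).decode(name) == omega for name in encodings)
print(round_trips)  # True
```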

python

unicode

byte

cpython