Python and encoding, again

Question

I have the following code snippet in Python (2.7.8) on Windows:

text1 = 'áéíóú'
text2 = text1.encode("utf-8")

It raises the following exception:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

Any ideas?

Answer 1

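You forgot to specify that you are working with a unicode string. In Python 2, 'áéíóú' is a byte string (str), and calling encode() on it first implicitly decodes it with the default ASCII codec, which is why the failure is a UnicodeDecodeError:

text1 = u'áéíóú'  # prefix the string with "u"
text2 = text1.encode("utf-8")

This behavior changed in Python 3: every string is unicode, so you no longer need the prefix.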
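To see where the message comes from, the implicit step can be reproduced explicitly. This is only a sketch: the byte values assume the Windows console used code page 850, where á is the single byte 0xa0. The question does not state the code page, but that assumption matches the "byte 0xa0 in position 0" in the traceback. The names raw and error_message are made up for illustration.

```python
# Bytes that 'áéíóú' would become under code page 850.
# This is an assumption: the question does not say which code page was active.
raw = b'\xa0\x82\xa1\xa2\xa3'

# Python 2's str.encode() first decodes the bytes with the ASCII codec.
# Doing that step explicitly fails the same way on Python 2 and Python 3.
try:
    raw.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The first byte is already outside the ASCII range, so decoding stops at position 0.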

Answer 2

I tried the following code on Linux with Python 2.7:

>>> text1 = 'áéíóú'
>>> text1
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
>>> type(text1)
<type 'str'>
>>> text1.decode("utf-8")
u'\xe1\xe9\xed\xf3\xfa'
>>> print '\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
áéíóú
>>> print u'\xe1\xe9\xed\xf3\xfa'
áéíóú
>>> u'\xe1\xe9\xed\xf3\xfa'.encode('utf-8')
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'

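'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba' is the UTF-8 encoding of áéíóú (two bytes per character), while u'\xe1\xe9\xed\xf3\xfa' is the unicode string, where each escape is the Unicode code point of one character.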
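The difference between code points and UTF-8 bytes can be checked directly. The sketch below uses escapes instead of literal accented characters so it does not depend on the source file encoding; the variable names are illustrative only.

```python
# The unicode string, written with escapes: one code point per character.
text = u'\xe1\xe9\xed\xf3\xfa'

# Code point of each character, as hex.
code_points = [hex(ord(ch)) for ch in text]

# UTF-8 needs two bytes for each of these characters.
utf8_bytes = text.encode('utf-8')

print(code_points)
print(len(text))
print(len(utf8_bytes))
```

Five characters become ten bytes, which is why the str and unicode representations in the session above look so different.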

text1 is a UTF-8 encoded byte string, so the only valid conversion is to decode it to unicode:

text1.decode("utf-8")

A unicode string can then be encoded to a UTF-8 byte string:

u'\xe1\xe9\xed\xf3\xfa'.encode('utf-8')
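Putting the two steps together, converting a byte string to UTF-8 means decoding it with whatever encoding it is actually in, then encoding the result. A minimal sketch follows; to_utf8 is a hypothetical helper, not a standard function, and the Windows bytes assume code page 850, which the question does not confirm.

```python
def to_utf8(raw, source_encoding):
    """Decode bytes from source_encoding, then re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode('utf-8')

# Bytes as seen in the Linux session above (terminal already uses UTF-8).
linux_bytes = b'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'

# Bytes a Windows console might produce (assumed code page 850).
windows_bytes = b'\xa0\x82\xa1\xa2\xa3'

from_linux = to_utf8(linux_bytes, 'utf-8')
from_windows = to_utf8(windows_bytes, 'cp850')

print(from_linux == from_windows)
```

The decode step is where the source encoding has to be named correctly; guessing wrong either raises an error or silently produces the wrong characters.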

python

character-encoding