Cannot decode using utf-8 after encoding with utf-8

Question

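In one situation I had to store data as utf-8, and now, when I try to get the data back with decode('utf-8'), it does not work at all. Take the following line as an example: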

\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87

You can convert the string above to a human-readable form simply by copying the line below:

b"\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87".decode("utf-8")

But I cannot find a way to convert the string to a bytestring without destroying it. I tried the following, and all of them failed:

.decode("utf-8")
.decode()
.bytes()

So far I have not been able to find a solution on SO or anywhere else. Any help is appreciated.

Answer 1

"\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
b"\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"

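The two lines above (both taken from the question) are specific instances of a string literal and a bytes literal, respectively. The Python documentation on String and Bytes literals describes the escape they use as follows: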

\xhh    Character with hex value hh (²,³)

² Unlike in Standard C, exactly two hex digits are required.

³ In a bytes literal, hexadecimal and octal escapes denote the byte with the given value. In a string literal, these escapes denote a Unicode character with the given value.
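As a small illustration of footnote ³, the sketch below writes the same two escapes once as a string literal and once as a bytes literal. The variable names are only for illustration, and the snippet assumes Python 3. It also shows why the .decode() attempts in the question fail: a str object has no decode method at all.

```python
# The same escapes, written as a str literal and as a bytes literal.
as_text = "\xd8\xb3"    # two Unicode characters: U+00D8 and U+00B3
as_bytes = b"\xd8\xb3"  # two raw bytes: 0xD8 and 0xB3

print(len(as_text), [hex(ord(ch)) for ch in as_text])  # 2 ['0xd8', '0xb3']
print(len(as_bytes), list(as_bytes))                   # 2 [216, 179]

# Only the bytes object can be decoded; together the two bytes form
# the UTF-8 encoding of a single character, U+0633.
decoded = as_bytes.decode("utf-8")
print(len(decoded), hex(ord(decoded)))                 # 1 0x633

# In Python 3 a str has no .decode() method.
print(hasattr(as_text, "decode"))                      # False
```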

Let's inspect a string defined this way (at the Python prompt):

>>> xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
>>> xstr
'\r\nØ³Ø§Ù\x82Û\x8câ\x80\x8cÙ\x86Ø§Ù\x85Ù\x87'
>>> print( xstr)

Ø³Ø§ÙÛâÙØ§Ù
Ù
>>>

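Clearly, the print( xstr) output does not resemble a word in any known language. However, all of its characters (by definition) lie in the Unicode range r'[\u0000-\u00ff]', i.e. the first 256 characters of Unicode, and, lo and behold, that range is exactly iso-8859-1, aka 'latin1': encoding with latin1 maps each of those characters to the single byte with the same value.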
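The claim about the first 256 code points can be checked directly. The following is a minimal sketch, assuming Python 3; the name raw is only illustrative. It confirms that every character of xstr is at most U+00FF and that encoding with 'latin1' yields exactly one byte per character, with the same numeric value.

```python
xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"

# Every character lies in the range U+0000..U+00FF.
print(all(ord(ch) <= 0xFF for ch in xstr))    # True

# latin1 maps code point n to the single byte n, so nothing is lost.
raw = xstr.encode("latin1")
print(len(xstr), len(raw))                    # 21 21
print(list(raw) == [ord(ch) for ch in xstr])  # True

# The mapping is one-to-one for all 256 byte values.
all_bytes = bytes(range(256))
print(all_bytes.decode("latin1").encode("latin1") == all_bytes)  # True
```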

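So we need to get an encoded version of the xstr string as a bytes object, e.g. using the str.encode method or the built-in bytes() function, and then decode those bytes as utf-8 (the default for bytes.decode):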

print(bytes(xstr, 'latin1').decode())
print(xstr.encode('latin1').decode())

ساقی‌نامه

ساقی‌نامه
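If this conversion is needed in more than one place, it could be wrapped in a small helper. The sketch below is only one possible way to do it: the function name redecode_utf8 is hypothetical, and returning the input unchanged when it does not fit the pattern is an assumption of this example, not something the approach above requires.

```python
def redecode_utf8(text: str) -> str:
    """Reinterpret a str whose code points are really UTF-8 byte values.

    Hypothetical helper: returns the input unchanged if it cannot be
    such a string.
    """
    try:
        # Map each character U+0000..U+00FF back to the byte of equal value.
        raw = text.encode("latin1")
    except UnicodeEncodeError:
        # A character above U+00FF is present, so this is not the case
        # described above; leave the text alone.
        return text
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes are not valid UTF-8; leave the text alone.
        return text


xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
fixed = redecode_utf8(xstr)
print(fixed)

# Text that is already correct passes through untouched.
print(redecode_utf8(fixed) == fixed)  # True
```

Note that plain ASCII text is returned as-is either way, since ASCII is encoded identically in latin1 and utf-8.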

Tags: decode, utf-8, python-3.x