Avoiding 'character argument not in range' python3 decode

Question

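I am trying to decode the content returned by a requests.get() call to a particular URL. The URL that causes the problem is not always the same from one run of the code to the next, but the part of the response content that triggers it contains a triple backslash, which raises an error when decoded with unicode-escape.

A simplified version of the code, run under Python 3.6.1: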

r=b'\xf0\\\xebI'
r.decode('unicode-escape').strip().replace('{','\n')

produces the following error:

OverflowError: character argument not in range(0x110000)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: decoding with 'unicode-escape' codec failed (OverflowError: character argument not in range(0x110000))

I would like to skip the parts that produce the error. I am a novice Python programmer, and any help is much appreciated.

Answer 1

The data appears to be encoded as latin-1^*, so the simplest solution is to decode it and then remove the backslashes.

>>> r=b'\xf0\\\xebI'
>>> r.decode('latin-1').replace('\\', '')
'ðëI'

^* I am guessing latin-1 (also known as ISO-8859-1). The response's content-type header should specify the encoding used, and it may turn out to be one of the other ISO-8859-* encodings.
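
As a minimal sketch of the footnote above, the charset can be read from the content-type header instead of being guessed. The helper names charset_from_content_type and decode_and_strip_backslashes are hypothetical, the header value is a made-up example, and falling back to latin-1 when no charset is given is an assumption carried over from this answer. The parsing uses the standard library's email.message.Message, so no network call is needed to try it:

```python
from email.message import Message


def charset_from_content_type(content_type, default="latin-1"):
    """Return the charset named in a Content-Type header value.

    Falls back to `default` (an assumption, not something the server
    promises) when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_and_strip_backslashes(raw, content_type):
    """Decode raw bytes with the declared charset, then drop backslashes."""
    encoding = charset_from_content_type(content_type)
    return raw.decode(encoding).replace("\\", "")


# Same bytes as in the question; the header value is illustrative only.
body = b"\xf0\\\xebI"
print(decode_and_strip_backslashes(body, "text/html; charset=ISO-8859-1"))
# ðëI
```

With requests, the raw bytes would come from the response's content attribute and the header value from its headers mapping. Note that this removes every backslash in the body, which is only reasonable if none of them are meaningful.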

Answer 2

These steps should work for your case

In [1]: r=b'\xf0\\\xebI'
#Decode as utf-8, using backslashreplace for the invalid bytes
In [2]: x=r.decode('utf-8', errors='backslashreplace')
In [3]: x
Out[3]: '\\xf0\\\\xebI'
#Replace the extra backslash
In [4]: y = x.replace('\\\\','\\')
In [5]: y
Out[5]: '\\xf0\\xebI'
#Encode to ascii, then decode with unicode-escape
In [6]: z = y.encode('ascii').decode('unicode-escape')
In [7]: z
Out[7]: 'ðëI'

Note that this also works for the double backslash, your normal case

r=b'\xf0\\xebI'
x=r.decode('utf-8', errors='backslashreplace')
y = x.replace('\\\\','\\')
z = y.encode('ascii').decode('unicode-escape')
print(z)
#ðëI
