How can I avoid explicit decoding of content?

Question

I understand that .encode('utf-8') is necessary.

# -*- coding: utf-8 -*-
import urllib2
url = u'https://fr.wikipedia.org/wiki/Nîmes'
response = urllib2.urlopen(url.encode('utf-8'))
content = response.read().decode('utf-8')
print type(content)

But how can I avoid the .decode('utf-8')? After all, the page in question correctly declares its encoding in the header.

Answer 1

You can use requests:

# -*- coding: utf-8 -*-

import requests
url = u'https://fr.wikipedia.org/wiki/Nîmes'
response = requests.get(url)
content = response.content
text = response.text
assert type(content) == str
assert type(text) == unicode

Answer 2

As you said in the question, you can read the encoding from the headers to avoid hard-coding it:

content = response.read().decode(response.headers.getparam('charset'))
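
One caveat with the one-liner: getparam('charset') returns None when the server does not declare a charset, and the decode call then fails. Below is a minimal sketch of the same idea with a fallback. It assumes Python 3, where urllib2 became urllib.request and the header object offers get_content_charset(); decode_body and its fallback parameter are hypothetical names chosen for illustration, not part of any library.

```python
# Python 3 sketch (assumption); the Python 2 counterpart of the header
# lookup is response.headers.getparam('charset').
from email.message import Message


def decode_body(raw_bytes, headers, fallback='utf-8'):
    """Decode raw_bytes with the charset declared in headers.

    Falls back to `fallback` when the Content-Type header is missing
    or carries no charset parameter.
    """
    charset = headers.get_content_charset() or fallback
    return raw_bytes.decode(charset)


# Offline demonstration with hand-built header objects (no network needed).
declared = Message()
declared['Content-Type'] = 'text/html; charset=ISO-8859-1'
text_declared = decode_body(u'Nîmes'.encode('latin-1'), declared)

undeclared = Message()
undeclared['Content-Type'] = 'text/html'
text_fallback = decode_body(u'Nîmes'.encode('utf-8'), undeclared)
```

With a real response from urllib.request.urlopen, you would pass response.read() and response.headers to the helper, since that header object is also an email.message.Message.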

python

unicode

urllib2