Python: Chinese text comes out as special characters

Question

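I have a scrape/curl request that fetches HTML from another website. The site contains Chinese text, but some of the results look strange. They are displayed like this: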

°¢Àï°Í°ÍÎªÄúÌá¹©ÁË×ÔÁµÕß¹¤³§Ö±ÏúÆ·ÅÆµç×Ó±í ÖÇÄÜÊ±ÉÐ³±Á÷Å®Ê¿ÊÖ»·ÊÖÁ´Ê×ÊÎ±í´øµÈ²úÆ·£¬ÕâÀïÔÆ¼¯ÁËÖÚ¶àµÄ¹©Ó¦ÉÌ£¬²É¹ºÉÌ£¬ÖÆÔìÉÌ¡£ÓûÁË½â¸ü¶à×ÔÁµÕß¹¤³§Ö±ÏúÆ·ÅÆµç×Ó±í ÖÇÄÜÊ±ÉÐ³±Á÷Å®Ê¿ÊÖ»·ÊÖÁ´Ê×ÊÎ±í´øÐÅÏ¢£¬Çë·ÃÎÊ°¢Àï°Í°ÍÅú·¢Íø£¡

This should be Chinese. Here is my code:

str(result.decode('ISO-8859-1'))

Without the 'ISO-8859-1' decode (just returning the result variable), it shows question marks like this:

��Ͱ�Ϊ��ṩ��߹��ֱ��Ʒ�Ƶ��ӱ� ��ʱ�г��Ůʿ�ֻ��α��Ȳ�Ʒ��Ƽ��ڶ�Ĺ�Ӧ�̣��ɹ��̣��̡��˽��߹��ֱ��Ʒ�Ƶ��ӱ� ��ʱ�г��Ůʿ�ֻ��α��Ϣ��ʰ��Ͱ��

Can you help me work out which encode/decode I should use?

Thanks

Answer 1

Chinese text can be in one of several character sets.
The three most common are gb2312, big5 and gbk.
Here is a snippet that converts from gb2312 to utf-8:

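import codecs

# Read the GB2312-encoded input; codecs.open decodes it to Unicode text.
infile = codecs.open("in.txt", "r", "gb2312")
lines = infile.readlines()  # readlines() returns every line, not just the first
infile.close()

print(lines)

# Write the same text back out, encoded as UTF-8.
outfile = codecs.open("out.txt", "w", "utf-8")
outfile.writelines(lines)
outfile.close()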
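
For a scraped response the bytes are usually already in memory, so the same conversion can be done without files. The following is only a minimal sketch: the helper name gb_to_utf8 and the sample string are made up for illustration, and it assumes the source really is GB2312/GBK (gbk is used as the default because it is a superset of gb2312).

```python
def gb_to_utf8(raw, source_encoding="gbk"):
    """Decode GB-encoded bytes to text, then re-encode the text as UTF-8."""
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Hypothetical sample: bytes as they might arrive from a GB2312 page.
sample = "中文".encode("gb2312")
converted = gb_to_utf8(sample)
print(converted.decode("utf-8"))
```

If the page turns out to be Big5 instead, pass source_encoding="big5"; guessing between the two from the bytes alone is not reliable.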

Answer 2

Try this code block.

You can import unquote and use the latin1 encoding to turn the content back into bytes.

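#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Python 3 port: urllib2.unquote is replaced by urllib.parse.unquote_to_bytes.
from urllib.parse import unquote_to_bytes

# Percent-quoted UTF-8 bytes that were mis-read as latin1; encoding back to
# latin1 recovers the raw bytes. \xad is a soft hyphen, which is easily lost
# when the string is copied and pasted.
bytesquoted = 'å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³'.encode('latin1')
unquoted = unquote_to_bytes(bytesquoted)
print(unquoted.decode('utf8'))

Output:

台南 親子餐廳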
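
The same latin1 round trip can also undo the garbling shown in the question, where GBK bytes were decoded as ISO-8859-1. This is a sketch under that assumption; repair_mojibake is a hypothetical helper name, and the real encoding has to be known or guessed.

```python
def repair_mojibake(garbled, real_encoding="gbk"):
    """Undo a wrong ISO-8859-1 decode and decode the bytes correctly."""
    # latin1 maps every byte 0-255 to one code point, so this step is lossless.
    raw = garbled.encode("latin1")
    return raw.decode(real_encoding)


# Reproduce the problem: GBK bytes decoded with the wrong codec.
original = "阿里巴巴"
garbled = original.encode("gbk").decode("latin1")
repaired = repair_mojibake(garbled)
print(repaired)
```

This only works while the mis-decoded string is still intact. Text that already contains replacement characters (the question marks in the question) has lost the original bytes and cannot be repaired this way.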

Answer 3

As @Thu Yein Tun mentioned, there is a very simple solution: check the Content-Type header of the HTTP response for the link. Mine showed text/html;charset=GBK, so I fixed my code like this:

result.decode('gbk')
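
To avoid hard-coding 'gbk', the charset can be read from the Content-Type header value at run time. A rough sketch follows, using the standard library's email.message.Message to parse the header; the function names and the utf-8 fallback are assumptions for illustration, not part of the original code.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the lower-cased charset parameter, or default if it is absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def decode_body(body, content_type):
    """Decode response bytes with the charset declared in the header."""
    charset = charset_from_content_type(content_type)
    # errors="replace" keeps going if a few bytes do not fit the charset.
    return body.decode(charset, errors="replace")


# Hypothetical response: GBK bytes plus the header value seen above.
body = "阿里巴巴".encode("gbk")
print(decode_body(body, "text/html;charset=GBK"))
```

Some pages declare the charset only in an HTML meta tag, or declare the wrong one, so the header value is a good first guess rather than a guarantee.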
