Python encoding/decoding 个问题
Python encoding/decoding problems
如何将像这样的字符串 "weren\xe2\x80\x99t" 解码回正常编码。
所以这个词实际上不是 "weren\xe2\x80\x99t"?
例如:
print "\xe2\x80\x9cThings"
string = "\xe2\x80\x9cThings"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')
“Things
“Things
Things
但我其实想得到“东西。
或:
print "weren\xe2\x80\x99t"
string = "weren\xe2\x80\x99t"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')
weren’t
weren’t
werent
但我其实想得到的不是。
我应该怎么做?
你应该提供一个翻译映射,将 unicode 字符映射到其他 unicode 字符(如果你想重新编码,后者应该在 ASCII 范围内):
uni2ascii = {ord('\xe2\x80\x99'.decode('utf-8')): ord("'")}
yourstring.decode('utf-8').translate(uni2ascii).encode('ascii')
print(yourstring) # prints: "weren't"
我映射了最常见的奇怪字符,所以这是基于 Oliver W. 答案的非常完整的答案。
这个功能绝不是理想的,但它是最好的开始。
还有更多的字符定义:
http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&names=-&utf8=string-literal
...
def unicodetoascii(text):
uni2ascii = {
ord('\xe2\x80\x99'.decode('utf-8')): ord("'"),
ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
ord('\xe2\x80\x9d'.decode('utf-8')): ord('"'),
ord('\xe2\x80\x9e'.decode('utf-8')): ord('"'),
ord('\xe2\x80\x9f'.decode('utf-8')): ord('"'),
ord('\xc3\xa9'.decode('utf-8')): ord('e'),
ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
ord('\xe2\x80\x93'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x92'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x98'.decode('utf-8')): ord("'"),
ord('\xe2\x80\x9b'.decode('utf-8')): ord("'"),
ord('\xe2\x80\x90'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x91'.decode('utf-8')): ord('-'),
ord('\xe2\x80\xb2'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb3'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb4'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb5'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb6'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb7'.decode('utf-8')): ord("'"),
ord('\xe2\x81\xba'.decode('utf-8')): ord("+"),
ord('\xe2\x81\xbb'.decode('utf-8')): ord("-"),
ord('\xe2\x81\xbc'.decode('utf-8')): ord("="),
ord('\xe2\x81\xbd'.decode('utf-8')): ord("("),
ord('\xe2\x81\xbe'.decode('utf-8')): ord(")"),
}
return text.decode('utf-8').translate(uni2ascii).encode('ascii')
print unicodetoascii("weren\xe2\x80\x99t")
在 Python 3 我会这样做:
string = "\xe2\x80\x9cThings"
bytes_string = bytes(string, encoding="raw_unicode_escape")
happy_result = bytes_string.decode("utf-8", "strict")
print(happy_result)
无需翻译图,只需代码:)
如何将像这样的字符串 "weren\xe2\x80\x99t" 解码回正常编码。
所以这个词实际上不是 "weren\xe2\x80\x99t"? 例如:
print "\xe2\x80\x9cThings"
string = "\xe2\x80\x9cThings"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')
“Things
“Things
Things
但我其实想得到“东西。
或:
print "weren\xe2\x80\x99t"
string = "weren\xe2\x80\x99t"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')
weren’t
weren’t
werent
但我其实想得到的不是。
我应该怎么做?
你应该提供一个翻译映射,将 unicode 字符映射到其他 unicode 字符(如果你想重新编码,后者应该在 ASCII 范围内):
uni2ascii = {ord('\xe2\x80\x99'.decode('utf-8')): ord("'")}
yourstring.decode('utf-8').translate(uni2ascii).encode('ascii')
print(yourstring) # prints: "weren't"
我映射了最常见的奇怪字符,所以这是基于 Oliver W. 答案的非常完整的答案。
这个功能绝不是理想的,但它是最好的开始。 还有更多的字符定义:
http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&names=-&utf8=string-literal
...
def unicodetoascii(text):
uni2ascii = {
ord('\xe2\x80\x99'.decode('utf-8')): ord("'"),
ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
ord('\xe2\x80\x9d'.decode('utf-8')): ord('"'),
ord('\xe2\x80\x9e'.decode('utf-8')): ord('"'),
ord('\xe2\x80\x9f'.decode('utf-8')): ord('"'),
ord('\xc3\xa9'.decode('utf-8')): ord('e'),
ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
ord('\xe2\x80\x93'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x92'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x98'.decode('utf-8')): ord("'"),
ord('\xe2\x80\x9b'.decode('utf-8')): ord("'"),
ord('\xe2\x80\x90'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x91'.decode('utf-8')): ord('-'),
ord('\xe2\x80\xb2'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb3'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb4'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb5'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb6'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb7'.decode('utf-8')): ord("'"),
ord('\xe2\x81\xba'.decode('utf-8')): ord("+"),
ord('\xe2\x81\xbb'.decode('utf-8')): ord("-"),
ord('\xe2\x81\xbc'.decode('utf-8')): ord("="),
ord('\xe2\x81\xbd'.decode('utf-8')): ord("("),
ord('\xe2\x81\xbe'.decode('utf-8')): ord(")"),
}
return text.decode('utf-8').translate(uni2ascii).encode('ascii')
print unicodetoascii("weren\xe2\x80\x99t")
在 Python 3 我会这样做:
string = "\xe2\x80\x9cThings"
bytes_string = bytes(string, encoding="raw_unicode_escape")
happy_result = bytes_string.decode("utf-8", "strict")
print(happy_result)
无需翻译图,只需代码:)