Unicode Encode Error when writing to file

Question

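I know this is a frequently recurring problem with Python 2.x. I'm currently working with Python 2.7. The text I want to write to a tab-delimited text file is pulled from a SQL Server 2012 database table, and the server collation is set to SQL_Latin1_General_CP1_CI_AS.

The exceptions I get tend to vary, but essentially they are: UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 57: ordinal not in range(128)

or UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 308: ordinal not in range(128)

Below is what I usually turn to, but it still results in an error: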

from kitchen.text.converters import getwriter
with open("output.txt", 'a') as myfile:
    #content processing done here
    #title is text pulled directly from database
    #just_text is content pulled from raw html inserted into beautiful soup
    #    and using its .get_text() to just retrieve the text content
    UTF8Writer = getwriter('utf8')
    myfile = UTF8Writer(myfile)
    myfile.write(text + '\t' + just_text)

I have also tried:

# also performed for just_text and still resulting in exceptions
title = title.encode('utf-8')

and

title = title.decode('latin-1')
title = title.encode('utf-8')

and

title = unicode(title, 'latin-1')

I have also tried replacing the with open() call with:

with codecs.open("codingOutput.txt", mode='a', encoding='utf-8') as myfile:

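I'm not sure what I'm doing wrong or what I've forgotten to do. I have also swapped encode with decode, in case I had been doing the encoding/decoding backwards. No success.

Any help would be greatly appreciated.

Update

I have added print repr(title) and print repr(just_text), both where I first retrieve title from the database and where I call .get_text(). Not sure how much this helps, but...

For title I get: <type 'str'>
For just_text I get: <type 'unicode'>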

Errors

These are the errors I get from content pulled with the BeautifulSoup Summary() function.

C:\Python27\lib\site-packages\bs4\dammit.py:269: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
C:\Python27\lib\site-packages\bs4\dammit.py:273: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
C:\Python27\lib\site-packages\bs4\dammit.py:277: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:3] == b'\xef\xbb\xbf':
C:\Python27\lib\site-packages\bs4\dammit.py:280: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:4] == b'\x00\x00\xfe\xff':
C:\Python27\lib\site-packages\bs4\dammit.py:283: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:4] == b'\xff\xfe\x00\x00':

ValueError: Expected a bytes object, not a unicode object

The traceback portion is:

File <myfile>, line 39, in <module>
  summary_soup = BeautifulSoup(page_summary)
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 193, in __init__
  self.builder.prepare_markup(markup, from_encoding)):
File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 99, in prepare_markup
  for encoding in detector.encodings:
File "C:\Python27\lib\site-packages\bs4\dammit.py", line 256, in encodings
  self.chardet_encoding = chardet_dammit(self.markup)
File "C:\Python27\lib\site-packages\bs4\dammit.py", line 31, in chardet_dammit
  return chardet.detect(s)['encoding']
File "C:\Python27\lib\site-packages\chardet\__init__.py", line 25, in detect
  raise ValueError('Expected a bytes object, not a unicode object')
ValueError: Expected a bytes object, not a unicode object

Answer 1

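Here are a few suggestions. Everything has an encoding. Your problem comes down to working out the encodings of the various parts, converting them to one common format, and writing the result to a file.

I recommend choosing utf-8 as the output encoding.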

# title is a byte string (str) from the database; just_text is already unicode
with open('output', 'w') as f:
    unistr = title.decode("latin-1") + "\t" + just_text
    f.write(unistr.encode("utf-8"))

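Beautiful Soup's get_text returns Python's unicode type. decode("latin-1") should turn your database content into the unicode type as well, which is then joined with a tab before the utf-8 encoded bytes are written.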
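The same idea can also be sketched with io.open, which encodes unicode text as it is written, so no explicit .encode() call is needed. This is only an illustration: the sample values for title and just_text and the file name output.txt are invented stand-ins, and latin-1 is assumed for the database bytes as in the snippet above.

```python
import io

# Invented stand-ins for the real data (assumptions, not from the question):
title = b'Caf\xe9\xa0menu'              # latin-1 bytes, like a str value from the database
just_text = u'\u201cquoted\u201d text'  # unicode, like the result of get_text()

# Decode the byte string once, then join everything as unicode text.
line = title.decode('latin-1') + u'\t' + just_text + u'\n'

# io.open encodes the unicode text to utf-8 while writing.
with io.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(line)
```

Opening the file this way means every write must receive unicode text, which makes an accidental byte string show up immediately instead of deep inside a concatenation.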

Answer 2

The problem is that you are mixing bytes and Unicode text:

>>> u'\xe9'.encode('utf-8') + '\t' + u'x'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

where u'\xe9'.encode('utf-8') is a bytestring that encodes the é character (U+00e9) using the utf-8 encoding, and u'x' is Unicode text that contains the x character (U+0078).
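For reference, the implicit step that Python 2 performs in that concatenation can be written out explicitly. A minimal sketch, reusing the same é example: with the default settings the byte string is first decoded with the ascii codec, and that decode is what raises.

```python
encoded = u'\xe9'.encode('utf-8')   # the two bytes 0xc3 0xa9

try:
    # Python 2 does this implicitly when a byte string is joined with unicode text.
    encoded.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(repr(error))

# Decoding with the codec that actually produced the bytes works.
decoded = encoded.decode('utf-8')
```

This is why the reported byte (0xc3 here, 0xa0 or 0xe2 in the question) is always the first non-ASCII byte of whichever byte string took part in the concatenation.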

The solution is to use Unicode text:

>>> print u'\xe9' + '\t' + u'x'
é       x

BeautifulSoup accepts Unicode input:

>>> import bs4
>>> bs4.BeautifulSoup(u'\xe9' + '\t' + u'x')
<html><body><p>é        x</p></body></html>
>>> bs4.__version__
'4.2.1'

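Avoid unnecessary conversions to/from Unicode. Decode input data to Unicode once, use it everywhere in the program to represent text, and encode the output to bytes at the end (if necessary):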

with open('output.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))
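One way to apply this rule to the tab-delimited output is a small helper at the input boundary. The sketch below rests on assumptions: to_text is a hypothetical helper (not part of any library), latin-1 is assumed for the database bytes as in the question, and the sample values and the file name rows.txt are invented.

```python
def to_text(value, encoding='latin-1'):
    """Return unicode text; decode only when value is a byte string."""
    if isinstance(value, bytes):   # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value

# Invented sample inputs: one byte string, one unicode string.
title = to_text(b'r\xe9sum\xe9')
just_text = to_text(u'na\xefve text')

# Inside the program everything is unicode from here on.
row = u'\t'.join([title, just_text])

# Encode once, at the output boundary, and write the bytes in binary mode.
payload = row.encode('utf-8')
with open('rows.txt', 'wb') as out:
    out.write(payload + b'\n')
```

Because to_text leaves unicode values untouched, it can be applied to both the database column and the get_text() result without double-decoding either one.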
