Unicode Encode Error when writing to file

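I know this is a frequently recurring issue with Python 2.x. I'm currently working with Python 2.7. The text content I want to write to a tab-delimited text file is pulled from a SQL Server 2012 database table whose server collation is set to SQL_Latin1_General_CP1_CI_AS.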

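The exceptions I get tend to vary, but they are essentially: UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 57: ordinal not in range(128)

or UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 308: ordinal not in range(128)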

Here is what I typically turn to, but it still results in an error:

from kitchen.text.converters import getwriter
with open("output.txt", 'a') as myfile:
    #content processing done here
    #title is text pulled directly from database
    #just_text is content pulled from raw html inserted into beautiful soup
    #    and using its .get_text() to just retrieve the text content
    UTF8Writer = getwriter('utf8')
    myfile = UTF8Writer(myfile)
    myfile.write(text + '\t' + just_text)

I've also tried:

# also performed for just_text and still resulting in exceptions
title = title.encode('utf-8')

and

title = title.decode('latin-1')
title = title.encode('utf-8')

and

title = unicode(title, 'latin-1')

I've also replaced the with open() line with:

with codecs.open("codingOutput.txt", mode='a', encoding='utf-8') as myfile:

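I'm not sure what I'm doing wrong, or what I'm forgetting to do. I've also swapped encode with decode, just in case I had the encoding/decoding backwards. No success.

Any help would be greatly appreciated.

Update

I've added print repr(title) and print repr(just_text), both when I first retrieve title from the database and when I call .get_text(). Not sure how much this helps, but....

For title I get: <type 'str'>. For just_text I get: <type 'unicode'>.

Errors

These are the errors I get from content pulled through the BeautifulSoup Summary() function.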

C:\Python27\lib\site-packages\bs4\dammit.py:269: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
C:\Python27\lib\site-packages\bs4\dammit.py:273: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
C:\Python27\lib\site-packages\bs4\dammit.py:277: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:3] == b'\xef\xbb\xbf':
C:\Python27\lib\site-packages\bs4\dammit.py:280: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:4] == b'\x00\x00\xfe\xff':
C:\Python27\lib\site-packages\bs4\dammit.py:283: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:4] == b'\xff\xfe\x00\x00':

ValueError: Expected a bytes object, not a unicode object

The traceback portion is:

File <myfile>, line 39, in <module>
  summary_soup = BeautifulSoup(page_summary)
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 193, in __init__
  self.builder.prepare_markup(markup, from_encoding)):
File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 99, in prepare_markup
  for encoding in detector.encodings:
File "C:\Python27\lib\site-packages\bs4\dammit.py", line 256, in encodings
  self.chardet_encoding = chardet_dammit(self.markup)
File "C:\Python27\lib\site-packages\bs4\dammit.py", line 31, in chardet_dammit
  return chardet.detect(s)['encoding']
File "C:\Python27\lib\site-packages\chardet\__init__.py", line 25, in detect
  raise ValueError('Expected a bytes object, not a unicode object')
ValueError: Expected a bytes object, not a unicode object

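Here are some suggestions. Everything has an encoding. Your problem is just to find out the encodings of the different parts, convert them to a common format, and write the result to a file.

I suggest choosing utf-8 as the output encoding.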

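with open('output', 'w') as f:
    # title is a byte string from the database; decode it to unicode first
    unistr = title.decode("latin-1") + u"\t" + just_text
    # encode the combined unicode text once, right before writing
    f.write(unistr.encode("utf-8"))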

Beautiful Soup's get_text returns Python's unicode type. decode("latin-1") should turn your database content into a unicode type, which is then joined with a tab before the utf-8 encoded bytes are written.
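Since title arrives as a byte string while just_text is already unicode, it can help to decode only the values that actually are bytes. The following is a minimal sketch of that idea, not a definitive implementation: the helper names to_text and make_row and the sample values are hypothetical, and latin-1 is only the assumption made above about the database encoding.

```python
# -*- coding: utf-8 -*-
def to_text(value, encoding="latin-1"):
    """Return value as Unicode text; decode only if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def make_row(title, just_text, encoding="latin-1"):
    """Join two fields with a tab and return the row as UTF-8 bytes."""
    row = to_text(title, encoding) + u"\t" + to_text(just_text, encoding) + u"\n"
    return row.encode("utf-8")


# Hypothetical sample values: a byte string as a database driver might
# return it, and Unicode text as .get_text() returns it.
title = b"Caf\xe9\xa0menu"
just_text = u"r\xe9sum\xe9"

row_bytes = make_row(title, just_text)
```

Guarding the decode with a type check avoids calling .decode() on a value that is already unicode, which in Python 2 would trigger an implicit ASCII encode first.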

The problem is that you are mixing bytes and Unicode text:

>>> u'\xe9'.encode('utf-8') + '\t' + u'x'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

where u'\xe9'.encode('utf-8') is a bytestring that encodes the é character (U+00e9) using the utf-8 encoding, and u'x' is Unicode text that contains the x character (U+0078).
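When a byte string is concatenated with Unicode text, Python 2 first tries to decode the bytes with the ascii codec, which is where the UnicodeDecodeError comes from. The sketch below makes that implicit step explicit so it can be observed directly; the variable names are illustrative only.

```python
encoded = u"\xe9".encode("utf-8")  # the two bytes 0xc3 0xa9

# Python 2 performs the equivalent of this decode implicitly when a byte
# string is concatenated with Unicode text.
try:
    encoded.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding explicitly with the codec that produced the bytes succeeds,
# and the result can be joined with other Unicode text safely.
text = encoded.decode("utf-8") + u"\t" + u"x"
```

Here message holds the same kind of "'ascii' codec can't decode byte 0xc3" text as the traceback above, while text is plain Unicode.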

The solution is to use Unicode text:

>>> print u'\xe9' + '\t' + u'x'
é       x

BeautifulSoup accepts Unicode input:

>>> import bs4
>>> bs4.BeautifulSoup(u'\xe9' + '\t' + u'x')
<html><body><p>é        x</p></body></html>
>>> bs4.__version__
'4.2.1'

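Avoid unnecessary conversions to/from Unicode. Decode input data to Unicode once, use Unicode everywhere to represent text in your program, and encode the output to bytes at the end (if necessary):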

with open('output.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))
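For the tab-delimited output in the question, one possible way to leave the final encoding step to the file object is io.open from the standard library, which accepts Unicode text and encodes it on write. This is only a sketch under assumptions: the rows list is hypothetical sample data that has already been decoded to Unicode, and the temporary path exists just to keep the example self-contained.

```python
import io
import os
import tempfile

# Hypothetical rows that are already Unicode text.
rows = [(u"Caf\xe9", u"r\xe9sum\xe9"), (u"na\xefve", u"x")]

# A temporary path is used only to keep the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "output.txt")

# The file object encodes each Unicode string as UTF-8 when writing.
with io.open(path, "w", encoding="utf-8") as out:
    for title, just_text in rows:
        out.write(title + u"\t" + just_text + u"\n")

# Reading back in binary mode shows the encoded bytes on disk.
with io.open(path, "rb") as raw:
    data = raw.read()
```

In Python 2, a file opened this way rejects byte strings with a TypeError, which surfaces a stray undecoded value immediately instead of failing later on an implicit ascii conversion.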