Unicode Encode Error when writing to file
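I know this is a frequently recurring problem with Python 2.x. I'm currently working with Python 2.7. The text I want to write to a tab-delimited text file is pulled from a SQL Server 2012 database table whose server collation is set to SQL_Latin1_General_CP1_CI_AS.
The exceptions I get tend to vary, but essentially they are:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 57: ordinal not in range(128)
or
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 308: ordinal not in range(128)
Here is what I usually turn to, but it still results in an error: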
from kitchen.text.converters import getwriter
with open("output.txt", 'a') as myfile:
    #content processing done here
    #title is text pulled directly from database
    #just_text is content pulled from raw html inserted into beautiful soup
    # and using its .get_text() to just retrieve the text content
    UTF8Writer = getwriter('utf8')
    myfile = UTF8Writer(myfile)
    myfile.write(text + '\t' + just_text)
I have also tried:
# also performed for just_text and still resulting in exceptions
title = title.encode('utf-8')
and
title = title.decode('latin-1')
title = title.encode('utf-8')
and
title = unicode(title, 'latin-1')
I have also replaced with open() with:
with codecs.open("codingOutput.txt", mode='a', encoding='utf-8') as myfile:
I'm not sure what I'm doing wrong or what I'm forgetting to do. I have also swapped encode and decode in case I had the encoding/decoding backwards all along. No success.
Any help would be greatly appreciated.
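Update
I added print repr(title) and print repr(just_text), both right after I first retrieve title from the database and right after calling .get_text(). Not sure how much this helps, but...
For title I get: <type 'str'>
For just_text I get: <type 'unicode'>
Errors
These are the errors I get from content extracted by the BeautifulSoup Summary() function.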
C:\Python27\lib\site-packages\bs4\dammit.py:269: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
C:\Python27\lib\site-packages\bs4\dammit.py:273: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
C:\Python27\lib\site-packages\bs4\dammit.py:277: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:3] == b'\xef\xbb\xbf':
C:\Python27\lib\site-packages\bs4\dammit.py:280: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == b'\x00\x00\xfe\xff':
C:\Python27\lib\site-packages\bs4\dammit.py:283: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == b'\xff\xfe\x00\x00':
ValueError: Expected a bytes object, not a unicode object
The traceback portion is:
  File <myfile>, line 39, in <module>
    summary_soup = BeautifulSoup(page_summary)
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 193, in __init__
    self.builder.prepare_markup(markup, from_encoding)):
  File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 99, in prepare_markup
    for encoding in detector.encodings:
  File "C:\Python27\lib\site-packages\bs4\dammit.py", line 256, in encodings
    self.chardet_encoding = chardet_dammit(self.markup)
  File "C:\Python27\lib\site-packages\bs4\dammit.py", line 31, in chardet_dammit
    return chardet.detect(s)['encoding']
  File "C:\Python27\lib\site-packages\chardet\__init__.py", line 25, in detect
    raise ValueError('Expected a bytes object, not a unicode object')
ValueError: Expected a bytes object, not a unicode object
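Here are some suggestions. Everything has an encoding. Your problem is just to find out the encodings of the different parts, convert them to a common format, and write the result to a file.
I recommend choosing utf-8 as the output encoding.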
with open('output', 'wb') as f:
    # title is a byte string from the database; just_text is already unicode
    unistr = title.decode("latin-1") + u"\t" + just_text
    f.write(unistr.encode("utf-8"))
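Beautiful Soup's get_text returns Python's unicode type. decode("latin-1") should turn your database content into the unicode type as well, so the two can be joined with a tab before the result is encoded as utf-8 and the bytes are written.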
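The decode, join, encode sequence described above can be sketched as a small helper. This is a minimal sketch, not the answerer's code: `ensure_unicode` and `make_row` are hypothetical names, the sample values are made up, and "latin-1" is taken from the answer as the assumed source encoding (SQL_Latin1_General_CP1 collations use code page 1252, so "cp1252" may be the closer match for some data). It is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: normalize every value to Unicode text, then encode exactly once.

def ensure_unicode(value, encoding="latin-1"):
    """Decode byte strings with the assumed source encoding; pass text through."""
    if isinstance(value, bytes):       # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value

def make_row(title, just_text):
    """Join the two fields with a tab and return UTF-8 encoded bytes."""
    row = ensure_unicode(title) + u"\t" + ensure_unicode(just_text)
    return row.encode("utf-8")

# Hypothetical sample values standing in for the database and get_text() results.
row_bytes = make_row(b"Caf\xe9\xa0menu", u"r\xe9sum\xe9")
```

Checking the type before decoding avoids calling decode on a value that is already unicode, which on Python 2 would trigger an implicit ASCII encode first.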
The problem is that you are mixing bytes and Unicode text:
>>> u'\xe9'.encode('utf-8') + '\t' + u'x'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
where u'\xe9'.encode('utf-8') is a bytestring that encodes the é character (U+00e9) using the utf-8 encoding, and u'x' is Unicode text that contains the x character (U+0078).
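The exception is a UnicodeDecodeError even though the code only calls encode because, on Python 2, concatenating a byte string with Unicode text first decodes the bytes with the default 'ascii' codec. A minimal sketch that performs that decode step explicitly (variable names are illustrative; it runs on Python 2.7 and Python 3):

```python
# -*- coding: utf-8 -*-
# Sketch: make explicit the ASCII decode that Python 2 applies implicitly
# when a byte string is concatenated with Unicode text.
encoded = u"\xe9".encode("utf-8")      # byte string: 0xc3 0xa9

try:
    encoded.decode("ascii")            # the step Python 2 attempts implicitly
    failed_byte = None
    message = None
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte that is not valid ASCII
    failed_byte = bytearray(encoded)[exc.start]
    message = str(exc)

# Decoding with the codec that actually produced the bytes succeeds,
# and the result can be joined with other Unicode text safely.
joined = encoded.decode("utf-8") + u"\t" + u"x"
```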
The solution is to use Unicode text:
>>> print u'\xe9' + '\t' + u'x'
é x
BeautifulSoup accepts Unicode input:
>>> import bs4
>>> bs4.BeautifulSoup(u'\xe9' + '\t' + u'x')
<html><body><p>é x</p></body></html>
>>> bs4.__version__
'4.2.1'
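Avoid unnecessary conversions to/from Unicode. Decode input data to Unicode once, use it everywhere in your program to represent text, and encode the output into bytes at the end (if necessary):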
with open('output.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))
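For the tab-delimited file from the question, the same "decode once, encode at the end" idea might look like the sketch below. It swaps in `io.open` (available since Python 2.6, and the same as the built-in `open` on Python 3) so that the file object does the encoding. The function name `write_rows`, the `source_encoding` parameter, and the sample rows are assumptions for illustration, not part of the original code.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

def write_rows(path, rows, source_encoding="latin-1"):
    """Append (title, just_text) pairs as UTF-8, tab-delimited lines."""
    # io.open returns a text-mode file that encodes on write, so the
    # program only ever hands it Unicode text.
    with io.open(path, "a", encoding="utf-8") as out:
        for title, just_text in rows:
            if isinstance(title, bytes):   # decode once, at the input boundary
                title = title.decode(source_encoding)
            out.write(title + u"\t" + just_text + u"\n")

# Hypothetical rows: a byte-string title (as from the database) and a
# title that is already Unicode text.
demo_path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_rows(demo_path, [(b"Caf\xe9", u"x"), (u"\u2019", u"y")])

# Read the file back as text to confirm the round trip.
with io.open(demo_path, encoding="utf-8") as check:
    written = check.read()
```

Because the file object owns the encoding, passing it a byte string fails immediately with a TypeError, which surfaces mixed types at the write call rather than as an implicit ASCII conversion.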