UnicodeEncodeError in BeautifulSoup webscraper
I am getting a Unicode encoding error with the following code for a simple web scraper.
print 'JSON scraper initializing'

from bs4 import BeautifulSoup
import json
import requests
import geocoder

# Set page variable
page = 'https://www.bandsintown.com/?came_from=257&page='
urlBucket = []
for i in range (1,3):
    uniqueUrl = page + str(i)
    urlBucket.append(uniqueUrl)

# Build response container
responseBucket = []
for i in urlBucket:
    uniqueResponse = requests.get(i)
    responseBucket.append(uniqueResponse)

# Build soup container
soupBucket = []
for i in responseBucket:
    individualSoup = BeautifulSoup(i.text, 'html.parser')
    soupBucket.append(individualSoup)

# Build events container
allSanFranciscoEvents = []
for i in soupBucket:
    script = i.find_all("script")[4]
    eventsJSON = json.loads(script.text)
    allSanFranciscoEvents.append(eventsJSON)

with open("allSanFranciscoEvents.json", "w") as writeJSON:
    json.dump(allSanFranciscoEvents, writeJSON, ensure_ascii=False)
print ('end')
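The odd thing is that sometimes this code works and doesn't give an error. It has to do with the for i in range line of the code. For example, if I put in (2,4) for the range, it works fine. If I change it to 1,3, it reads: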
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 12: ordinal not in range(128)
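Can anyone tell me how to fix this issue within my code? If I print allSanFranciscoEvents, it is reading in all the data, so I believe the issue is happening in the final piece of code, with the JSON dump. Thanks so much.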
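Best fix
Use Python 3! Python 2 is going EOL very soon. New code written in legacy Python today will have a very short shelf life.
The only thing I had to change to make your code work in Python 3 was to call the print() function instead of using the print keyword. Your example code then ran with no error.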
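For the final write step, a minimal Python 3 sketch might look like the following. The helper name write_events and the sample data are made up for illustration and are not part of the original scraper. Passing encoding="utf-8" explicitly is my own suggestion: in Python 3, open() otherwise falls back to a locale-dependent default encoding, so the same script could still fail on a machine whose locale is not UTF-8.

```python
import json
import os
import tempfile


def write_events(events, path):
    """Write events as JSON, keeping non-ASCII characters readable."""
    # An explicit encoding avoids relying on the platform's default.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(events, f, ensure_ascii=False)


# Hypothetical sample data standing in for the scraped events.
sample_events = [[{"name": "M\u00f8", "venue": "Caf\u00e9 du Nord"}]]
out_path = os.path.join(tempfile.mkdtemp(), "allSanFranciscoEvents.json")
write_events(sample_events, out_path)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    loaded = json.load(f)
```

If this works as intended, loaded should compare equal to sample_events, and the file on disk should contain the names as UTF-8 text rather than \uXXXX escapes.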
Sticking with Python 2
The odd thing is that sometimes, this code works, and doesn't give an
error. It has to do with the for i in range line of the code. For
example, if I put in (2,4) for the range, it works fine.
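That's because you are requesting different pages with a different range, and not every page contains a character that can't be converted to str using the ascii codec. I had to go to page 5 of the response to get the same error that you did. In my case, it was the artist name u'Mø' that caused the issue. So here's a one-liner that reproduces the issue: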
>>> str(u'Mø')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 0: ordinal not in range(128)
Your error explicitly singles out the character u'\xe9':
>>> str(u'\xe9')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
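Exactly the same issue, just a different character. The character is Latin small letter e with acute. Python is attempting to use the default encoding, 'ascii', to convert the Unicode string to str, but 'ascii' has no representation for that code point.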
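The same limitation can be seen without any scraping. The sketch below is written for Python 3, where the conversion has to be requested explicitly with encode(); the variable names are arbitrary. It only illustrates that the ascii codec rejects the character while utf-8 accepts it.

```python
# "\u00e9" is LATIN SMALL LETTER E WITH ACUTE, the character from the error.
name = "Caf\u00e9"

try:
    name.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    # The ascii codec only covers code points 0 to 127.
    ascii_failed = True
    print(exc)

# UTF-8 can represent every Unicode code point; this one takes two bytes.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)  # b'Caf\xc3\xa9'
```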
I believe the issue is happening in the final piece of code, with the
JSON dump.
Yes, it is:
>>> with open('tmp.json', 'w') as f:
...     json.dump(u'\xe9', f, ensure_ascii=False)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 190, in dump
    fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
And from the traceback, you can see that it is actually coming from writing to the file (fp.write(chunk)).
file.write() writes a string to a file, but u'\xe9' is a unicode object. The error message: 'ascii' codec can't encode character... tells us that Python is trying to encode that unicode object to turn it into a str type, so it can write it to the file. Encoding the unicode string implicitly uses the "default string encoding", which in Python 2 is 'ascii' (see sys.getdefaultencoding()).
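It may help to see why the problem only appears together with ensure_ascii=False. By default the json module escapes every non-ASCII character, so its output always fits in ASCII; with ensure_ascii=False the character is emitted as-is and has to be encoded by whatever writes it. The sketch below is for Python 3 and uses arbitrary variable names.

```python
import json

# Default behaviour (ensure_ascii=True): non-ASCII characters become
# \uXXXX escapes, so the result is plain ASCII text.
escaped = json.dumps("\u00e9")

# With ensure_ascii=False the character is kept literally.
literal = json.dumps("\u00e9", ensure_ascii=False)

print(escaped.encode("ascii"))  # works, nothing outside ASCII

try:
    literal.encode("ascii")
    literal_is_ascii = True
except UnicodeEncodeError:
    # Same failure as in the traceback, just triggered explicitly.
    literal_is_ascii = False
```

So one option that sidesteps the encoding question entirely is to drop ensure_ascii=False, at the cost of a less readable file.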
To fix it, don't leave it up to Python to use the default encoding:
>>> with open('tmp.json', 'w') as f:
...     json.dump(u'\xe9'.encode('utf-8'), f, ensure_ascii=False)
...
# No error :)
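In your specific example, you can fix the intermittent error by changing this:
allSanFranciscoEvents.append(eventsJSON)
to this:
allSanFranciscoEvents.append(eventsJSON.encode('utf-8'))
That way, you are explicitly using the 'utf-8' codec to convert the Unicode strings to str, so that Python doesn't try to apply the default encoding, 'ascii', when writing to the file.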
eventsJSON is an object, so you can't call eventsJSON.encode('utf-8') on it. To write the file as utf-8 or unicode in Python 2.7, you can use codecs, or write it using the binary (wb) flag.
with open("allSanFranciscoEvents.json", "wb") as writeJSON:
    jsStr = json.dumps(allSanFranciscoEvents)
    # encode() is needed because a file opened in binary mode expects bytes
    writeJSON.write(jsStr.encode('utf-8'))
print ('end')

# and read it normally
with open("allSanFranciscoEvents.json", "r") as readJson:
    data = json.load(readJson)
    print(data[0][0]["startDate"])
    # 2019-02-04
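The codecs route mentioned above could look roughly like this. It is a sketch written with Python 3 in mind; the codecs module also exists in Python 2.7, but I have not verified this exact snippet there. The sample data and the temporary path are hypothetical stand-ins for the scraped events.

```python
import codecs
import json
import os
import tempfile

# Hypothetical sample data shaped like the scraped result (list of lists).
events = [[{"startDate": "2019-02-04", "name": "M\u00f8"}]]
path = os.path.join(tempfile.mkdtemp(), "allSanFranciscoEvents.json")

# codecs.open() returns a file object that encodes text on write.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps(events, ensure_ascii=False))

# Reading through the same codec decodes the bytes back into text.
with codecs.open(path, "r", encoding="utf-8") as f:
    data = json.load(f)

print(data[0][0]["startDate"])
```

Compared with the wb variant, this keeps the non-ASCII characters readable in the file instead of storing them as \uXXXX escapes.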