How to deal with a BeautifulSoup RecursionError (or parse error)
I have a bunch of HTML files that I am trying to read with BeautifulSoup. For some of them I get an error. I have tried decoding and encoding, but I cannot find the problem. Thank you very much.
Here is an example:
import requests
from bs4 import BeautifulSoup
new_text = requests.get('https://www.sec.gov/Archives/edgar/data/1723069/000121390018016357/0001213900-18-016357.txt')
soup = BeautifulSoup(new_text.content.decode('utf-8','ignore').encode("utf-8"),'lxml')
print(soup)
On Jupyter Notebook, I get a dead kernel error.
On PyCharm, I get the following error (it repeats, so some of it has been removed, but it is very long):
Traceback (most recent call last):
File "C:/Users/oe/.PyCharmCE2019.1/config/scratches/scratch_5.py", line 5, in <module>
print(soup)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1099, in __unicode__
return self.decode()
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\__init__.py", line 566, in decode
indent_level, eventual_encoding, formatter)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
indent_contents, eventual_encoding, formatter)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1257, in decode_contents
formatter))
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
indent_contents, eventual_encoding, formatter)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1257, in decode_contents
formatter))
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
indent_contents, eventual_encoding, formatter)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1257, in decode_contents
formatter))
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
indent_contents, eventual_encoding, formatter)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1254, in decode_contents
text = c.output_ready(formatter)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 745, in output_ready
output = self.format_string(self, formatter)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 220, in format_string
if isinstance(formatter, Callable):
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\abc.py", line 190, in __instancecheck__
subclass in cls._abc_negative_cache):
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\_weakrefset.py", line 75, in __contains__
return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison
Frankly, I'm not sure what the underlying problem with your code is (although I did not get a dead kernel in a Jupyter notebook), but this seems to work:
url = 'https://www.sec.gov/Archives/edgar/data/1723069/000121390018016357/0001213900-18-016357.txt'
import requests
from bs4 import BeautifulSoup
new_text = requests.get(url)
soup = BeautifulSoup(new_text.text,'lxml')
print(soup.text)
Note that in soup, new_text.content is replaced by new_text.text, I had to remove the encode/decode calls, and the print command had to be changed from print(soup) (which raised the error) to print(soup.text), which works fine. Perhaps someone smarter can explain why...
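One plausible explanation, sketched below: print(soup) serializes the tree through Tag's recursive decode() machinery (visible in the traceback), using several stack frames per nesting level, while soup.text walks the tree with an iterative generator. This is a minimal reproduction with a synthetic document; the nesting depth of 5000 and the use of the html.parser backend are assumptions for illustration, not taken from the SEC filing:

```python
from bs4 import BeautifulSoup

# Synthetic stand-in for a deeply nested filing: 5000 nested <div> tags,
# chosen to exceed Python's default recursion limit (usually 1000).
depth = 5000
html = "<div>" * depth + "hello" + "</div>" * depth
soup = BeautifulSoup(html, "html.parser")

# soup.get_text() / soup.text walk the tree via an iterative generator,
# so they work even at the default recursion limit:
print(soup.get_text())  # -> hello

# str(soup) / print(soup) serialize recursively (one or more stack
# frames per nesting level), which can raise RecursionError on a
# deep enough tree, matching the traceback above:
try:
    print(len(str(soup)))
except RecursionError:
    print("RecursionError: maximum recursion depth exceeded")
```

If that reading is right, the answer's fix works because it never asks BeautifulSoup to re-serialize the whole deeply nested tree.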
Another option that works is:
import urllib.request
response = urllib.request.urlopen(url)
new_text2 = response.read()
soup = BeautifulSoup(new_text2,'lxml')
print(soup.text)
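If you really need print(soup) itself, a blunter workaround (an assumption about the failure mode, not part of the original answer) is raising Python's recursion limit before serializing. Beware: set it too high and the interpreter can overflow the C stack and crash the whole process, which would also be consistent with the dead Jupyter kernel. A minimal sketch of the mechanism, without BeautifulSoup:

```python
import sys

# Roughly one stack frame is needed per recursion level, and the
# default limit is typically 1000:
print(sys.getrecursionlimit())

def depth_fits(depth):
    """Return True if a recursion `depth` frames deep fits in the limit."""
    def rec(n):
        return 0 if n == 0 else 1 + rec(n - 1)
    try:
        rec(depth)
        return True
    except RecursionError:
        return False

print(depth_fits(500))    # True at the default limit
print(depth_fits(5000))   # False at the default limit

# Raising the limit lets deeper recursion (and hence deeper
# serialization) complete, at the cost of more stack usage:
sys.setrecursionlimit(20000)
print(depth_fits(5000))   # True after raising the limit
```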