Maximum recursion depth exceeded while calling a Python object: the same code works for some firms but not others
I have the following code:
import requests
import urllib
from bs4 import BeautifulSoup
import re
master_data=[{'cik_number': '1556179', 'company_name': 'RMR Industrials, Inc.', 'form_id': '10-K', 'date': '20200103', 'file_url': 'https://www.sec.gov/Archives/edgar/data/1556179/0001104659-20-000861.txt'}]
sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")
def identify_sentences(input_text:str):
    """Returns all sentences in the input text"""
    sentences = re.findall(sentence_regex, input_text)
    return sentences
rdterms=['research and development','R&D','product development','research, development',
         'research, engineering, and development','research and product development']
# creates a list of compiled regex expressions, one per R&D term
rdterms_regex=[re.compile(r'\b' + term + r'\b', re.IGNORECASE)
               for term in rdterms]
def rdsentence(sentence:str):
    """Checks whether a sentence is R&D-oriented."""
    for term in rdterms_regex:
        if term.search(sentence):
            return True
    return False
for entry in master_data:
    path=entry['file_url']
    r=requests.get(path, headers={"User-Agent": "b2g"})
    content=r.content.decode('utf8')
    soup=BeautifulSoup(content, "html5lib")
    soup=str(soup)
    entry['count']=0
    sentences=identify_sentences(soup)
    for sentence in sentences:
        if rdsentence(sentence) is True:
            entry['count']=entry['count']+1
        else:
            continue
print(master_data)
len(master_data)
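This fails with the error in the title: maximum recursion depth exceeded while calling a Python object.
If I change the master_data line to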
master_data=[{'cik_number': '1041588', 'company_name': 'ACCESS-POWER INC', 'form_id': '10-K', 'date': '20200102', 'file_url': 'https://www.sec.gov/Archives/edgar/data/1041588/0001041588-20-000001.txt'}]
everything works fine.
Why does this code work for some firms but not others? How should I modify the code? Thanks!
The problem is that str(soup) is not well defined here and throws html5lib into an endless loop.
The correct way is:
soup = soup.text
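To show how that change fits into the rest of the question's code, here is a minimal sketch, not a definitive implementation. The helper names filing_text and count_rd_sentences are hypothetical, and the use of re.escape on the terms is an addition that is not in the original code. The bs4 import sits inside filing_text so that the counting helper can be tried on plain text without BeautifulSoup installed.

```python
import re

# Same sentence pattern as in the question.
SENTENCE_RE = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")

RD_TERMS = [
    "research and development", "R&D", "product development",
    "research, development", "research, engineering, and development",
    "research and product development",
]
# re.escape is an addition: it keeps any regex metacharacters in a term literal.
RD_TERM_RES = [
    re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    for term in RD_TERMS
]


def filing_text(markup: str, parser: str = "html5lib") -> str:
    """Parse the markup and return only its text (soup.text), not str(soup)."""
    # Hypothetical helper; imported lazily so the rest runs without bs4.
    from bs4 import BeautifulSoup
    return BeautifulSoup(markup, parser).text


def count_rd_sentences(text: str) -> int:
    """Count sentences that mention at least one R&D term."""
    return sum(
        1
        for sentence in SENTENCE_RE.findall(text)
        if any(rx.search(sentence) for rx in RD_TERM_RES)
    )


sample = (
    "We spent heavily on R&D in 2019. "
    "Revenue grew 3.5 percent. "
    "Product development continued."
)
print(count_rd_sentences(sample))
```

Inside the question's loop this would be used as entry['count'] = count_rd_sentences(filing_text(content)), which also removes the need for the manual counter.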