Download PDFs: Remote end closed connection without response
I want to collect text from thousands of PDF files using Python. Extracting the text from the PDFs works fine, but my code stops at random during execution (not always on the same PDF) with this error:
http.client.RemoteDisconnected: Remote end closed connection without response
I am using urllib. I would like to know how to avoid this error and, if I can't, how to catch it (even a bare except: doesn't work).
The code I am using:
df = pd.read_csv(csv_path, sep=";", error_bad_lines=False)
for i, row in df.iterrows():
    print(row['year'], "- adding ", row['title'])
    request.urlretrieve(row['pdfarticle'], "_tmp.pdf")
    try:
        row['fullarticle'] = convert_pdf_to_txt("_tmp.pdf")
    except TypeError:
        row['fullarticle'] = ""
        pass
    os.remove("_tmp.pdf")
print("Done. Saving csv...")
df.to_csv("my_structured_articles.csv")
print("Done. Head(10) : ")
print(df.head(10))
return df
You need to put the try/except block here -
import http.client

for i, row in df.iterrows():
    print(row['year'], "- adding ", row['title'])
    try:
        request.urlretrieve(row['pdfarticle'], "_tmp.pdf")
    except http.client.RemoteDisconnected:
        continue  # this will skip the URL throwing the error
You can find the documentation for the exception here.
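Putting it together, the loop from the question could look like this (just a sketch; convert_pdf_to_txt and the CSV column names are taken from the question, and the temporary file "_tmp.pdf" is reused as in the original):

import http.client
import os
from urllib import request

for i, row in df.iterrows():
    print(row['year'], "- adding ", row['title'])
    try:
        request.urlretrieve(row['pdfarticle'], "_tmp.pdf")
    except http.client.RemoteDisconnected:
        continue  # skip this URL and move on to the next row
    try:
        row['fullarticle'] = convert_pdf_to_txt("_tmp.pdf")
    except TypeError:
        row['fullarticle'] = ""
    os.remove("_tmp.pdf")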
First, you should put request.urlretrieve(row['pdfarticle'],"_tmp.pdf") inside a try/except block.
Second, if the problem is just an intermittent network issue, you should retry a few times, like this:
retry = MAX_TRIES  # MAX_TRIES: maximum number of download attempts, e.g. 3
while retry != 0:
    try:
        request.urlretrieve(row['pdfarticle'], "_tmp.pdf")
        break
    except http.client.RemoteDisconnected:
        retry -= 1
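If the remote server is simply dropping connections under load, it can also help to wait a little between attempts. A minimal sketch, where MAX_TRIES and RETRY_DELAY are placeholder values you would choose yourself:

import http.client
import time
from urllib import request

MAX_TRIES = 3      # placeholder: how many times to attempt the download
RETRY_DELAY = 5    # placeholder: seconds to wait between attempts

retry = MAX_TRIES
while retry != 0:
    try:
        request.urlretrieve(row['pdfarticle'], "_tmp.pdf")
        break
    except http.client.RemoteDisconnected:
        retry -= 1
        time.sleep(RETRY_DELAY)  # give the remote end a moment before retrying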