Downloading PDFs: Remote end closed connection without response

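I want to use Python to collect text from thousands of PDF files. Extracting the text from a PDF works fine, but my code stops at random points during execution (not on the same PDF each time) with this error: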

http.client.RemoteDisconnected: Remote end closed connection without response

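I am using urllib. I would like to know how to avoid this error or, if that is not possible, how to catch it (even a bare except: does not work).

The code I am using: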

df = pd.read_csv(csv_path, sep=";", error_bad_lines=False)

for i,row in df.iterrows():
    print(row['year'], "- adding ",row['title'])
    request.urlretrieve(row['pdfarticle'],"_tmp.pdf")
    try:
        row['fullarticle'] = convert_pdf_to_txt("_tmp.pdf")
    except TypeError:
        row['fullarticle'] = ""
        pass

os.remove("_tmp.pdf")
print("Done. Saving csv...")
df.to_csv("my_structured_articles.csv")
print("Done. Head(10) : ")
print(df.head(10))
return df

You need to put the try/except block around the download call:

import http.client
from urllib import request

for i, row in df.iterrows():
    print(row['year'], "- adding ", row['title'])
    try:
        request.urlretrieve(row['pdfarticle'], "_tmp.pdf")
    except http.client.RemoteDisconnected:
        continue  # skip the URL that raised the error
    # ... the PDF-to-text conversion from the question goes here

The exception is documented in the Python documentation for the http.client module.
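A minimal sketch of the skip-on-error pattern, assuming a plain list of URLs rather than the DataFrame from the question. The names download_each, process, and fetch are hypothetical and exist only for illustration; fetch defaults to request.urlretrieve and can be swapped out to try the logic without a network. Recording the failed URLs makes it possible to revisit them later instead of losing them silently.

```python
import http.client
from urllib import request


def download_each(urls, process, dest="_tmp.pdf", fetch=request.urlretrieve):
    """Download each URL once and return the list of URLs that failed.

    `process(url, dest)` is a hypothetical callback, called only after a
    successful download (for example, to extract the text from the PDF).
    """
    failed = []
    for url in urls:
        try:
            fetch(url, dest)
        except http.client.RemoteDisconnected:
            failed.append(url)  # remember the URL instead of crashing
            continue            # move on to the next URL
        process(url, dest)      # only reached when the download succeeded
    return failed
```

Because the try block wraps the download itself, the exception is caught where it is actually raised. In the question, the try block only covered the conversion step, which is why the except clause never fired.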

First, you should put request.urlretrieve(row['pdfarticle'],"_tmp.pdf") inside a try/except block.

Second, if the error is caused only by transient network problems, you should retry a few times. Something like this:

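MAX_TRIES = 3  # assumed value; adjust as needed
retry = MAX_TRIES
while retry > 0:
    try:
        request.urlretrieve(row['pdfarticle'], "_tmp.pdf")
        break  # download succeeded, stop retrying
    except http.client.RemoteDisconnected:
        retry -= 1  # the server dropped the connection; try again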
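The retry idea can be wrapped in a small helper, sketched here under some assumptions: the function name retrieve_with_retries, its parameters, and the exponential back-off delays are illustrative choices, not part of urllib. The fetch parameter defaults to request.urlretrieve and is there only so the logic can be exercised without a network.

```python
import http.client
import time
import urllib.error
from urllib import request


def retrieve_with_retries(url, dest, max_tries=3, backoff=1.0,
                          fetch=request.urlretrieve):
    """Download `url` to `dest`, retrying when the connection fails.

    Returns True on success and False if every attempt failed.
    """
    for attempt in range(1, max_tries + 1):
        try:
            fetch(url, dest)
            return True
        except (http.client.RemoteDisconnected, urllib.error.URLError) as exc:
            # URLError also covers HTTP errors such as 404, which a retry
            # will not fix; narrow this tuple if that matters to you.
            print(f"attempt {attempt}/{max_tries} failed for {url}: {exc!r}")
            if attempt < max_tries:
                # Wait longer after each failure: backoff, 2*backoff, ...
                time.sleep(backoff * 2 ** (attempt - 1))
    return False
```

Inside the loop from the question, this could be used as: if not retrieve_with_retries(row['pdfarticle'], "_tmp.pdf"): continue. Pausing between attempts gives a server that is dropping connections under load some time to recover.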