Python hangs when looping through a long list of URLs using urllib.request
I wrote some code that loops through a list of URLs, opens each one with urllib.request, and then parses it with BeautifulSoup. The only problem is that the list is quite long (around 5000 URLs) and the code runs successfully for roughly 200 URLs before hanging indefinitely. Is there a way to either a) skip to the next URL after a certain amount of time, e.g. 30 seconds, or b) retry opening a URL a certain number of times before moving on to the next item?
from bs4 import BeautifulSoup
import csv
from urllib.request import Request, urlopen

with open('csv_file.csv', 'r') as f:
    reader = csv.reader(f)
    urls_list = list(reader)

for j in range(0, len(urls_list)):
    url = ''.join(urls_list[j])
    id = url[-10:].replace(".html", "")
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    s = urlopen(req).read()
    soup = BeautifulSoup(s, "lxml")
Any suggestions would be greatly appreciated!
The documentation (Python 2) says:
The urllib2 module defines the following functions:
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
Open the URL url, which can be either a string or a Request object.
Adjust your code like this:
from urllib.error import HTTPError

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    s = urlopen(req, timeout=10).read()  # give up after 10 seconds
except HTTPError as e:
    print(str(e))  # print error detail (this may not be a timeout after all!)
    continue       # skip to the next element
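The timeout takes care of option (a). For option (b), retrying a URL a few times before giving up, something along these lines should work. This is only a sketch, not part of the original answer; the fetch helper, the retry count of 3, and the one-second pause are arbitrary choices:

import socket
import time
from urllib.request import Request, urlopen
from urllib.error import URLError

def fetch(url, retries=3, timeout=10):
    """Try to open `url` up to `retries` times; return the page bytes or None."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    for attempt in range(1, retries + 1):
        try:
            return urlopen(req, timeout=timeout).read()
        except (URLError, socket.timeout) as e:  # URLError also covers HTTPError
            print('attempt %d failed for %s: %s' % (attempt, url, e))
            time.sleep(1)  # short pause before the next attempt
    return None  # give up; the caller can skip this url

In your loop you would then call s = fetch(url) and continue to the next item when it returns None.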