My code sometimes hangs in response.read()

Question

for index in range(1,10):
    send_headers = {
                    'User-Agent':'Mozilla/5.0 (Windows NT 6.2;rv:16.0) Gecko/20100101 Firefox/16.0',
                    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                    'Connection':'keep-alive'
    }

    try:
        req=urllib2.Request(url,headers=send_headers)
        response=urllib2.urlopen(req)
        sleeptime=random.randint(1,30*index)
        time.sleep(sleeptime)
    except Exception, e:
        print e
        traceback.print_exc()
        sleeptime=random.randint(13,40*index)
        print url
        time.sleep(sleeptime)
        continue
    if response.getcode() != 200:
        continue
    else:
        break
return response.read()

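I've found that my code sometimes hangs on return response.read(). The program doesn't die, and there is no error or exception. I don't know why or how this happens. How can I fix it?

This is Python; I'm trying to download some images from the web.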

Answer 1

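My guess is that it hangs because the connection never times out.

urlopen accepts a timeout argument (urllib2.urlopen in Python 2.6+, urllib.request.urlopen in Python 3).

If it is not given, the socket module's global default timeout is used.

That default is None (socket.getdefaulttimeout() returns None), which means no timeout at all, so a stalled read can block forever.

So try this: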

response=urllib2.urlopen(req, timeout=3)

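Alternatively, set a process-wide default timeout for all new sockets (this works in both Python 2 and Python 3):

import socket
socket.setdefaulttimeout(3.0)

In any case, consider using requests instead of urllib2; note that it also needs an explicit timeout argument.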
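
To make the timeout advice concrete, here is a minimal sketch. It assumes Python 3, where urllib2 became urllib.request; the function name fetch and the choice to return None on a timeout are illustrative, not part of the original code. The point it shows is that a stalled read() raises socket.timeout once a timeout is set, instead of blocking silently.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=3.0):
    """Return the response body, or None if the server is too slow.

    The timeout applies to each blocking socket operation (the connect
    and every receive behind read()), not to the download as a whole.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except socket.timeout:
        # Raised when read() stalls waiting for more bytes.
        return None
    except urllib.error.URLError as exc:
        # A timeout while connecting arrives wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return None
        raise
```

In the retry loop from the question, a None result could be treated like any other failed attempt: sleep and try again. Because the timeout is per socket operation, a server that trickles one byte every couple of seconds can still keep the download alive for a long time.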

Answer 2

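response.read() reads the HTTP response body from the server. This can take a while, because the read has to wait for the bytes to arrive over the network.

Fetching a resource over the network takes time; there is no way around that.

That said, you can access the network in a non-blocking way and be notified when the data is available. This does not change the fact that fetching the resource takes time.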
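
One simple way to approximate this is sketched below, assuming Python 3. It is not true non-blocking I/O: the blocking read() still happens, but in a worker thread, and a callback is invoked when it finishes. The names fetch_in_background and on_done, and the fetcher parameter, are hypothetical and exist only for this illustration.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10):
    """Blocking download; meant to run in a worker thread."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_in_background(executor, url, on_done, fetcher=fetch):
    """Start the download and return immediately.

    on_done(url, data, error) is called when the download finishes,
    normally from the worker thread. error is None on success;
    data is None on failure.
    """
    future = executor.submit(fetcher, url)

    def _notify(done_future):
        error = done_future.exception()
        data = None if error else done_future.result()
        on_done(url, data, error)

    future.add_done_callback(_notify)
    return future
```

With a ThreadPoolExecutor as executor, calling fetch_in_background(executor, url, callback) returns right away, so the main thread can keep working while the images download. The timeout inside fetch is still needed; without it a stalled server would tie up a worker thread indefinitely.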

python

web-crawler