Maximizing bandwidth use in web crawler

Question

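I am currently working on a web crawler focused on news sites, but it takes far too long. I believe this is because the script opens one page at a time, scrapes it, and only then moves on to the next page, so it sits idle while each request is sent and the server's response comes back. Is there a way to open several pages at once and make the most of my bandwidth?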

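import re
from urllib.request import urlopen  # assumes Python 3

from bs4 import BeautifulSoup as bs

# urls, to_crawl, articles and urls_crawled are collections (sets, given the
# .add calls) and page_append is a helper; all are defined elsewhere in the script.

def get_links(url):
    html = urlopen(url)
    bsObj = bs(html, 'html.parser')
    # Queue every link that stays on the target site and has not been seen yet.
    for link in bsObj.find_all('a', href=re.compile(r"^(http://www1\.folha\.uol\.com\.br/)(.)*$")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in urls:
                urls.add(link.attrs['href'])
                to_crawl.add(link.attrs['href'])
    # Pages that carry an articleBody element are treated as articles.
    if bsObj.find(attrs={'itemprop': 'articleBody'}):
        articles.add(url)
        page_append(url)
        print(url)
    urls_crawled.add(url)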

Answer 1

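Your intuition is correct. Your code processes one page at a time, so you are not using all of the available bandwidth. You may want to use a dedicated library such as the one mentioned in the comments (scrapy.org), or look into threading: https://docs.python.org/2/library/threading.html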
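To illustrate the idea of overlapping downloads, here is a minimal sketch that uses the standard library's `concurrent.futures` thread pool rather than raw `threading`. It assumes Python 3; the names `fetch`, `fetch_all`, `fetch_one` and `max_workers` are made up for this example and are not part of the asker's script.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Download several pages at once; return {url: body} for the successes.

    fetch_one is a hypothetical hook so the downloader can be swapped out.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every download up front so the requests overlap.
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failing page should not stop the others.
                print("failed:", url, exc)
    return results
```

Because the work is dominated by waiting on the network, threads can overlap that waiting even though only one of them runs Python code at a time. The right `max_workers` value depends on your connection and on how much load the target site tolerates.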

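Note that using threads is not as simple as launching your code in several threads. You have to coordinate how they access shared state such as your articles list, so you will need something thread-safe, for example a queue: https://docs.python.org/2/library/queue.html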
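The coordination described above could look roughly like the sketch below. It assumes Python 3, where the module is named `queue`. The `crawl` function and its `fetch_links` parameter are hypothetical: `fetch_links(url)` stands in for something like the asker's `get_links`, rewritten to return the links it finds instead of mutating global sets.

```python
import queue
import threading


def crawl(seed_urls, fetch_links, num_workers=8):
    """Crawl from seed_urls with worker threads; return every URL seen.

    fetch_links(url) is a hypothetical callable that downloads one page
    and returns the URLs found on it.
    """
    to_crawl = queue.Queue()        # thread-safe work queue
    seen = set(seed_urls)           # shared set, only touched under seen_lock
    seen_lock = threading.Lock()
    for url in seen:
        to_crawl.put(url)

    def worker():
        while True:
            url = to_crawl.get()
            if url is None:         # sentinel: no more work
                to_crawl.task_done()
                return
            try:
                for link in fetch_links(url):
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    to_crawl.put(link)
            except Exception as exc:
                print("failed:", url, exc)
            finally:
                to_crawl.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    to_crawl.join()                 # blocks until every queued URL is handled
    for _ in threads:
        to_crawl.put(None)          # tell each worker to exit
    for t in threads:
        t.join()
    return seen
```

The queue hands each URL to exactly one worker, and the lock makes the "have we seen this link?" check and the insert a single atomic step, so two threads cannot queue the same page twice. A real crawler would also want a limit on the number of pages and some politeness delay per host, which this sketch leaves out.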

python

web-crawler

web-scraping