Scrapy CLOSESPIDER_PAGECOUNT setting doesn't work as it should

Question

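I am using Scrapy 1.0.3 and can't figure out how the CLOSESPIDER extension works. The command scrapy crawl domain_links --set=CLOSESPIDER_PAGECOUNT=1 correctly makes one request, but with a page count of two, scrapy crawl domain_links --set CLOSESPIDER_PAGECOUNT=2 makes a seemingly endless number of requests.

So please explain to me how it works, using a simple example.

This is my spider code: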

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# PathsSpiderItem is the item class from the asker's own project (its import is not shown).


class DomainLinksSpider(CrawlSpider):
    name = "domain_links"
    #allowed_domains = ["www.example.org"]
    # Start URLs need a scheme, otherwise Scrapy rejects them.
    start_urls = ["http://www.example.org/"]

    rules = (
        # Extract every link on www.example.org and parse it with the spider's method parse_page
        Rule(LinkExtractor(allow_domains="www.example.org"), callback='parse_page'),
    )

    def parse_page(self, response):
        print('<<< %s' % response.url)
        items = []

        # Collect one item per unique link found on this page.
        for link in LinkExtractor(allow_domains="www.example.org", unique=True).extract_links(response):
            item = PathsSpiderItem()
            item['url'] = link.url
            items.append(item)
        return items

It doesn't work even for this simple spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['karen.pl']
    start_urls = ['http://www.karen.pl']

    rules = (
        # Extract links on www.karen.pl and parse them with the spider's method parse_item
        # (a Rule with a callback does not follow links by default).
        Rule(LinkExtractor(allow_domains="www.karen.pl"), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()

        return item

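but here the number of requests is not infinite:

scrapy crawl example --set CLOSESPIDER_PAGECOUNT=1
'downloader/request_count': 1,

scrapy crawl example --set CLOSESPIDER_PAGECOUNT=2
'downloader/request_count': 17,

scrapy crawl example --set CLOSESPIDER_PAGECOUNT=3
'downloader/request_count': 19,

Maybe it is because of parallel downloading. Yes, with CONCURRENT_REQUESTS = 1 the CLOSESPIDER_PAGECOUNT setting works for the second example. I checked the first one as well and it works too. It looked almost infinite to me because a sitemap with many URLs (my items) was crawled as the next page :)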
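The combination that worked above can also be kept next to the spider instead of on the command line. A minimal sketch, assuming the values from the runs above; the name CLOSE_AFTER_TWO_PAGES is made up for illustration, while custom_settings is the Scrapy spider class attribute that would hold such a dict:

```python
# Hypothetical settings fragment (the variable name is illustrative only).
# Assigning it to `custom_settings` on a spider class applies it to that spider.
CLOSE_AFTER_TWO_PAGES = {
    # Ask the CloseSpider extension to stop after two responses.
    "CLOSESPIDER_PAGECOUNT": 2,
    # Download one page at a time so no extra requests are in flight
    # at the moment the limit is reached.
    "CONCURRENT_REQUESTS": 1,
}
```

Without CONCURRENT_REQUESTS = 1 the limit still triggers the shutdown, but requests that are already being downloaded are counted in 'downloader/request_count'.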

Answer 1

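CLOSESPIDER_PAGECOUNT is controlled by the CloseSpider extension, which counts every response until it reaches the limit. At that point it tells the crawler process to start shutting down (finishing requests and closing available slots).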
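The counting described above can be modeled in a few lines. This is a simplified sketch, not Scrapy's source: the class PageCountCloser and its close_spider callback are hypothetical stand-ins for the extension and the engine, and only the reason string closespider_pagecount is taken from Scrapy itself.

```python
class PageCountCloser:
    """Toy stand-in for the page-count part of the CloseSpider extension."""

    def __init__(self, limit, close_spider):
        self.limit = limit                # value of CLOSESPIDER_PAGECOUNT
        self.close_spider = close_spider  # hypothetical "start shutting down" hook
        self.pagecount = 0

    def response_received(self, response):
        # Every response is counted, whatever the spider does with it later.
        self.pagecount += 1
        if self.pagecount == self.limit:
            # Only *requests* a shutdown; responses still in flight keep arriving.
            self.close_spider("closespider_pagecount")


reasons = []
closer = PageCountCloser(limit=2, close_spider=reasons.append)
for page in ["page-1", "page-2", "page-3"]:
    closer.response_received(page)
print(reasons)  # ['closespider_pagecount']
```

Note that the third response is still counted after the shutdown was requested, which is why the limit is a trigger rather than a hard cap.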

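Now, the reason your spider ends when you specify CLOSESPIDER_PAGECOUNT=1 is that at that moment (when it gets its first response) there are no pending requests. They are only created after the first response, so the crawler process is ready to end and does not take the later ones into account (because they would be created after the first one).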

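When you specify CLOSESPIDER_PAGECOUNT>1, your spider is caught while it is creating requests and filling the request queue. By the time the spider knows it should finish, there are still pending requests to process, and these are executed as part of closing the spider.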
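The two cases can be reproduced with a toy simulation. It is only a sketch under stated assumptions, not Scrapy's engine: simulate_crawl is a hypothetical helper, the start page is assumed to yield 16 links, linked pages yield none (as in the example spider), and requests already handed to the downloader are assumed to finish after the limit is hit while still-queued ones are dropped.

```python
from collections import deque


def simulate_crawl(page_limit, concurrency, links_on_start_page=16):
    """Return how many requests a toy crawler sends before it stops."""
    scheduler = deque(["start"])  # requests waiting to be downloaded
    in_flight = deque()           # requests already handed to the downloader
    requests_sent = 0
    responses_seen = 0
    closing = False

    while True:
        # Hand queued requests to the downloader, unless we are shutting down.
        while not closing and scheduler and len(in_flight) < concurrency:
            in_flight.append(scheduler.popleft())
            requests_sent += 1
        if not in_flight:
            break

        url = in_flight.popleft()
        responses_seen += 1
        # The response is counted before the spider callback runs.
        if responses_seen == page_limit:
            closing = True
        # Only the start page produces new requests, and only while still open.
        if not closing and url == "start":
            scheduler.extend("link-%d" % i for i in range(links_on_start_page))

    return requests_sent


print(simulate_crawl(page_limit=1, concurrency=16))  # 1
print(simulate_crawl(page_limit=2, concurrency=16))  # 17
print(simulate_crawl(page_limit=2, concurrency=1))   # 2
```

With a limit of 1 the shutdown is requested before any link is queued, so a single request is sent. With a limit of 2 and 16 concurrent requests, all 16 links are already in flight when the second response arrives, so they complete. With one request at a time the count equals the limit.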


Tags: python, web-crawler, scrapy