Limit how many elements Scrapy can collect

Question

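I am using Scrapy to collect some data. My Scrapy program collects 100 elements in one run. I need to limit it to 50 or some other arbitrary number. How can I do that? Any solution is welcome. Thanks in advance.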

# -*- coding: utf-8 -*-
import re
import scrapy


class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    attr = scrapy.Field()
    title = scrapy.Field()
    tag = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["raleigh.craigslist.org"]
    start_urls = [
        "http://raleigh.craigslist.org/search/bab"
    ]

    BASE_URL = 'http://raleigh.craigslist.org/'

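    def parse(self, response):
        # Each search result links to a posting page; follow every link found.
        links = response.xpath('//a[@class="hdrlnk"]/@href').extract()
        for link in links:
            # urljoin resolves the href against the page URL, so a leading
            # slash in the link does not produce a double slash.
            absolute_url = response.urljoin(link)
            yield scrapy.Request(absolute_url, callback=self.parse_attr)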

    def parse_attr(self, response):
        match = re.search(r"(\w+)\.html", response.url)
        if match:
            item_id = match.group(1)
            url = self.BASE_URL + "reply/ral/bab/" + item_id

            item = DmozItem()
            item["link"] = response.url
            item["title"] = "".join(response.xpath("//span[@class='postingtitletext']//text()").extract())
            # extract_first avoids an IndexError when the attribute group is missing
            item["tag"] = response.xpath("//p[@class='attrgroup']/span/b/text()").extract_first(default="")
            return scrapy.Request(url, meta={'item': item}, callback=self.parse_contact)

    def parse_contact(self, response):
        item = response.meta['item']
        item["attr"] = "".join(response.xpath("//div[@class='anonemail']//text()").extract())
        return item

Answer 1

This is what the CloseSpider extension and its CLOSESPIDER_ITEMCOUNT setting are for:

An integer which specifies a number of items. If the spider scrapes more than that amount of items and those items are passed by the item pipeline, the spider will be closed with the reason closespider_itemcount. If zero (or not set), spiders won't be closed by number of passed items.
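
As a rough illustration of that setting, the fragment below shows how it might look in a project's settings.py. The value 50 comes from the question; putting it in settings.py rather than elsewhere is only an assumption about how the project is laid out.

```python
# settings.py (assumed location; any Scrapy settings source would do)

# Ask the CloseSpider extension to stop the crawl once roughly
# 50 items have passed through the item pipeline.
CLOSESPIDER_ITEMCOUNT = 50
```

The same value can also be passed for a single run with `scrapy crawl dmoz -s CLOSESPIDER_ITEMCOUNT=50`. Requests that are already in the downloader when the limit is reached are still processed, so the final item count can end up somewhat above 50.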

Answer 2

I tried the answer above, but I had to combine all 3 limits to make it work, so I am leaving this here in case anyone else runs into the same problem:

class GenericWebsiteSpider(scrapy.Spider):
    """This generic website spider extracts text from websites"""

    name = "generic_website"

    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 15,
        'CONCURRENT_REQUESTS': 15,
        'CLOSESPIDER_ITEMCOUNT': 15
    }
...
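
The three settings above act as soft limits, because requests already in flight are still processed when a limit is hit. If an exact cap matters, one option is to trim the list of links before any request is made. The following is only a sketch: `take_first` is a hypothetical helper, not part of Scrapy, and the sample links are made up for the demonstration.

```python
from itertools import islice


def take_first(links, max_items=50):
    """Return at most max_items entries from links, preserving order."""
    return list(islice(links, max_items))


# Made-up hrefs standing in for what the listing page's XPath would return.
sample_links = ["/bab/%d.html" % i for i in range(100)]

limited = take_first(sample_links, 50)
print(len(limited))
```

Inside the question's spider, the loop in parse() would then iterate over `take_first(links, 50)` instead of `links`. Fewer than 50 items may still come out if some of the followed pages yield nothing.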
