Scrapy - 满足条件后按顺序抓取网址

Scrapy - Crawl urls sequentially after a condition is met

我有一个蜘蛛,它以四个不同的 start_urls 开始,然后继续在里面爬行某些 link。它们都具有 相同的域和结构 ,唯一不同的是它们之间的查询参数。我使用两条规则:一条用于打开和解析每个 link,一条用于分页。

我的问题是:由于分页产生的大量内容,我不想抓取所有 link,因此我需要检查抓取的每个 link 的条件(出版物年),一旦那年与我想要的年份不同,蜘蛛就应该忽略属于那个 start_url 的所有剩余 link 的爬行,然后继续前进到 links 由第二个 start_url 生成。我该怎么做呢?这是我的蜘蛛的代码:

class articleSpider(CrawlSpider):
    name = 'article'
    allowed_domains = ['website.com']
    start_urls = [
        'https://www.website.com/search/?category=value1',
        'https://www.website.com/search/?category=value2',
        'https://www.website.com/search/?category=value3',
        'https://www.website.com/search/?category=value4',
        ]

    rules = (
        Rule(
            LinkExtractor(
                restrict_xpaths="//div[@class='results-post']/article/a"
                ), 
            callback='parse_item', 
            follow=True,
           ),
        Rule(
            LinkExtractor(
                restrict_xpaths="//section[@class='results-navi'][1]/div/div[@class='prevpageNav left']"
                )
            )
        )
    
    def parse_item(self, response):
        name = response.url.strip('/').split('/')[-1]
        date = response.xpath("//section/p/time/@datetime").get()[:4]
        if date == '2020':
            with open(f'./src/data/{name}.html', 'wb') as f:
                f.write(response.text.encode('utf8'))
                return

在此先感谢您的帮助。

我不知道有什么简单的方法可以实现这一点,但也许下面的(未经测试的)代码可以帮助您入门。 逻辑如下:

  • 覆盖 start_requests 以仅从第一个开始-url 仅
  • 传递 meta
  • 中的其他开始-url
  • 在解析方法中,获取项目-urls 和下一页url
  • 浏览第 url 项。当您进入 2020 年时,它会经过 item_urls(如果您 运行 超出项目-url,则进入下一页 url)。遇到不一样的年份就去下一个start_url.
from scrapy import Spider, Request


class articleSpider(Spider):
    name = 'article'
    allowed_domains = ['website.com']
    start_urls = [
        'https://www.website.com/search/?category=value1',
        'https://www.website.com/search/?category=value2',
        'https://www.website.com/search/?category=value3',
        'https://www.website.com/search/?category=value4',
    ]

    def start_requests(self):
        start_urls = self.start_urls
        start_url = start_urls.pop()
        meta = {'start_urls': start_urls}
        yield Request(start_url, callback=self.parse, meta=meta)

    def parse(self, response):
        start_urls = response.meta['start_urls']
        # get item-urls
        item_urls = response.xpath(
            '//div[@class="results-post"]/article/a'
        ).extract()

        # get next page-url
        next_page = response.xpath(
            '//section[@class="results-navi"][1]/div/div[@class="prevpageNav left"]'
        ).extract_first()

        # pass the item-urls and next page in the meta
        item_url = item_urls.pop()
        meta = {
            'next_page': next_page,
            'item_urls': item_urls,
            'start_urls': start_urls
        }
        yield Request(item_url, self.parse_item, meta=meta)

    def parse_item(self, response):
        item_urls = response.meta['item_urls']
        next_page = response.meta['next_page']
        start_urls = response.meta['start_urls']

        name = response.url.strip('/').split('/')[-1]
        date = response.xpath("//section/p/time/@datetime").get()[:4]
        if date == '2020':
            with open(f'./src/data/{name}.html', 'wb') as f:
                f.write(response.text.encode('utf8'))
            try:
                item_url = item_urls.pop()
            except IndexError:
                # all items are done - we go to next page
                if next_page:
                    meta = {'start_urls': start_urls}
                    yield Request(next_page, self.parse, meta=meta)
                else:
                    # no pages left, go to next start_url
                    try:
                        start_url = start_urls.pop()
                    except IndexError:
                        # nothing left to do
                        return
                    else:
                        meta = {'start_urls': start_urls}
                        yield Request(start_url, self.parse, meta=meta)
            else:
                # still items left to process
                meta = {
                    'next_page': next_page,
                    'item_urls': item_urls
                }
                yield Request(item_url, self.parse_item, meta=meta)

        else:
            # go to next start_url
            try:
                start_url = start_urls.pop()
            except IndexError:
                # nothing left to do
                return
            else:
                meta = {'start_urls': start_urls}
                yield Request(start_url, self.parse, meta=meta)