How to determine if the generator returned from `yield scrapy.Request` has any data?
In the Scrapy Tutorial, the spider extracts next-page links from the element with class="next" and crawls them -
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
In my case, I cannot find a next-page link in the files downloaded from the web server, but I know the URL format is response.url concatenated with /page/[page number]/. Pages that yield no quotes still return a response, e.g. - No quotes found!. Since the number of next pages is usually fewer than 20, I can traverse all the possible urls by replacing the spider's last 3 lines with -
for page_num in range(2, 20):
    yield response.follow(f"/page/{page_num}/", callback=self.parse)
However, this forces the spider to request pages (e.g. http://quotes.toscrape.com/page/11 to 20) which don't yield quotes. How can I adjust my spider to terminate the page_num loop after requesting the first page which does not yield quotes (such as http://quotes.toscrape.com/page/11)?
Pseudocode -
page_num = 2
while (quotes are yielded from the response):
    yield response.follow(f"/page/{page_num}/", callback=self.parse)
    page_num += 1
You can use the result of response.css('..') as the condition for requesting the next page.
In that case your code would look like this:
import re

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    @staticmethod
    def get_pagenumber_from_url(url):
        # Helper to pull the page number out of a URL like
        # http://quotes.toscrape.com/page/7/ (defaults to 1 if absent).
        match = re.search(r'/page/(\d+)', url)
        return int(match.group(1)) if match else 1

    def parse(self, response):
        page_num = self.get_pagenumber_from_url(response.url)
        quotes_sel = response.css('div.quote')
        # quotes_sel is a SelectorList: non-empty (truthy) if the page
        # has item data, empty (falsy) if it doesn't.
        for quote in quotes_sel:
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        if quotes_sel:
            next_page_url = f"/page/{page_num + 1}/"
            yield response.follow(next_page_url, callback=self.parse)