Why doesn't a callback get executed immediately upon calling yield in Scrapy?

I'm building a web scraper to scrape remote jobs. The spider behaves in a way that I don't understand, and I'd be grateful if someone could explain why.

Here is the spider code:

import scrapy
import time

class JobsSpider(scrapy.Spider):
    name = "jobs"
    start_urls = [
        "https://whosebug.com/jobs/remote-developer-jobs"
    ]
    already_visited_links = []

    def parse(self, response):
        jobs = response.xpath("//div[contains(@class, 'job')]")
        links_to_next_pages = response.xpath("//a[contains(@class, 's-pagination--item')]").css("a::attr(href)").getall()

        # visit each job page (as I do in the browser) and scrape the relevant information (Job title etc.)
        for job in jobs:
            job_id = int(job.xpath('@data-jobid').extract_first()) # there will always be one element
            # now visit the link with the job_id and get the info
            job_link_to_visit = "https://whosebug.com/jobs?id=" + str(job_id)
            request = scrapy.Request(job_link_to_visit,
                             callback=self.parse_job)
            yield request

        # sleep for 10 seconds before requesting the next page
        print("Sleeping for 10 seconds...")
        time.sleep(10)

        # go to the next job listings page (if you haven't already been there)
        # not sure if this solution is the best since it has a loop which has a recursion in it
        for link_to_next_page in links_to_next_pages:
            if link_to_next_page not in self.already_visited_links:
                self.already_visited_links.append(link_to_next_page)
                yield response.follow(link_to_next_page, callback=self.parse)

        print("End of parse method")

    def parse_job(self, response):
        print("In parse_job")
        print(response.body)
        print("Sleeping for 10 seconds...")
        time.sleep(10)

Here is the output (the relevant part):

Sleeping for 10 seconds...
End of parse method
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=525754> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=525748> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=497114> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=523136> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=525730> (referer: https://whosebug.com/jobs/remote-developer-jobs)
In parse_job
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs/remote-developer-jobs?so_source=JobSearch&so_medium=Internal> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=523319> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=522480> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=511761> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=522483> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=249610> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=522481> (referer: https://whosebug.com/jobs/remote-developer-jobs)
In parse_job
In parse_job
In parse_job
In parse_job
...

I don't understand why the parse method runs to completion before the parse_job method gets called. From my understanding, as soon as I yield a job from jobs, the parse_job method should be called. The spider should go through each page of job listings and visit the details of each job on a given listings page. However, that description doesn't match the output. I also don't understand why there are multiple GET requests between calls to the parse_job method.

Can someone explain what is going on here?

Scrapy is event-driven. First, requests are queued by the Scheduler. Queued requests are handed to the Downloader. When a response has been downloaded and is ready, the callback function is invoked, with the response passed to it as its first argument.
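To make this concrete, here is a minimal sketch (a hypothetical spider with placeholder URLs, not the code above) showing that yield only enqueues a request; the method keeps running, and the callback fires later, once the Downloader has a response ready:

import scrapy

class FlowSpider(scrapy.Spider):
    name = "flow"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # yield hands the Request to the Scheduler and immediately returns
        # control here; parse_detail is NOT called at this point
        yield scrapy.Request("https://example.com/detail", callback=self.parse_detail)
        print("End of parse")  # printed before parse_detail ever runs

    def parse_detail(self, response):
        # invoked by the engine only after the Downloader delivers the response
        print("In parse_detail")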

You are blocking the callbacks with time.sleep(). In the provided log, after the first callback invocation the process was blocked for 10 seconds inside parse_job(), but the Downloader kept working in the meantime and prepared responses for the callback, as the consecutive DEBUG: Crawled (200) lines right after the first parse_job() call show. So while the callback was blocked, the Downloader finished its work and the responses were queued, waiting to be handed to the callback function. As the last part of the log makes apparent, handing responses to the callback function is the bottleneck.
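The fix for the stall itself is simply not to block: the callback should do its work and return, letting the engine keep dispatching. A minimal sketch of the same parse_job with the sleep removed:

    def parse_job(self, response):
        # no time.sleep() here: blocking the callback stalls the single
        # thread that dispatches every queued response
        print("In parse_job")
        print(response.body)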

If you want a delay between requests, it is better to use the DOWNLOAD_DELAY setting rather than time.sleep().
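For example, the delay can be set per spider through custom_settings (a sketch based on the spider above; the 10-second value mirrors the original time.sleep(10)):

import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"
    start_urls = ["https://whosebug.com/jobs/remote-developer-jobs"]
    # Scrapy waits this long between consecutive requests to the same site;
    # the delay is handled by the downloader, so callbacks are never blocked
    custom_settings = {"DOWNLOAD_DELAY": 10}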

See this for more details about Scrapy's architecture.