Why doesn't a callback get executed immediately upon calling yield in Scrapy?
I am building a web scraper to scrape remote jobs. The spider behaves in a way that I don't understand, and I'd be grateful if someone could explain why.
Here is the code for the spider:
import scrapy
import time


class JobsSpider(scrapy.Spider):
    name = "jobs"
    start_urls = [
        "https://whosebug.com/jobs/remote-developer-jobs"
    ]
    already_visited_links = []

    def parse(self, response):
        jobs = response.xpath("//div[contains(@class, 'job')]")
        links_to_next_pages = response.xpath("//a[contains(@class, 's-pagination--item')]").css("a::attr(href)").getall()

        # visit each job page (as I do in the browser) and scrape the relevant information (Job title etc.)
        for job in jobs:
            job_id = int(job.xpath('@data-jobid').extract_first())  # there will always be one element
            # now visit the link with the job_id and get the info
            job_link_to_visit = "https://whosebug.com/jobs?id=" + str(job_id)
            request = scrapy.Request(job_link_to_visit,
                                     callback=self.parse_job)
            yield request

        # sleep for 10 seconds before requesting the next page
        print("Sleeping for 10 seconds...")
        time.sleep(10)

        # go to the next job listings page (if you haven't already been there)
        # not sure if this solution is the best since it has a loop which has a recursion in it
        for link_to_next_page in links_to_next_pages:
            if link_to_next_page not in self.already_visited_links:
                self.already_visited_links.append(link_to_next_page)
                yield response.follow(link_to_next_page, callback=self.parse)

        print("End of parse method")

    def parse_job(self, response):
        print(response.body)
        print("Sleeping for 10 seconds...")
        time.sleep(10)
        pass
Here is the output (the relevant parts):
Sleeping for 10 seconds...
End of parse method
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=525754> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=525748> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=497114> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=523136> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=525730> (referer: https://whosebug.com/jobs/remote-developer-jobs)
In parse_job
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs/remote-developer-jobs?so_source=JobSearch&so_medium=Internal> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=523319> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=522480> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=511761> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=522483> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=249610> (referer: https://whosebug.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://whosebug.com/jobs?id=522481> (referer: https://whosebug.com/jobs/remote-developer-jobs)
In parse_job
In parse_job
In parse_job
In parse_job
...
I don't understand why the parse method runs to completion before the parse_job method gets called. From my understanding, as soon as I yield a job from jobs, the parse_job method should be called. The spider should go through every page of the job listings and, on each listings page, visit the details of every job. However, the behaviour I just described does not match the output. I also don't understand why there are several GET requests between each call to the parse_job method.
Can someone explain what is going on here?
Scrapy is event driven. Firstly, requests are queued by the Scheduler. The queued requests are passed to the Downloader. The callback function is called once the response has been downloaded and is ready, and the response is then passed to the callback as its first argument.
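As a minimal sketch of that flow (the URL, spider name and log messages below are placeholders, not taken from the spider above), yielding a Request only hands it to the engine and Scheduler; the callback runs later, once the Downloader has fetched the response:

import scrapy


class SketchSpider(scrapy.Spider):
    name = "sketch"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Yielding only enqueues the request with the Scheduler;
        # parse_detail is NOT called at this point.
        yield scrapy.Request("https://example.com/detail", callback=self.parse_detail)
        # Execution continues here immediately, while the Downloader
        # fetches the page concurrently.
        self.logger.info("still inside parse; the callback has not run yet")

    def parse_detail(self, response):
        # Runs later, whenever the Downloader has the response ready.
        self.logger.info("parse_detail called for %s", response.url)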
You are blocking the callbacks by using time.sleep(). In the logs provided, after the first callback call the process was blocked for 10 seconds inside parse_job(), but in the meantime the Downloader kept working and got more responses ready for the callback, as is evident from the successive DEBUG: Crawled (200) lines logged right after the first parse_job() call. So while the callback was blocked, the Downloader finished its work and the responses were queued up, waiting to be fed to the callback. As the last part of the logs makes obvious, feeding the responses to the callback is the bottleneck.
If you want to put a delay between requests, it is better to use the DOWNLOAD_DELAY setting instead of time.sleep().
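As a rough sketch of that suggestion (assuming the 10-second value simply mirrors the original sleep; set whatever delay you actually need), the delay can be configured per spider through custom_settings, with the time.sleep() calls removed:

import scrapy


class JobsSpider(scrapy.Spider):
    name = "jobs"
    start_urls = ["https://whosebug.com/jobs/remote-developer-jobs"]
    # Scrapy waits this long between consecutive requests,
    # without blocking the callbacks the way time.sleep() does.
    custom_settings = {
        "DOWNLOAD_DELAY": 10,
    }

    def parse(self, response):
        for job in response.xpath("//div[contains(@class, 'job')]"):
            job_id = job.xpath("@data-jobid").get()
            yield scrapy.Request(
                "https://whosebug.com/jobs?id=" + job_id,
                callback=self.parse_job,
            )
        # ... pagination as before, just without the time.sleep(10) calls

    def parse_job(self, response):
        print(response.body)

The same setting can also go into the project's settings.py; either way, the engine keeps scheduling requests and running callbacks while it waits, instead of freezing for 10 seconds at a time.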
See this for more details about the Scrapy architecture.