Can't use `headers` or `dont_filter=True` while making inline requests in an async method within scrapy

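I've created a script to scrape the name, phone, and email address of different shops from yellowpages.com. I used an async method within scrapy to parse the email address from the inner pages while parsing the name and phone from the landing page. The script works fine.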

What I can't figure out is how to use `headers` or `dont_filter=True` in the inline requests. This is the part I mean:

request = response.follow(email_url)
resp = await self.crawler.engine.download(request, self)

The spider I'm using:

import scrapy
from scrapy.crawler import CrawlerProcess

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    }

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url,headers=self.headers)

    async def parse(self, response):
        for items in response.css("[class='result'] .v-card > .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email_url = items.css("a.business-name::attr(href)").get()

            request = response.follow(email_url)
            resp = await self.crawler.engine.download(request, self)
            email = resp.css("a.email-business[href^='mailto:']::attr(href)").get()
            yield {"Shop name": name, "Phone": phone, "Email": email}
            
if __name__ == "__main__":
    c = CrawlerProcess()
    c.crawl(YellowpagesSpider)
    c.start()

You can pass them to `follow` itself. The `follow` method accepts all the arguments that `Request.__init__` supports:

response.follow(email_url, dont_filter=True, headers=self.headers)

https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response.follow