Can't use `headers` or `dont_filter=True` while making inline requests in async method within scrapy
I've created a script to scrape the `name`, `phone`, and `email` address of different shops from yellowpages.com. I'm using an `async` method in Scrapy to parse the email addresses from the inner pages while parsing the names and phone numbers from the landing page. The script is running fine. What I can't figure out is how to use `headers` or `dont_filter=True` while making the inline request. Here is what I mean:
request = response.follow(email_url)
resp = await self.crawler.engine.download(request, self)
The spider I'm using:
import scrapy
from scrapy.crawler import CrawlerProcess


class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    }

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, headers=self.headers)

    async def parse(self, response):
        for items in response.css("[class='result'] .v-card > .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email_url = items.css("a.business-name::attr(href)").get()
            request = response.follow(email_url)
            resp = await self.crawler.engine.download(request, self)
            email = resp.css("a.email-business[href^='mailto:']::attr(href)").get()
            yield {"Shop name": name, "Phone": phone, "Email": email}


if __name__ == "__main__":
    c = CrawlerProcess()
    c.crawl(YellowpagesSpider)
    c.start()
You can pass them in `follow` itself. The `follow` method accepts all the arguments supported by `Request`'s `__init__`:
response.follow(email_url, dont_filter=True, headers=self.headers)
https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response.follow
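A minimal sketch, assuming the same selectors and User-Agent string as in the question, of how the spider could look once those arguments are passed through `follow`:

import scrapy
from scrapy.crawler import CrawlerProcess


class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    }

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, headers=self.headers)

    async def parse(self, response):
        for items in response.css("[class='result'] .v-card > .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email_url = items.css("a.business-name::attr(href)").get()
            # headers and dont_filter are forwarded to the Request object,
            # which the engine then downloads inline
            request = response.follow(email_url, dont_filter=True, headers=self.headers)
            resp = await self.crawler.engine.download(request, self)
            email = resp.css("a.email-business[href^='mailto:']::attr(href)").get()
            yield {"Shop name": name, "Phone": phone, "Email": email}


if __name__ == "__main__":
    c = CrawlerProcess()
    c.crawl(YellowpagesSpider)
    c.start()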