Python Scrapy | How to pass the response from the spider to the main function
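I have googled extensively for a solution, but I may not be using the right keywords. I know I can use the Scrapy shell to try CSS and XPath selectors right away, but I would like to know whether this can be done in an IDE environment outside the spider class, i.e. in another cell.
Example code: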
class ExampleSpider(scrapy.Spider):
    name = "exampleSpider"
    start_urls = ["https://www.example.com"]

    def parse(self, response):
        URL = "www.example.com/1/"
        yield response
Then I would like to be able to use this response with selectors in another cell:
table_rows = response.xpath("//div[@class='example']/table/tr")  # produces error
print(table_rows.xpath("td[4]//text()")[0].get())
It produces the error:
NameError: name 'response' is not defined
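Any assistance/guidance would be greatly appreciated.
If I understand correctly, you want the spider to return the response so you can parse it in the main script?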
main.py:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.signalmanager import dispatcher
from scrapy import signals


def spider_output(spider):
    output = []

    # Called once for every item the spider yields.
    def get_output(item):
        output.append(item)

    dispatcher.connect(get_output, signal=signals.item_scraped)

    settings = get_project_settings()
    settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    process = CrawlerProcess(settings)
    process.crawl(spider)
    process.start()  # blocks until the crawl has finished
    return output


if __name__ == "__main__":
    spider = "exampleSpider"
    response = spider_output(spider)
    response = response[0]['response']
    title = response.xpath('//h3//text()').get()
    price = response.xpath('//div[@class="card-body"]/h4/text()').get()
    print(f"Title: {title}")
    print(f"Price: {price}")
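We start the spider and append every item it yields to output. Since output holds only one value, we do not need to loop and can simply take the first one, response[0]. We then want the value stored under the key response, hence response = response[0]['response'].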
spider.py:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "exampleSpider"
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    def parse(self, response):
        yield {'response': response}
Here we return an item that contains the response.
The steps are: main -> spider_output -> spider -> return the response item to spider_output -> append the items to the output list -> return output to main -> get the response from the output -> parse the response.
Output:
Title: Long-sleeved Jersey Top
Price: .99
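The chain of steps listed above can also be followed in a Scrapy-free sketch. Everything here is a stand-in for illustration: fake_crawl plays the role of the crawl firing the item_scraped signal, spider_output_sketch mirrors spider_output, and the string is a placeholder for a real Response object. It only shows how the inner callback fills the list that is handed back to main.

```python
def fake_crawl(on_item):
    # Stand-in for the spider: "yields" one item by calling the callback,
    # much like item_scraped would call a connected handler.
    on_item({"response": "<stand-in for a Response>"})


def spider_output_sketch(crawl):
    output = []

    # The closure keeps a reference to `output`, so items appended here
    # are still available after the crawl has returned.
    def get_output(item):
        output.append(item)

    crawl(on_item=get_output)
    return output


items = spider_output_sketch(fake_crawl)
first = items[0]["response"]
print(first)
```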