How can I get the contents of the second page as well with Scrapy for the scenario below?
I have a spider that needs to scrape an array of objects, each with 5 fields. 4 of the fields are on the same page; the 5th is a URL from which I need to extract data and return all 5 fields as text. In the snippet below, the explanation is the key that lives on another page. I need to parse that page and add its data alongside the other attributes when yielding the item.
My current solution, when exported to a JSON file, looks like this. As you can see, my "e" never gets resolved. How do I fetch that data?
[
    {
        "q": "How many pairs of integers (x, y) exist such that the product of x, y and HCF (x, y) = 1080?",
        "c": [
            "8",
            "7",
            "9",
            "12"
        ],
        "a": "Choice (C).9",
        "e": "<Request GET http://iim-cat-questions-answers.2iim.com/quant/number-system/hcf-lcm/hcf-lcm_1.shtml>",
        "d": "Hard"
    }
]
import scrapy

class CatSpider(scrapy.Spider):
    name = "catspider"
    start_urls = [
        'http://iim-cat-questions-answers.2iim.com/quant/number-system/hcf-lcm/'
    ]

    def parse_solution(self, response):
        yield response.xpath('//p[@class="soln"]').extract_first()

    def parse(self, response):
        for lis in response.xpath('//ol[@class="z1"]/li'):
            questions = lis.xpath('.//p[@lang="title"]/text()').extract_first()
            choices = lis.xpath(
                './/ol[contains(@class, "xyz")]/li/text()').extract()
            answer = lis.xpath(
                './/ul[@class="exp"]/li/span/span/text()').extract_first()
            explanation = lis.xpath(
                './/ul[@class="exp"]/li[2]/a/@href').extract_first()
            difficulty = lis.xpath(
                './/ul[@class="exp"]/li[last()]/text()').extract_first()
            if questions and choices and answer and explanation and difficulty:
                yield {
                    'q': questions,
                    'c': choices,
                    'a': answer,
                    'e': scrapy.Request(response.urljoin(explanation), callback=self.parse_solution),
                    'd': difficulty
                }
Scrapy is an asynchronous framework, which means none of its components block. A Request
does nothing by itself as an object; it merely stores information for the scrapy downloader. So you cannot put one in an item, as you do now, and expect it to download anything.
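To see the mechanics in isolation, here is a framework-free toy model (all names are invented; nothing here is Scrapy API) of how an engine treats a yielded request as inert data until it downloads the page and hands the response to the callback, with `meta` riding along:

```python
from collections import deque

class FakeRequest:
    """Inert data: a URL, a callback to call later, and a meta dict."""
    def __init__(self, url, callback, meta=None):
        self.url, self.callback, self.meta = url, callback, meta or {}

def run_engine(start_requests, download):
    """Pop requests, 'download' them, feed responses to their callbacks."""
    items, queue = [], deque(start_requests)
    while queue:
        req = queue.popleft()
        response = {'url': req.url, 'body': download(req.url), 'meta': req.meta}
        for result in req.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)   # more crawling to do
            else:
                items.append(result)   # a finished item
    return items

def parse(response):
    item = {'foo': 'foo is great'}
    # The request is only queued here; nothing downloads until the
    # engine gets around to it and invokes parse2 with the response.
    yield FakeRequest('http://nextpage.com', parse2, meta={'item': item})

def parse2(response):
    item = response['meta']['item']    # the half-finished item rode along
    item['bar'] = response['body']
    yield item

items = run_engine([FakeRequest('http://firstpage.com', parse)],
                   download=lambda url: 'bar is great too!')
```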
The common solution is to design a crawl chain that carries the data through the callbacks:
def parse(self, response):
    item = dict()
    item['foo'] = 'foo is great'
    next_page = 'http://nextpage.com'
    return Request(next_page,
                   callback=self.parse2,
                   meta={'item': item})  # put our item in meta

def parse2(self, response):
    item = response.meta['item']  # take our item from the meta
    item['bar'] = 'bar is great too!'
    return item