Scrapy is not working when going to child pages
I want to grab some information from each child page, and that part should be fine. But the code never actually goes into the child pages; it only moves on to the next listing page. What it should do is scrape the data from each child page, from the first entry down to the last one on the listing page, and only then move to the next page.
import scrapy
from ukparl.items import UkparlItem


class UkparlSpider(scrapy.Spider):
    name = 'ukparldata'
    # allowed_domains = ["https://members.parliament.uk/"]
    start_urls = ['https://members.parliament.uk/members/commons?page=1']

    def parse(self, response):
        nextpageurl = response.xpath('//a[@title="Go to next page"]/@href')
        yield from self.scrape(response)
        if nextpageurl:
            path = nextpageurl.extract_first()
            nextpage = response.urljoin(path)
            print("Found url: {}".format(nextpage))
            yield scrapy.Request(nextpage, callback=self.parse)

    def scrape(self, response):
        for resource in response.xpath('//div[@class="primary-info"]/..'):
            item = UkparlItem()
            item['name'] = resource.xpath('div[@class="primary-info"]/text()').extract_first()
            profilepage = response.urljoin(resource.xpath('//a[@class="card card-member"]/@href').extract_first())
            item['link'] = profilepage
            item['party'] = resource.xpath('div[@class="secondary-info"]/text()').extract_first()
            item['region'] = resource.xpath('//div[@class="indicator indicator-label"]/text()').extract_first()
            request = scrapy.Request(profilepage, callback=self.get_data)
            request.meta['item'] = item
            yield request

    def get_data(self, response):
        item = response.meta['item']
        item['phonenumber'] = response.xpath('//div[@class="contact-line"]/a/text()').extract_first()
        item['twitter'] = response.xpath('//a[@class="card card-contact-info"][2]/@href').extract()
        yield item
The spider does reach the child pages — you can add print(item) inside the get_data method to see it. The real problem is that you request the same child page over and over:
<GET https://members.parliament.uk/members/commons?page=2> (referer: https://members.parliament.uk/members/commons?page=1)
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
And so on
You can change:

profilepage = response.urljoin(resource.xpath('//a[@class="card card-member"]/@href').extract_first())

to

profilepage = response.urljoin(resource.xpath('../../@href').extract_first())

(I think it would be cleaner to replace response.xpath('//div[@class="primary-info"]/..') with a different loop selector altogether, but that's just my opinion.)
As you can see, each request now goes to a different member page:
DEBUG: Crawled (200) <GET https://members.parliament.uk/member/3922/contact> (referer: https://members.parliament.uk/members/commons?page=5)
<GET https://members.parliament.uk/member/3986/contact>
<GET https://members.parliament.uk/member/4769/contact>
<GET https://members.parliament.uk/member/1538/contact>
<GET https://members.parliament.uk/member/420/contact>
<GET https://members.parliament.uk/member/185/contact>
<GET https://members.parliament.uk/member/4439/contact>
<GET https://members.parliament.uk/member/4589/contact>
<GET https://members.parliament.uk/member/4806/contact>
<GET https://members.parliament.uk/member/4465/contact>
<GET https://members.parliament.uk/member/1508/contact>
<GET https://members.parliament.uk/member/4368/contact>
<GET https://members.parliament.uk/member/1554/contact>
<GET https://members.parliament.uk/member/4469/contact>
<GET https://members.parliament.uk/member/4088/contact>
<GET https://members.parliament.uk/member/4859/contact>
<GET https://members.parliament.uk/member/3950/contact>
<GET https://members.parliament.uk/member/1406/contact>
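The underlying reason: in Scrapy's selectors, an XPath that begins with // searches from the document root even when it is called on a selector for a single element, so every loop iteration picks up the first matching link on the whole page. Here is a minimal sketch of the same mistake using only the standard library's xml.etree.ElementTree (not Scrapy's selectors, and with made-up markup):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the listing page markup.
html = """
<results>
  <card><a href="/member/1/contact">Alice</a></card>
  <card><a href="/member/2/contact">Bob</a></card>
  <card><a href="/member/3/contact">Carol</a></card>
</results>
"""
root = ET.fromstring(html)

# Buggy pattern: searching from the document root inside the loop
# always returns the first <a> on the page, like //a[...] in the spider.
buggy = [root.find('.//a').get('href') for _ in root.findall('card')]

# Fixed pattern: search relative to the current card element only.
fixed = [card.find('a').get('href') for card in root.findall('card')]

print(buggy)  # ['/member/1/contact', '/member/1/contact', '/member/1/contact']
print(fixed)  # ['/member/1/contact', '/member/2/contact', '/member/3/contact']
```

The suggested '../../@href' works the same way: it is relative to the current resource node, so each iteration resolves to that card's own link.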