Scrapy is not working when going to child pages

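I want to get some information from each child page. However, the code does not go into the child pages; it goes straight to the next page instead. What it should do is collect the data from the child page of every element, down to the last one on the page, and only then move on to the next page.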

import scrapy
from ukparl.items import UkparlItem


class UkparlSpider(scrapy.Spider):
    name = 'ukparldata'

    # allowed_domains = ["https://members.parliament.uk/"]
    start_urls = ['https://members.parliament.uk/members/commons?page=1']

    def parse(self, response):

        nextpageurl = response.xpath('//a[@title="Go to next page"]/@href')

        yield from self.scrape(response)

        if nextpageurl:
            path = nextpageurl.extract_first()
            nextpage = response.urljoin(path)
            print("Found url: {}".format(nextpage))
            yield scrapy.Request(nextpage, callback=self.parse)

    def scrape(self, response):
        for resource in response.xpath('//div[@class="primary-info"]/..'):
            item = UkparlItem()

            item['name'] = resource.xpath('div[@class="primary-info"]/text()').extract_first()

            profilepage = response.urljoin(resource.xpath('//a[@class="card card-member"]/@href').extract_first())
            item['link'] = profilepage
            item['party'] = resource.xpath('div[@class="secondary-info"]/text()').extract_first()
            item['region'] = resource.xpath('//div[@class="indicator indicator-label"]/text()').extract_first()

            request = scrapy.Request(profilepage, callback=self.get_data)
            request.meta['item'] = item
            yield request
                
    def get_data(self, response):
        item = response.meta['item']
        item['phonenumber'] = response.xpath('//div[@class="contact-line"]/a/text()').extract_first()
        item['twitter'] = response.xpath('//a[@class="card card-contact-info"][2]/@href').extract()
        yield item

It does go into the child page; you can add print(item) to the get_data function to see that.

The problem is that you are requesting the same page over and over:

<GET https://members.parliament.uk/members/commons?page=2> (referer: https://members.parliament.uk/members/commons?page=1)
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
And so on

You can change:

profilepage = response.urljoin(resource.xpath('//a[@class="card card-member"]/@href').extract_first())

to

profilepage = response.urljoin(resource.xpath('../../@href').extract_first())

(I think it would be better to replace response.xpath('//div[@class="primary-info"]/..') with something else, but that's just my opinion.)

As you can see:

DEBUG: Crawled (200) <GET https://members.parliament.uk/member/3922/contact> (referer: https://members.parliament.uk/members/commons?page=5)
<GET https://members.parliament.uk/member/3986/contact>
<GET https://members.parliament.uk/member/4769/contact>
<GET https://members.parliament.uk/member/1538/contact>
<GET https://members.parliament.uk/member/420/contact>
<GET https://members.parliament.uk/member/185/contact>
<GET https://members.parliament.uk/member/4439/contact>
<GET https://members.parliament.uk/member/4589/contact>
<GET https://members.parliament.uk/member/4806/contact>
<GET https://members.parliament.uk/member/4465/contact>
<GET https://members.parliament.uk/member/1508/contact>
<GET https://members.parliament.uk/member/4368/contact>
<GET https://members.parliament.uk/member/1554/contact>
<GET https://members.parliament.uk/member/4469/contact>
<GET https://members.parliament.uk/member/4088/contact>
<GET https://members.parliament.uk/member/4859/contact>
<GET https://members.parliament.uk/member/3950/contact>
<GET https://members.parliament.uk/member/1406/contact>