Extract all next pages from a website with Scrapy

Question

I am trying to scrape all pages from this URL: https://www.residentialpeople.com/za/property-for-sale/cape-town/?country=za&listing_type=residential&transaction_type=sale&longitude=18.49144&latitude=-33.98983&size_qualifier=square_feet&location_slug=cape-town&sort_by=closest_to_farthest&offset=0&limit=10&active=1&_radius_expansion=0&_location=Cape%20Town,%20South%20Africa&status_available_only=0

However, it only scrapes the first 4 pages and then stops. Here is the code:

    def parse(self, response):
        # follow links to property pages
        for href in response.xpath('//div[@class="listings-item-bottom"]//a[@class="link link--minimal"]/@href').getall():
            yield response.follow(href, self.parse_property)

        # follow pagination links
        old_offset = self.page_counter
        old_offset = str(old_offset) + '0' if old_offset != 0 else str(old_offset)

        try:
            max_page = int(''.join(response.css('div.custom-pagination-select::text').re(r'\d+')))
        except:
            max_page = None

        self.page_counter += 1
        if self.page_counter < max_page:
            new_offset = str(self.page_counter) + '0'

            next_page_url = response._get_url().replace(f'offset={old_offset}', f'offset={new_offset}')
            next_page = response.urljoin(next_page_url)
            yield scrapy.Request(next_page, callback=self.parse)

Does anyone have a suggestion as to what might be going wrong here? Thanks in advance!

Answer 1

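I think the only thing you need to change is the offset in the URL to move to the next page.

Obviously, you may want to generalize this so that you always fetch every page, based on the number of results each search returns.

Code example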

    def parse(self, response):
        # Follow links to the individual property pages.
        for href in response.xpath('//div[@class="listings-item-bottom"]//a[@class="link link--minimal"]/@href').getall():
            yield response.follow(href, self.parse_property)

        # Total number of listings reported by the search page.
        results_num = int(response.xpath('//div[@class="total-available-results"]/span/text()').get())

        # Request every later page by stepping the offset in units of 10.
        # dont_filter is left at its default: each offset is a distinct URL, and the
        # duplicate filter keeps this loop (run on every page) from re-queuing pages.
        for i in range(10, results_num + 10, 10):
            url = f'https://www.residentialpeople.com/za/property-for-sale/cape-town/?country=za&listing_type=residential&transaction_type=sale&longitude=18.49144&latitude=-33.98983&size_qualifier=square_feet&location_slug=cape-town&sort_by=closest_to_farthest&offset={i}&limit=10&active=1&status_available_only=0&_radius_expansion=0&_location=Cape%20Town,%20South%20Africa'
            yield scrapy.Request(url=url, callback=self.parse)

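Explanation

results_num gives the total number of properties (listings) found by the search. We use a for loop that starts at offset 10 and, in this case, goes up to 15720. Remember that we need to add 10 to the end argument, because range excludes its end value and would otherwise stop short of results_num. The step of 10 matches the 10 results shown per page (limit=10).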
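To make the range arithmetic concrete, here is a minimal sketch that lists the offsets the loop produces. The function name later_page_offsets is hypothetical and only wraps the same range call used in the answer; the total of 35 results is an invented example value.

```python
def later_page_offsets(results_num, step=10):
    """Offsets of every page after the first, using the same range as the answer."""
    # range() excludes its end value, hence the "+ step".
    return list(range(step, results_num + step, step))


# With 35 results and 10 per page, the first page is offset 0 and the loop adds:
print(later_page_offsets(35))  # [10, 20, 30, 40]
```

Note that with 35 results the last listings sit on the page at offset 30, so the final offset (40) points one page past the end. That costs at most a single empty request; using results_num as the end argument instead would avoid it.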

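We build the URL for each request dynamically with an f-string inside the for loop, substituting the variable i with the offset values described above. This gives a new URL string for the offset needed on each iteration, and we issue one request per iteration with the parse function as the callback. A note on dont_filter=True: Scrapy's duplicate filter fingerprints the whole URL, query string included, so requests that differ only in offset are not dropped as duplicates and the flag is not required for that reason. Because every paginated response runs this same loop, leaving the duplicate filter enabled is what ensures each page is scheduled only once.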
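As one way to generalize this beyond a single hard-coded search URL, the offset can be rewritten in whatever URL the spider is currently on. The sketch below uses only the standard library; the helper name with_offset and the example.com URL are assumptions for illustration, not part of the original answer.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_offset(url, offset):
    """Return url with its 'offset' query parameter set to the given value."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Replace the existing offset in place so parameter order is preserved.
    new_pairs = [(k, str(offset) if k == "offset" else v) for k, v in pairs]
    if not any(k == "offset" for k, _ in pairs):
        new_pairs.append(("offset", str(offset)))
    return urlunsplit(parts._replace(query=urlencode(new_pairs)))


print(with_offset("https://example.com/search?offset=0&limit=10", 20))
# https://example.com/search?offset=20&limit=10
```

Inside the spider, the loop could then yield scrapy.Request(with_offset(response.url, i), callback=self.parse) instead of formatting the full URL by hand. Be aware that urlencode re-encodes the other parameters (for example, %20 becomes +), which is equivalent for a query string but changes the literal text of the URL.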
