Multi-page crawler gives wrong results

Question

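Thanks in advance for your help!

We need to scrape all product pages of the website https://www.astegiudiziarie.it/ and save them in our MySQL database.

The site has no sitemap, so we chose the summary page as the data source: https://www.astegiudiziarie.it/Immobili/Riepilogo

From there you can see that the first page lists the regions, then come the provinces, then the districts, and finally the product pages that we need to scrape and save.

We are developing with Scrapy and Python 3.8.5.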

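As execution moves from the region page down to the product pages (entries), I pass the data along through the meta parameter.

When I test by writing 'region', 'province', 'district' to a CSV file, the column values come out wrong.

The problem appears when I run scrapy crawl products -o f.csv from the terminal:

The output file contains the 'region', 'province', 'district' columns, but the row contents are not what I expect.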

I don't understand what is wrong in this code.

Thank you very much for your answers and for your support in creating a better web!

Thanks!

import scrapy

from scrapy.http.request import Request

protocol = 'http://'
domain   = 'www.astegiudiziarie.it'
path     = '/Immobili/Riepilogo'

target_url = protocol + domain + path

dev_entry_counter = 0
dev_entry_limit   = 100

def file_debug (message) :
    f = open('debug.txt', 'a')
    f.write(message + "\n\n")
    f.close()

class ProductsSpider (scrapy.Spider) :
    name = 'products'
    allowed_domains = [domain]
    start_urls = [target_url]

    def parse (self, response) :# Parsing of 'regione' (Layer 1)
        regioni = response.xpath('//table[@id="panoramica"]/tbody/tr')
        
        for regione in regioni :# Iterating rows
            regione_name = regione.xpath('//th[@scope="rowgroup"]//text()').extract_first()
            
            hrefs_l1 = regione.xpath('//td/a/@href').extract()
            
            for href_l1 in hrefs_l1 :# Iterating columns
                abs_href_l1 = target_url + href_l1
                
                yield Request(url = abs_href_l1, callback = self.parse_provincia, meta = {'regione': regione_name})

    def parse_provincia (self, response) :# Parsing of 'provincia' (Layer 2)
        province = response.xpath('//table[@id="panoramica"]/tbody/tr')

        for provincia in province :
            provincia_name = provincia.xpath('//th[@scope="rowgroup"]//text()').extract_first()

            hrefs_l2 = provincia.xpath('//td/a/@href').extract()

            for href_l2 in hrefs_l2 :
                abs_href_l2 = target_url + href_l2

                yield Request(url = abs_href_l2, callback = self.parse_comune, meta = {'regione': response.meta['regione'],
                                                                                       'provincia' : provincia_name})

    def parse_comune (self, response) :# Parsing of 'comune' (Layer 3)
        comuni = response.xpath('//table[@id="panoramica"]/tbody/tr')

        for comune in comuni :
            comune_name = comune.xpath('//th[@scope="rowgroup"]//text()').extract_first()

            hrefs_l3 = comune.xpath('//td/a/@href').extract()

            for href_l3 in hrefs_l3 :
                abs_href_l3 = protocol + domain + href_l3

                yield Request(url = abs_href_l3, callback = self.parse_entries, meta = {'regione'   : response.meta['regione'],
                                                                                        'provincia' : response.meta['provincia'],
                                                                                        'comune'    : comune_name})

    def parse_entries (self, response) :# Parsing of 'entries' (list of the products)
        entries = response.xpath('//*[@class="listing-item"]')

        properties = {}

        properties['regione']   = response.meta['regione']
        properties['provincia'] = response.meta['provincia']
        properties['comune']    = response.meta['comune']

        yield properties

Answer 1

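The problem is that when you loop over Selectors and call the xpath method on each one, the query should be relative to the current selector, for example by starting it with ./

So in your parse method you should use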

regione_name = regione.xpath('./th[@scope="rowgroup"]//text()').get()

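Otherwise you only get the first th in the whole document, because an XPath expression that starts with // is evaluated from the document root even when it is called on a row selector. The same applies to the //td/a/@href queries, which return every link in the table for every row.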

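Another tip for your use case is to use response.follow instead of building the Requests by hand as you do now. For example, your parse method (the other methods are almost the same) could become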

def parse(self, response):
    for regione in response.xpath('//table[@id="panoramica"]/tbody/tr'):
        regione_name = regione.xpath('./th[@scope="rowgroup"]//text()').get()
        if not regione_name:
            continue

        for link in regione.xpath("./td/a"):
            yield response.follow(link, callback=self.parse_provincia, meta=...)


python

scrapy