Scrapy: handling a missing path

Question

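I am building a forum scraper for a university project. The forum page I am working with is: https://www.eurobricks.com/forum/index.php?/forums/topic/163541-lego-ninjago-2019/&tab=comments#comment-2997338. I can extract all the information I need except the location. That information is stored at the following path.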

<li class="ipsType_light">
<span class="fc">Country_name</span>
</li>

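The problem is that sometimes this information, and the path itself, does not exist, and my current solution cannot handle that. This is the code I wrote to get the location.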

location_path = "//span[@class='fc']/text()"

def parse_thread(self, response):
    comments = response.xpath("//*[@class='cPost_contentWrap ipsPad']")
    username = response.xpath(self.user_path).extract()
    x = len(username)

    if x > 0:
        score = response.xpath(self.score_path).extract()
        content = ["".join(comment.xpath(".//*[@data-role='commentContent']/p/text()").extract()) for comment in comments]
        date = response.xpath(self.date_path).extract()
        # Page-wide query: posts without a location contribute nothing,
        # so this list can be shorter than username and its indices shift.
        location = response.xpath(self.location_path).extract()

    for i in range(x):
        yield {
            "title": title,
            "category": category,
            "user": username[i],
            "score": score[i],
            "content": content[i],
            "date": date[i],
            "location": location[i]
        }

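One possible solution I tried was checking the length of location, but it did not work. At the moment the code produces the following result (sample data):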

Title | category | test1 | 502 | 22 june 2020 | correct country
Title | category | test2 | 470 | 22 june 2020 | wrong country (it takes the next user country)
Title | category | test3 | 502 | 28 june 2020 | correct country

What I want to achieve is:

Title | category | test1 | 502 | 22 june 2020 | correct country
Title | category | test2 | 470 | 22 june 2020 | Not available
Title | category | test3 | 502 | 28 june 2020 | correct country

Answer 1

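The solution to my problem was to stop selecting each specific piece of information one by one across the whole page. Instead, I first had to select the whole block that contains all the information for one post, and only then select the individual pieces I need from within that block.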
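The idea can be sketched roughly as follows. To keep the example runnable without a crawl, it uses the standard library's xml.etree.ElementTree on a made-up, simplified snippet. The `article`/`h3` markup, the `SAMPLE` data, the country names and the `extract_posts` helper are all assumptions for illustration, not the real Eurobricks page structure or the code from the original spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: three posts, the second has no location.
SAMPLE = """
<div>
  <article class="post"><h3>test1</h3>
    <ul><li class="ipsType_light"><span class="fc">Italy</span></li></ul>
  </article>
  <article class="post"><h3>test2</h3><ul></ul></article>
  <article class="post"><h3>test3</h3>
    <ul><li class="ipsType_light"><span class="fc">Spain</span></li></ul>
  </article>
</div>
"""


def extract_posts(markup):
    """Return one dict per post, looking up each field inside its own block."""
    root = ET.fromstring(markup)
    rows = []
    # Step 1: select the whole block for each post.
    for post in root.findall(".//article[@class='post']"):
        rows.append({
            # Step 2: query relative to that block only.
            "user": post.findtext("h3", default="Not available"),
            # A post without the span gets the default instead of
            # borrowing the next post's value.
            "location": post.findtext(".//span[@class='fc']",
                                      default="Not available"),
        })
    return rows


rows = extract_posts(SAMPLE)
for row in rows:
    print(row["user"], "|", row["location"])
```

In a Scrapy callback the same pattern would presumably be a loop over the post blocks, for example `for post in response.xpath(...)`, with a relative query such as `post.xpath(".//span[@class='fc']/text()").get(default="Not available")` for each field. The XPath that selects one whole post on the real page has to be worked out from the page itself.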
