Scraping a page with Scrapy when many elements share the same class names

Question

I'm new to Scrapy and am trying to scrape a website, but its HTML is made up of many nested elements that reuse the same class names, for example:

<section class="pi-item pi-smart-group pi-border-color">
    <section class="pi-smart-group-head">
        <h3 class="pi-smart-data-label pi-data-label pi-secondary-font pi-item-spacing"></h3>
    </section>
    <section class="pi-smart-group-body">
        <div class="pi-smart-data-value pi-data-value pi-font pi-item-spacing">
            <a href="abc" title="!! What I want !!"> </a>
        </div>
    </section>
</section>

My problem is that this structure is repeated for many other elements, so when I use response.css I get back multiple elements that I don't want.

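(Basically, I want to scrape each Pokémon's information, such as "Type", "Species" and "Ability", from pages like https://pokemon.fandom.com/wiki/Bulbasaur. I already have the URLs for all the Pokémon, but I'm stuck on extracting the information from each page.)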

Answer 1

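I tried building this Scrapy project myself and got the results you are after. The problem I see is that you used CSS selectors. You can scrape with them, but XPath selectors work better here because they give you more flexibility to select the specific tags you want. Here is the code I wrote for you. Keep in mind that I put it together quickly just to get your result. It works, and I kept it simple so that it is easy to follow, since you are new to Scrapy. Let me know if this helps.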

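import scrapy


class PokemonSpiderSpider(scrapy.Spider):
    name = 'pokemon_spider'
    start_urls = ['https://pokemon.fandom.com/wiki/Bulbasaur']

    def parse(self, response):
        # Type: titles of the links in the first value cell whose class attribute
        # is exactly 'pi-data-value pi-font' (position-based, so check it per page).
        pokemon_type = response.xpath("(//div[@class='pi-data-value pi-font'])[1]/a/@title")
        # Species and abilities: scope by the data-source attribute instead of
        # the repeated class names.
        pokemon_species = response.xpath('//div[@data-source="species"]//div/text()')
        pokemon_abilities = response.xpath('//div[@data-source="ability"]/div/a/text()')

        # getall() returns every match as a list of strings.
        yield {
            'pokemon type': pokemon_type.getall(),
            'pokemon species': pokemon_species.getall(),
            'pokemon abilities': pokemon_abilities.getall()
        }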

Answer 2

You can use XPath expressions that match on each property's label text:

abilities = response.xpath('//h3[a[.="Abilities"]]/following-sibling::div[1]/a/text()').getall()
species = response.xpath('//h3[a[.="Species"]]/following-sibling::div[1]/text()').get()

web-crawler

scrapy