Trying to resolve a Scrapy Python for loop

Question

If possible, I would like some help scraping a few details from this web page:
https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&period=today&gf=13

The structure is as follows:

[Image: Webpage data structure]

[Image: Webpage data structure expanded]

I can retrieve all the track links with:

response.css("div.trk-cell.title a").xpath("@href").extract()

or

response.xpath("//div[@class='trk-cell title']/a/@href").getall()

I can retrieve all the artist links with:

response.css("div.trk-cell.artists a").xpath("@href").extract()

or

response.xpath("//div[@class='trk-cell artists']/a/@href").getall()

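So now I am trying to write a loop that extracts every title and artist on the page and stores each pair together in a CSV or JSON file. I am struggling to work out the for loop. I have been trying the following, without success.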

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "traxsourcedeephouse"
    start_urls = ['https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&gf=13']

    def parse(self, response):
        for track in response.css("div.trklist.v-.full.v5"):
            yield {
                'link': track.xpath("//div[@class='trk-cell title']/a/@href").get(),
                'artists': track.xpath("//div[@class='trk-cell artists']/a/@href").get()
            }

As far as I can tell, the "trklist" div wraps both the artist and title divs, so I am not sure why this code does not work.

I tried the following command in the Scrapy shell. It returns no results, which I suspect is the problem, but why doesn't it?

response.css("div.trklist.v-.full.v5")

A push in the right direction would be very helpful, thanks.

Answer 1

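In the Scrapy shell, run view(response) to open the response in a web browser. You will see that there is no data, because the data is generated dynamically with JavaScript, which Scrapy does not execute. You should use Selenium or something similar.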

Answer 2

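You only select the table that contains the items, not the items themselves, so you are not really looping over them.
The class list on the table is a little different in the HTML that Scrapy receives (there is no v5), so the selector has to match that.
Inside the loop, you are missing a leading dot in track.xpath(...). Without it, the // expression searches the whole document instead of the current row.
Note that I added "hdr" in the code. I did this to skip the table's header row.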

I have included both a CSS and an XPath version of the for loop (both work, pick one):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "traxsourcedeephouse"
    start_urls = ['https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&gf=13']

    def parse(self, response):
        # for track in response.css('div.trklist.v-.full div.trk-row:not(.hdr)'):
        for track in response.xpath('//div[@class="trklist v- full init-invis"]/div[not(contains(@class, "hdr"))]'):
            yield {
                'link': track.xpath(".//div[@class='trk-cell title']/a/@href").get(),
                'artists': track.xpath(".//div[@class='trk-cell artists']/a/@href").get()
            }

scrapy