How to scrape multiple JSON pages with CrawlSpider?

Question

I have a JSON file that I want to scrape: https://www.website.com/api/list?limit=50&page=1

I can scrape all the pages using 'scrapy.Spider', but I would prefer to use 'CrawlSpider' if possible.

I tried using:

    start_urls=['https://www.website.com']
    rules = (
        Rule(LinkExtractor(allow=r'/api/list\?.+page=\d+'), callback='parse_page', follow=True),
    )

and (just to see whether it even gets the first page):

    start_urls=['https://www.website.com']
    rules = (
        Rule(LinkExtractor(allow=r'/api/list'), callback='parse_page', follow=True),
    )

and neither of them worked.

Is there a way to do this with 'CrawlSpider'?

Answer 1

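This is not possible with CrawlSpider.
The LinkExtractor that CrawlSpider uses to process its Rules can only extract links from HTML responses (by default from the a and area tags). A JSON API response contains no such tags, so the rules never match anything there.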


scrapy