How to use Scrapy to scrape data from a website which has an option to Load More posts?

Question

I want to use Scrapy to get the title and date of every post on the following site: https://economictimes.indiatimes.com/markets/stocks/recos I am new to Scrapy and can't figure out how to load more posts and scrape them.

This is the code I wrote by following a tutorial, but it only scrapes the first few posts.

import scrapy

class PostsSpider(scrapy.Spider):
    name="posts"

    start_urls=[
        'https://economictimes.indiatimes.com/markets/stocks/recos'
    ]

    def parse(self,response):
        for post in response.css('div.eachStory'):
            yield{
                'title': post.css('a::text').get(),
                'date' : post.css('time::text').get()
            }
            next_page=response.css('div.autoload_continue').get()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

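I am running scrapy crawl posts -o posts.csv to export the results as CSV. I'm not sure whether it is possible to get all the posts. Any help would be appreciated, thanks in advance.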

Answer 1

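As far as I can tell, div.autoload_continue does not contain any link. It behaves like a button: clicking it triggers a request made from JavaScript. You can find the endpoint it requests by looking in Devtools > Network.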

Here is what I see. On first load, the site requests
https://economictimes.indiatimes.com/lazyloadlistnew.cms?msid=3053611&curpg=1&img=1
Then, when I scroll down, it requests
https://economictimes.indiatimes.com/lazyloadlistnew.cms?msid=3053611&curpg=2&img=1
When I click Load More, it requests
https://economictimes.indiatimes.com/lazyloadlistnew.cms?msid=3053611&curpg=3&img=0
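The three requests above differ only in their query string, so the URLs can be generated instead of copied by hand. A minimal sketch using only the standard library; the helper name lazy_page_url is made up for illustration, and defaulting img to 0 is an assumption based on the Load More request:

```python
from urllib.parse import urlencode

# Endpoint observed in the browser's Network tab.
BASE = "https://economictimes.indiatimes.com/lazyloadlistnew.cms"


def lazy_page_url(msid, page, img=0):
    """Build the lazy-load URL for one page of the article list.

    Hypothetical helper: parameter order mirrors the observed requests.
    """
    query = urlencode({"msid": msid, "curpg": page, "img": img})
    return f"{BASE}?{query}"


# URLs for the first three pages of list 3053611.
urls = [lazy_page_url(3053611, page) for page in range(1, 4)]
```

Pasting one of the generated URLs into a browser is a quick way to check that the endpoint returns the posts you expect before writing a spider around it.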

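Look at the curpg parameter: it increments with each request, so it is the page number. You can simply iterate over numbers to change curpg.
The img parameter is a toggle for showing images.
As for the msid parameter, it is the id of the article list. You can find its value in the metadata in the page head: <meta content="https://economictimes.indiatimes.com/markets/stocks/recos/articlelist/3053611.cms" property="og:url">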

python

scrapy