Can't get each <p> tag for every comment

Question

I'm trying to scrape the comments on a video. Using scrapy, I can easily get everything from this page except the body of each individual comment: https://tamasha.com/v/KGbXY

from scrapy.selector import Selector

    def crawl_comment(self, video_id):
        video_url = f"https://www.tamasha.com/v/{video_id}"
        # response = self.request.get_request(video_url, proxy=self.proxy, timeout=30, headers=None)
        response = self.request.get_request(video_url, timeout=30, headers=None)
        if response.status_code == 404:
            raise VideoNotFoundException()
        comment_information = Selector(text=response.text).xpath(
            '//*[@class="comment-item"]').getall()
        comment_data_list = []
        for comment_info in comment_information:
            video_id = video_id
            author_username = None
            try:
                author_username = Selector(text=comment_info).xpath('string(//*[@class="user-name"])').get()
            except:
                pass
            author_id = None
            try:
                author_id = Selector(text=comment_info).xpath('//*[@class="user-name"]/@href').get()
                author_id = author_id.split('/')[-1]
            except:
                pass
            date = Selector(text=response.text).xpath('//*[@class="comment-time"]/text()').get()
            body = Selector(text=response.text).css('#commentBox > div:nth-child(2) > div.more-comment > p').get
            id = Selector(text=response.text).xpath('//*[@class="comment-item"]/@data-comment-id').get()
            comment_data_list.append({
                'author_username': author_username,
                'author_id': author_id,
                'date': date,
                'body': body,
                'id': id,
                'video_id': video_id
            })
        print(comment_data_list)

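I want to get the text of each comment, but I can't. The code that is supposed to fetch it is the line that assigns the body field.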

Answer 1

I don't know the language used on the page, but here is how to extract the comments with a CSS selector.

# selector is `Selector(text=response.text)`
selector.css('div.comment-item > .comment .comment-header + p::text').getall()

# explanation of css selector
# >, direct child
# space, descendant
# +, adjacent sibling

This is the output:

# the second comment seems to be a reply to another comment
['چه خوب بودن همشون', 'عالی بودن', 'nice']

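By the way, are you using Scrapy only for its parser? In Scrapy, response.xpath() is a shortcut that runs the query on a Selector built from the response, so you don't need to construct Selector(text=response.text) yourself.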

If you are not using Scrapy and only want the parser, run pip install parsel and use parsel.Selector. parsel is the parsing library that Scrapy builds on, and it can be used on its own.

python

scrapy

web-scraping