Scrape data from div as shown on the page

Question

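I am trying to scrape data from this URL, https://eksisozluk.com/mortingen-sitraze--1277239. I want to scrape the title and then all the comments below the title. If you open the website, you will see that the first comment under the title is (bkz: mortingen). The problem is that "(bkz:" is inside the div while "mortingen" is inside an anchor link within that div, so it is hard to scrape the data the way it is shown on the website. Can anyone help me with a CSS selector or XPath that scrapes every comment as it appears on the page? My code is below, but it gives me "(bkz:", then "akhisar", then ")" in three separate columns instead of one.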

def parse(self, response):
    data = {}
    title = response.css('[itemprop="name"]::text').get()
    data["title"] = title
    count = 0
    # every text node under `.content` is selected separately here
    for content in response.css('li .content ::text'):
        text = content.get().strip()
        data["content" + str(count)] = text
        count = count + 1
    yield data

Answer 1

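You should first select every .content element without ::text, then use a for loop to process each .content separately. For each .content, run ::text to get only the text nodes inside that element, put them in a list, and then join the list into a single string.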

        for count, content in enumerate(response.css('li .content')):
            text = []

            # get all `::text` in current `.content`
            for item in content.css('::text'):
                item = item.get()#.strip()
                # put on list
                text.append(item)

            # join all items in single string
            text = "".join(text)
            text = text.strip()

            print(count, '|', text)
            data[f"content {count}"] = text
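
To see why the join is needed, the sketch below reproduces the situation using only the standard library. The HTML fragment is a hypothetical stand-in modeled on the entry described in the question (the real page markup may differ), and `xml.etree.ElementTree` is used instead of Scrapy selectors purely for illustration. The "(bkz: ", the link text, and the closing ")" are three separate text nodes, so selecting every text node on its own yields three values, while joining the text nodes of each `.content` element yields one.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on the question; the real page markup may differ.
HTML = """
<ul>
  <li><div class="content">(bkz: <a href="/?q=mortingen">mortingen</a>)</div></li>
  <li><div class="content">second entry, plain text</div></li>
</ul>
"""

root = ET.fromstring(HTML.strip())

# Stand-in for the `li .content` selection: every div with class "content".
contents = [div for div in root.iter("div") if div.get("class") == "content"]

# Roughly what `li .content ::text` gives: each text node as its own value.
separate_nodes = [piece for div in contents for piece in div.itertext()]

# What the answer does: join the text nodes of each element into one string.
joined_per_element = ["".join(div.itertext()).strip() for div in contents]

print(separate_nodes)
print(joined_per_element)
```

The first list has one item per text node, which is what produced the three separate columns; the second has one item per comment.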

Minimal working code.

You can put all the code in one file and run it with python script.py, without creating a Scrapy project.

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = ['https://eksisozluk.com/mortingen-sitraze--1277239']

    def parse(self, response):
        print('url:', response.url)

        data = {}  # PEP8: spaces around `=`

        title = response.css('[itemprop="name"]::text').get()
        data["title"] = title

        for count, content in enumerate(response.css('li .content')):
            text = []

            for item in content.css('::text'):
                item = item.get()#.strip()
                text.append(item)

            text = "".join(text)
            text = text.strip()

            print(count, '|', text)
            data[f"content {count}"] = text

        yield data
    
# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(MySpider)
c.start()

EDIT:

Slightly shorter with getall():

        for count, content in enumerate(response.css('li .content')):

            text = content.css('::text').getall()

            text = "".join(text)
            text = text.strip()

            print(count, '|', text)
            data[f"content {count}"] = text

python

xpath

css-selectors

scrapy

web-scraping