Get text from the table cells of a column

Question

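I am new to Scrapy and am scraping a Wikipedia page like this one, which has several tables. My goal is to get all the text from the first column of each table and append each cell's text to a list of strings.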

Some of the text is part of a link.

For example, take this table column. The first cell has the text "Double steaming" inside an anchor element, but it also contains the text "/ double boiling".

I tried:

for table in response.css('.wikitable'):
    table.css('td:nth-child(1) ::text').get()

But this only gets the text of the first cell of each table, not the text of the whole column:

'Double steaming'

Then I tried using getall:

for table in response.css('.wikitable'):
    table.css("td:nth-child(1) ::text").getall()

But this returns every piece of text in the first column as a separate item:

['Double steaming', ' / double boiling', 'Red cooking', 'Stir frying']

This is the output I want:

['Double steaming / double boiling', 'Red cooking', 'Stir frying']

How can I do this with Scrapy?

Answer 1

It is hard to test without the specific page, but the solution I can think of is to iterate over each cell and join its text.

for table in response.css('.wikitable'):
    # Select the first-column cells as elements, not as text nodes
    cells = table.css("td:nth-child(1)")
    column_data = []
    for cell in cells:
        # Join all text nodes inside this cell into one string
        data = "".join(cell.css('::text').getall())
        # could perform additional cleaning here
        column_data.append(data.strip())
    print(column_data)

In theory, column_data should be ['Double steaming / double boiling', 'Red cooking', 'Stir frying'].

python

css-selectors

scrapy

web-scraping