Scrapy, how to extract h3 content?

Question

I want to extract content from every div class="summary" on a web page. Within each summary div, I want to pull out the data held in each class inside it.

Here is my snippet.

questions = Selector(response).xpath('//div[@class="summary"]')
for question in questions:
    item = StackItem()
    # get the hyperlink of h3 text
    item['title'] = question.xpath('a[@h3]/text()').extract()[0]
    yield item

How should I write the XPath expression in my code?

Answer 1

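Your second XPath looks for an a element that is a direct child of div[@class="summary"] and that has an attribute named h3. No such element exists in the HTML you posted: h3 is a tag there, not an attribute.

The correct XPath, relative to the div, for the text of the a element inside the h3 is: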

h3/a/text()

Answer 2

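Another way to write it is to select the h3 elements directly:

questions = Selector(response).xpath('//div[@class="summary"]/h3')

Then, to get the data from the <a>, use a path relative to each h3 (no leading slash, which would make the path absolute):

item['title'] = question.xpath('a/text()').extract()[0]

This is useful if all the data you want to extract is inside the h3 tag.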

Tags: css, python, xpath, web-crawler, scrapy