python: Scrapy get all text tags in a section

Question

I want to use Scrapy to get the text of every text-bearing tag inside a section tag, such as h1, p, span, strong and so on, while ignoring other tags such as img:

<section>
<h1>text</h1>
<h2>text</h2>

<span>text</span>

<img>text</img>

<p>text</p>
<p>text</p>
<p>text</p>
</section>

My starting code looks like this:

import scrapy


class example(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.example.com']
    start_urls = ['example']

    def parse(self, response):
        self.log('//////////////////////////////////////////////////////////////')
        section = response.xpath('//section')
        for p in section.xpath('.//p/text()'):
            self.log('//////////////////////////////////////////////////////////////')
            self.log(p.extract())

As I said, I need to get the text of any text tag, not just the p tags. Is there a way to do this?

Answer 1

One way to do this is to loop over every HTML element inside the section and filter the elements by tag name:

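def parse(self, response):
    # Tag names whose text should be collected; anything else (e.g. img) is skipped.
    req_tags = ['h1', 'p', 'span', 'strong']
    section_selector = response.css('section')
    for section in section_selector:
        texts = []
        # '*' matches every element in the section.
        for tag in section.css('*'):
            if tag.root.tag in req_tags:
                # '*::text' returns the text of this tag and of everything nested
                # in it, so a listed tag inside another listed tag is collected twice.
                texts.extend(tag.css('*::text').getall())
        self.log(texts)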

With this approach, every tag name you want must be listed explicitly in req_tags.

python

scrapy