Remove first tag html using python & scrapy

Question

I have this HTML:

<div class="abc">
            <div class="xyz">
                <div class="needremove"></div>
                <p>text</p>
                <p>text</p>
                <p>text</p>
                <p>text</p>
            </div>
    </div>

I used: response.xpath('//div[contains(@class,"abc")]/div[contains(@class,"xyz")]').extract()

Result:

[u'<div class="xyz">
        <div class="needremove"></div>
        <p>text</p>
        <p>text</p>
        <p>text</p>
        <p>text</p>
    </div>']

I want to remove <div class="needremove"></div>. Can you help me?

Answer 1

You can get all the child tags except the div with class="needremove":

response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]/*[local-name() != "div" and not(contains(@class, "needremove"))]').extract()

Demo from the shell:

$ scrapy shell index.html
In [1]: response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]/*[local-name() != "div" and not(contains(@class, "needremove"))]').extract()
Out[1]: [u'<p>text</p>', u'<p>text</p>', u'<p>text</p>', u'<p>text</p>']
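
The XPath above returns the <p> children as separate strings, and as written it skips every div child, not only the one with class "needremove". If the goal is the "xyz" block itself with that one child removed, one option is to delete the node from the parsed tree and serialize the parent. Below is a minimal sketch that uses only the standard library's xml.etree.ElementTree rather than Scrapy's selectors. It assumes the fragment is well-formed XML, which real pages often are not, and drop_children_with_class is a hypothetical helper name, not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, treated here as well-formed XML (an assumption).
HTML = """<div class="abc">
    <div class="xyz">
        <div class="needremove"></div>
        <p>text</p>
        <p>text</p>
        <p>text</p>
        <p>text</p>
    </div>
</div>"""


def drop_children_with_class(markup, parent_class, unwanted_class):
    """Serialize the first div whose class equals parent_class, without the
    direct children whose class attribute lists unwanted_class."""
    root = ET.fromstring(markup)
    # ElementTree's XPath subset has no contains(), so the class is matched exactly.
    parent = root.find(f".//div[@class='{parent_class}']")
    if parent is None:
        return None
    # Iterate over a copy: removing while iterating the live list skips elements.
    for child in list(parent):
        if unwanted_class in child.get("class", "").split():
            parent.remove(child)
    return ET.tostring(parent, encoding="unicode")


cleaned = drop_children_with_class(HTML, "xyz", "needremove")
print(cleaned)
```

In a spider, the markup string would typically come from the selector result for the "xyz" div instead of a hard-coded constant. For HTML that is not well-formed, an HTML-aware parser would be needed in place of ElementTree.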
