Using scrapy selector with conditions

Question

I'm using "scrapy" to crawl some articles, such as this one: https://fivethirtyeight.com/features/championships-arent-won-on-paper-but-what-if-they-were/
In my spider I use the following code:

    def parse_article(self, response):
        il = ItemLoader(item=Scrapping538Item(), response=response)
        il.add_css('article_text', '.entry-content *::text')

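...which works. But I'd like to make this CSS selector a bit more sophisticated. Right now I'm extracting every text passage. But looking at the article, there are tables and visualizations in it, and those contain text too. The HTML structure looks like this: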

<div class="entry-content single-post-content">
    <p>text I want</p>
    <p>text I want</p>
    <p>text I want</p>
    <section class="viz">
        <header class="viz">
            <h5 class="title">TITLE-text</h5>
            <p class="subtitle">SUB-TITLE-text</p>
        </header>
        <table class="viz full">TABLE DATA</table>
    </section>
    <p>text I want</p>
    <p>text I want</p>
</div>

With the code snippet above, I get something like:

text I want
text I want
text I want
TITLE-text <<<< (text I don't want)
SUB-TITLE-text <<<< (text I don't want)
TABLE DATA <<<< (text I don't want)
text I want
text I want

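My questions:

How can I modify the add_css() call so that it picks up all the text except the text inside the table?
Would it be easier with the add_xpath function?
In general, what is the best practice for this (extracting text under conditions)?

Feedback is much appreciated.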

Answer 1

You can get the desired output using XPath and the ancestor axis:

'//*[contains(@class, "entry-content")]//text()[not(ancestor::*[@class="viz"])]'

Answer 2

Unless I'm missing something crucial, the following xpath should work:

import scrapy
import w3lib.html  # import the submodule explicitly so w3lib.html.remove_tags resolves

raw = response.xpath(
    '//div[contains(@class, "entry-content") '
    'and contains(@class, "single-post-content")]/p'
).extract()

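This omits the table content and only yields the text in paragraphs and links, as a list. But there's a catch! Since we didn't use /text(), all the <p> and <a> tags are still there. Let's remove them: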

cleaned = [w3lib.html.remove_tags(block) for block in raw]

Answer 3

Use > in your CSS expression to limit it to children (direct descendants).

.entry-content > *::text

python

web-crawler

css-selectors

scrapy