Extract & save all text() with Scrapy

Question

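I want to extract text from HTML files using Python. The output should be essentially what I would get by copying the text from a browser and pasting it into Notepad. I need to use a framework for this. Taking the page https://en.wikipedia.org/wiki/Main_Page as an example, I want to extract 100 pages without leaving the domain en.wikipedia.org.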

Answer 1

Here is a simple, basic code example that covers the core of what you need.

from scrapy import Spider


class Foo(Spider):
    # URLs in start_urls are requested when the spider starts;
    # their responses are passed to the default callback, "parse"
    start_urls = ["https://en.wikipedia.org/wiki/Main_Page"]
    name = "basic_spider"

    def parse(self, response):
        # use css or xpath selectors to extract text
        print(response.css("::text").extract())

Save the above as spider.py and run it with

scrapy runspider spider.py

Start with the Scrapy tutorial. If you find anything unclear or in need of improvement, feel free to improve the docs; they are hosted on GitHub.

Of course, you have to learn Python first, so if you don't know Python yet, start there.

python

web-crawler

scrapy