Scrape All External Links from Multiple URLs in a Text File with Scrapy

Question

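I am new to Scrapy and a beginner in Python. I want Scrapy to read a text file containing a seed list of about 100k URLs, visit each URL, extract all external URLs (URLs of other sites) found on each seed URL, and export the results to a separate text file.

Scrapy should only visit the URLs in the text file; it should not crawl or follow any other URL.

I want Scrapy to work as fast as possible. I have a very powerful server with a 1 Gbps line. Every URL in my list is from a unique domain, so I won't be hitting any single site hard and shouldn't run into IP blocks.

How would I go about creating a Scrapy project that extracts all external links from a list of URLs stored in a text file?

Thanks.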

Answer 1

You should use:
1. The start_requests method to read the list of URLs.
2. A css or xpath selector that matches all "a" HTML elements.

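from scrapy import Field, Item, Request, Spider

class YourItem(Item):
    parent_url = Field()  # the seed URL that was visited
    child_urls = Field()  # every href found on that page

class YourSpider(Spider):
    name = "your_spider"

    def start_requests(self):
        with open('your_input.txt', 'r') as f:  # read the list of urls
            for line in f:                      # process each of them
                url = line.strip()              # drop the trailing newline
                if url:                         # skip blank lines
                    yield Request(url, callback=self.parse)

    def parse(self, response):
        # No new requests are yielded here, so only the seed URLs are visited
        item = YourItem(parent_url=response.url)
        item['child_urls'] = response.css('a::attr(href)').extract()
        return item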
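
Note that the selector above returns every href on the page, including relative and same-site links, while the question asks only for external ones. A minimal sketch of one way to filter them, using only the standard library, is below. The helper name external_links is hypothetical, and it assumes "external" means "a different host than the page itself", so www.example.com and example.com would count as different sites.

```python
from urllib.parse import urljoin, urlparse


def external_links(page_url, hrefs):
    """Return absolute http(s) links from hrefs whose host differs from page_url's host."""
    page_host = urlparse(page_url).netloc.lower()
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative and protocol-relative links against the page URL
        absolute = urljoin(page_url, href.strip())
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel: and similar
        if parts.netloc.lower() == page_host:
            continue  # same host, so this is an internal link
        if absolute not in seen:  # keep the first occurrence only
            seen.add(absolute)
            result.append(absolute)
    return result
```

Inside parse, this could be applied as item['child_urls'] = external_links(response.url, response.css('a::attr(href)').extract()).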

For more information on start_requests, see:
http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests

To export the scraped items to another file, use an item pipeline or a feed export. A basic pipeline example is here:
http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-a-json-file
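
As a rough illustration of the pipeline route, the sketch below writes one tab-separated "parent_url, child_url" line per link to a plain text file. The class name ExternalLinksTextPipeline and the output file name are assumptions made for this example; the method names (open_spider, close_spider, process_item) follow the pipeline example in the Scrapy documentation linked above, and it expects items shaped like the ones the spider returns.

```python
class ExternalLinksTextPipeline:
    """Hypothetical pipeline: writes one 'parent_url<TAB>child_url' line per link."""

    def __init__(self, path="external_links.txt"):
        self.path = path  # assumed output file name
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        for child in item["child_urls"]:
            self.file.write(f"{item['parent_url']}\t{child}\n")
        return item  # pass the item on to any later pipeline
```

To enable it, the class would be listed in the project's ITEM_PIPELINES setting, for example {"yourproject.pipelines.ExternalLinksTextPipeline": 300}, where the module path is a placeholder for your own project. If a text file is not a hard requirement, the feed export route needs no code at all: scrapy crawl your_spider -o links.jl writes one JSON object per line.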

Tags: python, url, web-crawler, scrapy, web-scraping