Scrapy - cannot list deeper links
I need to create a list of a website's URLs. I am using Scrapy 2.3.0 for this.
The problem is that the result ('item_scraped_count') is 63 links, but I know there are more.
Is there any way to process the deeper levels and extract those URLs?
My code is below:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Item, Field

class UrlItem(Item):
    url = Field()

class RetriveUrl(CrawlSpider):
    name = 'retrive_url'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_url'),
    )

    def parse_url(self, response):
        item = UrlItem()
        item['url'] = response.url
        return item
You should let the crawl follow links to the deeper levels. Because your rule sets a callback, follow defaults to False, so only links found on the start page are scraped. Try this:
Rule(LinkExtractor(), callback='parse_url', follow=True),
From the Scrapy documentation: follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
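For reference, a minimal sketch of the full spider with the corrected rule; it reuses the class and item names from the question, and follow=True is the only change:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Item, Field

class UrlItem(Item):
    url = Field()

class RetriveUrl(CrawlSpider):
    name = 'retrive_url'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    rules = (
        # follow=True keeps extracting links from every matched response,
        # not just from the start pages.
        Rule(LinkExtractor(), callback='parse_url', follow=True),
    )

    def parse_url(self, response):
        # Yield one item per visited page.
        item = UrlItem()
        item['url'] = response.url
        return item

If you want to cap how deep the crawl goes, Scrapy's DEPTH_LIMIT setting controls it (the default of 0 means unlimited depth), e.g. scrapy crawl retrive_url -s DEPTH_LIMIT=3.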