Scrapy - cannot list deeper links
I need to create a list of a website's URLs. I am using Scrapy 2.3.0 for this.
The problem is that the result ('item_scraped_count') is 63 links, but I know there are more.
Is there any way to process the deeper levels and extract those URLs?
My code is below:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Item, Field

class UrlItem(Item):
    url = Field()

class RetriveUrl(CrawlSpider):
    name = 'retrive_url'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_url'),
    )

    def parse_url(self, response):
        item = UrlItem()
        item['url'] = response.url
        return item
You should let the crawl follow links to the deeper levels. Because your rule sets a callback, follow defaults to False, so only links found on the start page are scraped. Try this:
Rule(LinkExtractor(), callback='parse_url', follow=True),
From the Scrapy documentation: follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
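For reference, a minimal sketch of the full spider with the corrected rule; it reuses the class and item names from the question, and follow=True is the only change:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Item, Field

class UrlItem(Item):
    url = Field()

class RetriveUrl(CrawlSpider):
    name = 'retrive_url'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    rules = (
        # follow=True keeps extracting links from every matched response,
        # not just from the start pages.
        Rule(LinkExtractor(), callback='parse_url', follow=True),
    )

    def parse_url(self, response):
        # Yield one item per visited page.
        item = UrlItem()
        item['url'] = response.url
        return item

If you want to cap how deep the crawl goes, Scrapy's DEPTH_LIMIT setting controls it (the default of 0 means unlimited depth), e.g. scrapy crawl retrive_url -s DEPTH_LIMIT=3.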