Scrapy crawling at depth not working

Question

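I am writing Scrapy code to crawl the first page and one additional level of depth from a given web page.

Somehow my crawler does not enter the additional depth. It only crawls the given start URLs and then finishes.

I added the filter_links callback (process_links in the code below), but even that is not being called, so the rules are clearly being ignored. What could be the reason, and what can I change to make the spider follow the rules?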

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from crawlWeb.items import CrawlwebItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DmozSpider(CrawlSpider):
    name = "premraj"
    start_urls = [
        "http://www.broadcom.com",
        "http://www.qualcomm.com"
    ]
    rules = [Rule(SgmlLinkExtractor(), callback='parse', process_links="process_links", follow=True)]

    def parse(self, response):
        #print dir(response)
        item = CrawlwebItem()

        item["html"] = response.body
        item["url"] = response.url
        yield item

    def process_links(self, links):
        print links
        print "hey!!!!!!!!!!!!!!!!!!!!!"

Answer 1

There is a warning box in the CrawlSpider documentation. It says:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

Your code probably does not work as expected because you use parse as the callback.

python

scrapy

scrapy-spider