Issues on following links in scrapy
I want to scrape a site that lists websites under multiple categories. Browsing the pages starting from the first category, my goal is to collect every web page together with its category. I have collected the sites of the first category, but the spider stops there and never reaches the second category.
A draft example:
My code:
import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from final.items import DmozItem


class my_spider(CrawlSpider):
    name = 'heart'
    allowed_domains = ['greek-sites.gr']
    start_urls = ['http://www.greek-sites.gr/categories/istoselides-athlitismos']

    rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse', follow=True),)

    def parse(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        categories = response.xpath('//a[contains(@href, "categories")]/text()').extract()
        for category in categories:
            item = DmozItem()
            item['title'] = response.xpath('//a[contains(text(),"gr")]/text()').extract()
            item['category'] = response.xpath('//div/strong/text()').extract()
        return item
The problem is simple: the callback must be different from parse, so I suggest you rename the method to parse_site; then the crawl can continue.
It will work if you make the following changes:
    rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse_site', follow=True),)

    def parse_site(self, response):
The reason is explained in the docs:
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.