Issues on following links in scrapy
I want to scrape a site that lists websites under multiple categories. Browsing the pages starting from the first category, my goal is to collect every web page together with its category. I have collected the sites of the first category, but the spider stops there and never reaches the second category.
A draft example:
My code:
import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from final.items import DmozItem


class my_spider(CrawlSpider):
    name = 'heart'
    allowed_domains = ['greek-sites.gr']
    start_urls = ['http://www.greek-sites.gr/categories/istoselides-athlitismos']

    rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse', follow=True),)

    def parse(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        categories = response.xpath('//a[contains(@href, "categories")]/text()').extract()
        for category in categories:
            item = DmozItem()
            item['title'] = response.xpath('//a[contains(text(),"gr")]/text()').extract()
            item['category'] = response.xpath('//div/strong/text()').extract()
        return item
The problem is simple: the callback must be different from parse, so I suggest you rename the method to parse_site; then the crawl can continue.
It will work if you make the following changes:
    rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse_site', follow=True),)

    def parse_site(self, response):
The reason is explained in the docs:
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.