How can I reuse the parse method of my scrapy Spider-based spider in an inheriting CrawlSpider?
I currently have a Spider-based spider that I wrote for crawling an input JSON array of start_urls:
from scrapy.spider import Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader
import json
import datetime
import re
class AtlanticFirearmsSpider(Spider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls='[]', *args, **kwargs):
        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)
        self.start_urls = json.loads(start_urls)

    def parse(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        product = l.load_item()
        return product
I can call it from the command line like this, and it does a fine job:
scrapy crawl atlantic_firearms -a start_urls='["http://www.atlanticfirearms.com/component/virtuemart/shipping-rifles/ak-47-receiver-aam-47-detail.html", "http://www.atlanticfirearms.com/component/virtuemart/shipping-accessories/nitride-ak47-7-62x39mm-barrel-detail.html"]'
However, I'm now trying to add a CrawlSpider-based spider that crawls the whole site, inherits from it, and reuses the parse method logic. My first attempt looked like this:
class AtlanticFirearmsCrawlSpider(CrawlSpider, AtlanticFirearmsSpider):
    name = "atlantic_firearms_crawler"
    start_urls = [
        "http://www.atlanticfirearms.com"
    ]
    rules = (
        # I know, I need to update these to LxmlLinkExtractor
        Rule(SgmlLinkExtractor(allow=['detail.html']), callback='parse'),
        Rule(SgmlLinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion'])),
    )
Running this spider with
scrapy crawl atlantic_firearms_crawler
crawls the site but never parses any items. I think this is because CrawlSpider apparently has its own definition of parse, and I'm messing that up somehow.
When I change callback='parse' to callback='parse_item' and rename the parse method in AtlanticFirearmsSpider to parse_item, it works wonderfully: it crawls the whole site and parses items successfully. But if I then try to run my original atlantic_firearms spider again, it errors out with NotImplementedError, apparently because Spider-based spiders really expect the parsing method to be defined as parse.
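For reference, the renamed variant looks roughly like this (a sketch of the change described above, not the full spiders; scrapy's base Spider.parse raises NotImplementedError when it is not overridden, which is why the original spider breaks after the rename):

class AtlanticFirearmsSpider(Spider):
    # ... same as before, but the item-loading callback is now named parse_item ...
    def parse_item(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        return l.load_item()
    # there is no parse() here anymore, so running this spider on its own
    # falls back to Spider.parse(), which raises NotImplementedError


class AtlanticFirearmsCrawlSpider(CrawlSpider, AtlanticFirearmsSpider):
    name = "atlantic_firearms_crawler"
    start_urls = ["http://www.atlanticfirearms.com"]
    rules = (
        Rule(SgmlLinkExtractor(allow=['detail.html']), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion'])),
    )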
What's the best way to reuse my logic between these spiders, so that I can both supply a JSON array of start_urls and do whole-site crawling?
You can avoid multiple inheritance here.
Combine both spiders into a single one. If start_urls is passed from the command line, it behaves like a regular spider; otherwise it behaves like a CrawlSpider:
from scrapy import Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.linkextractors import LinkExtractor
import json
class AtlanticFirearmsSpider(CrawlSpider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls=None, *args, **kwargs):
        if start_urls:
            # URLs passed from the command line: parse each of them directly
            # and disable the crawling rules (regular spider behaviour)
            self.start_urls = json.loads(start_urls)
            self.rules = []
            self.parse = self.parse_response
        else:
            # no URLs given: crawl the whole site via the rules (CrawlSpider behaviour)
            self.start_urls = ["http://www.atlanticfirearms.com/"]
            self.rules = [
                Rule(LinkExtractor(allow=['detail.html']), callback='parse_response'),
                Rule(LinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion']))
            ]
        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)

    def parse_response(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        product = l.load_item()
        return product
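With the combined spider, both styles of invocation from the question should work under the single atlantic_firearms name:

# whole-site crawl (rules are active, CrawlSpider behaviour)
scrapy crawl atlantic_firearms

# targeted crawl of specific URLs (rules disabled, plain Spider behaviour)
scrapy crawl atlantic_firearms -a start_urls='["http://www.atlanticfirearms.com/component/virtuemart/shipping-rifles/ak-47-receiver-aam-47-detail.html"]'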
Or, alternatively, just extract the logic inside the parse() method into a library function and call it from two unrelated, separate spiders.
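A minimal sketch of that second option, assuming a hypothetical shared module foo/parsing.py (parse_product and the module name are illustrative, not part of the original code):

# foo/parsing.py - hypothetical shared module holding the item-loading logic
from scrapy.contrib.loader import ItemLoader
from foo.items import AtlanticFirearmsItem

def parse_product(response):
    # load the product item from a detail page
    l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
    return l.load_item()

# foo/spiders/atlantic_firearms.py - both spiders delegate to the shared function
from scrapy.spider import Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from foo.parsing import parse_product
import json

class AtlanticFirearmsSpider(Spider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls='[]', *args, **kwargs):
        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)
        self.start_urls = json.loads(start_urls)

    def parse(self, response):
        return parse_product(response)

class AtlanticFirearmsCrawlSpider(CrawlSpider):
    name = "atlantic_firearms_crawler"
    allowed_domains = ["atlanticfirearms.com"]
    start_urls = ["http://www.atlanticfirearms.com"]
    rules = (
        Rule(LinkExtractor(allow=['detail.html']), callback='parse_item'),
        Rule(LinkExtractor(deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion'])),
    )

    def parse_item(self, response):
        return parse_product(response)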