How to reset the standard dupefilter in Scrapy
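For some reasons, I would like to reset the list of seen URLs that Scrapy maintains internally, at some point in my spider code.
I know that by default Scrapy uses the RFPDupeFilter class and that it keeps a set of fingerprints.
How can this set be cleared from within spider code?
To be more specific: I'd like to clear the set in a custom idle_handler method that is called by the spider_idle signal.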
You can access the dupefilter object currently used by the spider via self.crawler.engine.slot.scheduler.df.
from scrapy import signals, Spider


class ExampleSpider(Spider):
    name = "example"
    start_urls = ["http://www.example.com/"]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # scrapy.xlib.pydispatch is not available in newer Scrapy releases,
        # so the handler is connected through the crawler's signal manager
        crawler.signals.connect(spider.reset_dupefilter, signal=signals.spider_idle)
        return spider

    def reset_dupefilter(self, spider):
        # clear the fingerprints stored by the dupefilter when idle
        self.crawler.engine.slot.scheduler.df.fingerprints = set()

    def parse(self, response):
        pass
You can reset the fingerprint set by assigning it a new, empty set:
self.crawler.engine.slot.scheduler.df.fingerprints = set()
Put the following code in your spider.
# requires `from scrapy import signals` at the top of the module

def reset_filter(self, spider):
    # replace the stored fingerprints with an empty set
    self.crawler.engine.slot.scheduler.df.fingerprints = set()

# override the default from_crawler class method to access Scrapy core components
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    # zero-argument super() works here once the method sits inside the spider class
    spider = super().from_crawler(crawler, *args, **kwargs)
    # call reset_filter whenever the spider_idle signal fires
    crawler.signals.connect(spider.reset_filter, signal=signals.spider_idle)
    return spider
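Both answers reach the fingerprint set through the same attribute chain, so it can help to see that chain wrapped in one small helper. The following is only a minimal sketch: `clear_seen_fingerprints` is a hypothetical name, not part of Scrapy, and the demo uses `SimpleNamespace` stand-ins instead of a real crawler so that it runs without Scrapy installed. It assumes the `engine.slot.scheduler.df.fingerprints` path quoted above, which is internal and may not be present in every setup, hence the defensive `getattr` calls.

```python
from types import SimpleNamespace


def clear_seen_fingerprints(crawler):
    """Empty the dupefilter's fingerprint set and return how many were dropped.

    Hypothetical helper: walks the attribute path quoted in the answers
    (crawler.engine.slot.scheduler.df.fingerprints) and returns 0 if any
    link in that chain is missing.
    """
    obj = crawler
    for name in ("engine", "slot", "scheduler", "df", "fingerprints"):
        obj = getattr(obj, name, None)
        if obj is None:
            return 0
    dropped = len(obj)
    obj.clear()  # empties the existing set in place
    return dropped


# Stand-in objects that only mimic the attribute layout; not real Scrapy classes.
fake_dupefilter = SimpleNamespace(fingerprints={"fp-1", "fp-2"})
fake_crawler = SimpleNamespace(
    engine=SimpleNamespace(
        slot=SimpleNamespace(scheduler=SimpleNamespace(df=fake_dupefilter))
    )
)

dropped = clear_seen_fingerprints(fake_crawler)
print(dropped, fake_dupefilter.fingerprints)  # prints: 2 set()
```

In a real spider the idle handler would call something like `clear_seen_fingerprints(self.crawler)`. Note that this only touches the in-memory set; if the crawl was started with a job directory, the dupefilter may also have written fingerprints to disk, and those are not affected by this sketch.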