How to stop my crawler from logging duplicates?
I would like to know how to stop it from logging the same URL more than once.
Here is my code so far:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url=Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain=raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url=raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    f=open("items.txt","w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self,response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            item = MyItem()
            item['url'] = link.url
            self.f.write(item['url']+"\n")
Right now it logs thousands of duplicates of a single link, for example on a vBulletin forum with roughly 250,000 posts.
Edit:
Please note that the crawler will be picking up millions of links, so the check needs to be very fast.
Create a list of the URLs you have already visited and check every URL against it: after you parse a particular URL, add it to the list; before visiting the page behind a newly discovered URL, check whether that URL is already in the list, and either skip it or parse it and add it.
For example:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    items = []  # list with the URLs that have already been logged
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.items:  # check whether this URL was already logged
                self.items.append(link.url)  # compare the URL string, not the Link object, so the same URL with different anchor text is still caught
                # do your job on adding it to a file
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")
Dictionary version (a membership check on a list is O(n), which gets slow once you have millions of links; a dictionary key lookup is O(1) on average):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    items = {}  # dictionary with the logged URLs as keys
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.items:  # check whether this URL was already logged
                self.items[link.url] = 1  # add it as a key; the stored value can be anything
                # do your job on adding it to a file
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")
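Since the dictionary values are never actually used, a plain Python set expresses the same idea more directly and gives the same average O(1) membership test. Here is a minimal sketch of the same spider using a set (the name "seen" is just illustrative, not from the original code):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    seen = set()  # URLs that have already been written out (illustrative name)
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.seen:  # O(1) average membership test
                self.seen.add(link.url)
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")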
P.S. You could also collect the items first and only write them to the file afterwards; a sketch of that approach follows below.
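One way to do that, as a minimal sketch under the same assumptions as above (the "seen" set is again an illustrative name): collect the de-duplicated URLs in memory while crawling and write the file once from the spider's closed() method, which Scrapy calls when the spider finishes.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    seen = set()  # de-duplicated URLs collected during the crawl (illustrative name)

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            self.seen.add(link.url)  # a set silently ignores repeats

    def closed(self, reason):
        # called by Scrapy when the spider finishes: write everything in one go
        with open("items.txt", "w") as f:
            for url in self.seen:
                f.write(url + "\n")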
There are plenty of other improvements that could be made to this code, but I'll leave those for you to explore.