How to stop my crawler from logging duplicates?

How can I stop it from logging the same URL more than once?

Here is my code so far:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url=Field()

class someSpider(CrawlSpider):
  name = "My script"
  domain=raw_input("Enter the domain:\n")
  allowed_domains = [domain]
  starting_url=raw_input("Enter the starting url with protocol:\n")
  start_urls = [starting_url]
  f=open("items.txt","w")

  rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)


  def parse_obj(self,response):
    for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
        item = MyItem()
        item['url'] = link.url
        self.f.write(item['url']+"\n")

Right now it produces thousands of duplicates for a single link, for example on a vBulletin forum with around 250,000 posts.

Edit: Please note that the crawler will be getting millions of links, so I need the check to be very fast.

Keep a list of the URLs you have already visited and check every URL against it: after parsing a particular URL, add it to the list; before visiting a page at a newly discovered URL, check whether that URL is already in the list, then either parse it and add it, or skip it.

That is:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url=Field()

class someSpider(CrawlSpider):
  name = "My script"
  domain=raw_input("Enter the domain:\n")
  allowed_domains = [domain]
  starting_url=raw_input("Enter the starting url with protocol:\n")
  start_urls = [starting_url]
  items=[] #list with your URLs
  f=open("items.txt","w")

  rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)


  def parse_obj(self,response):
    for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
        if link.url not in self.items: #check if this URL has already been logged
            self.items.append(link.url)   #remember it so it is only written once
            #do your job on adding it to a file
            item = MyItem()
            item['url'] = link.url
            self.f.write(item['url']+"\n")

Dictionary version:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url=Field()

class someSpider(CrawlSpider):
  name = "My script"
  domain=raw_input("Enter the domain:\n")
  allowed_domains = [domain]
  starting_url=raw_input("Enter the starting url with protocol:\n")
  start_urls = [starting_url]
  items={} #dictionary with your URLs as keys
  f=open("items.txt","w")

  rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)


  def parse_obj(self,response):
    for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
        if link.url not in self.items: #check if this URL has already been logged
            self.items[link.url]=1  #add it as a key if not seen yet (the stored value can be anything)
            #do your job on adding it to a file
            item = MyItem()
            item['url'] = link.url
            self.f.write(item['url']+"\n")

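Since the edit above mentions millions of links: a Python set gives the same average O(1) membership test as the dictionary keys while stating the intent more directly. A minimal sketch of that variant, following the same structure as the versions above:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url=Field()

class someSpider(CrawlSpider):
  name = "My script"
  domain=raw_input("Enter the domain:\n")
  allowed_domains = [domain]
  starting_url=raw_input("Enter the starting url with protocol:\n")
  start_urls = [starting_url]
  seen=set() #set of URLs already written; lookups stay fast even with millions of entries
  f=open("items.txt","w")

  rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

  def parse_obj(self,response):
    for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
        if link.url not in self.seen: #skip URLs that have already been logged
            self.seen.add(link.url)
            item = MyItem()
            item['url'] = link.url
            self.f.write(item['url']+"\n")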
P.S. You could also collect the items first and then write them to the file afterwards.
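For example, a minimal sketch of that idea: collect the unique URLs in memory while crawling and write the file once in the spider's closed() method, which Scrapy calls when the crawl finishes (same prompts and rules as above):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class someSpider(CrawlSpider):
  name = "My script"
  domain=raw_input("Enter the domain:\n")
  allowed_domains = [domain]
  starting_url=raw_input("Enter the starting url with protocol:\n")
  start_urls = [starting_url]
  seen=set() #unique URLs collected during the crawl

  rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

  def parse_obj(self,response):
    for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
        self.seen.add(link.url) #the set silently ignores URLs it already contains

  def closed(self, reason):
    #called once when the spider finishes; write all collected URLs in one go
    with open("items.txt","w") as f:
        for url in self.seen:
            f.write(url+"\n")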

There are many other improvements that could be made to this code, but I leave those for you to explore.