How to stop my crawler from logging duplicates?

How can I stop it from logging the same URL more than once?

Here is my code so far:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url=Field()

class someSpider(CrawlSpider):
  name = "My script"
  domain=raw_input("Enter the domain:\n")
  allowed_domains = [domain]
  starting_url=raw_input("Enter the starting url with protocol:\n")
  start_urls = [starting_url]
  f=open("items.txt","w")

  rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)


  def parse_obj(self,response):
    for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
        item = MyItem()
        item['url'] = link.url
        self.f.write(item['url']+"\n")

Right now it produces thousands of duplicates for a single link, for example on a vBulletin forum with around 250,000 posts.

Edit: Please note that the crawler will be getting millions of links, so I need the check to be very fast.

Keep a list of the URLs you have already visited and check every URL against it: after parsing a particular URL, add it to the list; before visiting a page at a newly discovered URL, check whether that URL is already in the list, then either parse it and add it, or skip it.

That is:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url=Field()

class someSpider(CrawlSpider):
  name = "My script"
  domain=raw_input("Enter the domain:\n")
  allowed_domains = [domain]
  starting_url=raw_input("Enter the starting url with protocol:\n")
  start_urls = [starting_url]
  items=[] #list with your URLs
  f=open("items.txt","w")

  rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)


  def parse_obj(self,response):
    for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
        if link.url not in self.items: #check if this URL has already been logged
            self.items.append(link.url)   #remember it so it is only written once
            #do your job on adding it to a file
            item = MyItem()
            item['url'] = link.url
            self.f.write(item['url']+"\n")

Dictionary version:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url=Field()

class someSpider(CrawlSpider):
  name = "My script"
  domain=raw_input("Enter the domain:\n")
  allowed_domains = [domain]
  starting_url=raw_input("Enter the starting url with protocol:\n")
  start_urls = [starting_url]
  items={} #dictionary with your URLs as keys
  f=open("items.txt","w")

  rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)


  def parse_obj(self,response):
    for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
        if link.url not in self.items: #check if this URL has already been logged
            self.items[link.url]=1  #add it as a key if not seen yet (the stored value can be anything)
            #do your job on adding it to a file
            item = MyItem()
            item['url'] = link.url
            self.f.write(item['url']+"\n")

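Since the edit above mentions millions of links: a Python set gives the same average O(1) membership test as the dictionary keys while stating the intent more directly. A minimal sketch of that variant, following the same structure as the versions above:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url=Field()

class someSpider(CrawlSpider):
  name = "My script"
  domain=raw_input("Enter the domain:\n")
  allowed_domains = [domain]
  starting_url=raw_input("Enter the starting url with protocol:\n")
  start_urls = [starting_url]
  seen=set() #set of URLs already written; lookups stay fast even with millions of entries
  f=open("items.txt","w")

  rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

  def parse_obj(self,response):
    for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
        if link.url not in self.seen: #skip URLs that have already been logged
            self.seen.add(link.url)
            item = MyItem()
            item['url'] = link.url
            self.f.write(item['url']+"\n")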
P.S. You could also collect the items first and then write them to the file afterwards.
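For example, a minimal sketch of that idea: collect the unique URLs in memory while crawling and write the file once in the spider's closed() method, which Scrapy calls when the crawl finishes (same prompts and rules as above):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class someSpider(CrawlSpider):
  name = "My script"
  domain=raw_input("Enter the domain:\n")
  allowed_domains = [domain]
  starting_url=raw_input("Enter the starting url with protocol:\n")
  start_urls = [starting_url]
  seen=set() #unique URLs collected during the crawl

  rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

  def parse_obj(self,response):
    for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
        self.seen.add(link.url) #the set silently ignores URLs it already contains

  def closed(self, reason):
    #called once when the spider finishes; write all collected URLs in one go
    with open("items.txt","w") as f:
        for url in self.seen:
            f.write(url+"\n")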

There are many other improvements that could be made to this code, but I leave those for you to explore.