
Dropping duplicate items from Scrapy pipeline

My Scrapy crawler collects data from the PTT website and writes the scraped data into a Google spreadsheet with gspread. The spider parses the latest 40 posts on PTT every day, and I now want to drop duplicate data from those 40 posts: for example, if the post_title or post_link is the same as yesterday's, that post should not be parsed into the Google spreadsheet again.
I know I should use DropItem in Scrapy, but honestly I don't know how to fix my code (I am a very new Python beginner) and would like to ask for help with this. Thanks.

This is my ptt spider code:

    # -*- coding: utf-8 -*-
    import scrapy
    # from scrapy.exceptions import CloseSpider
    from myFirstScrapyProject.items import MyfirstscrapyprojectItem
    
    class PttSpider(scrapy.Spider):
        count_page = 1
        name = 'ptt'
        allowed_domains = ['www.ptt.cc/']
        start_urls = ['https://www.ptt.cc/bbs/e-shopping/search?q=%E8%9D%A6%E7%9A%AE']+['https://www.ptt.cc/bbs/e-seller/search?q=%E8%9D%A6%E7%9A%AE']
        # start_urls = ['https://www.ptt.cc/bbs/e-shopping/index.html']
    
        def parse(self, response):
            items = MyfirstscrapyprojectItem()
            for q in response.css('div.r-ent'):
                items['push']=q.css('div.nrec > span.h1::text').extract_first()
                items['title']=q.css('div.title > a::text').extract_first()
                items['href']=q.css('div.title> a::attr(href)').extract_first()
                items['date']=q.css('div.meta > div.date ::text').extract_first()
                items['author']=q.css('div.meta > div.author ::text').extract_first()
                yield(items)

and this is my pipeline:

    from myFirstScrapyProject.exporters import GoogleSheetItemExporter
    from scrapy.exceptions import DropItem
    
    class MyfirstscrapyprojectPipeline(object):
        def open_spider(self, spider):
            self.exporter = GoogleSheetItemExporter()
            self.exporter.start_exporting()
    
        def close_spider(self, spider):
            self.exporter.finish_exporting()
    
        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item

Thanks to sharmiko, I rewrote it, but it doesn't seem to work. What should I fix?

    from myFirstScrapyProject.exporters import GoogleSheetItemExporter
    from scrapy.exceptions import DropItem
    
    class MyfirstscrapyprojectPipeline(object):
    
        def open_spider(self, spider):
            self.exporter = GoogleSheetItemExporter()
            self.exporter.start_exporting()
    
        def close_spider(self, spider):
            self.exporter.finish_exporting()
    
    #    def process_item(self, item, spider):
    #        self.exporter.export_item(item)
    #        return item
    
    #class DuplicatesTitlePipeline(object):
        def __init__(self):
            self.article = set()
    
        def process_item(self, item, spider):
            href = item['href']
            if href in self.article:
                raise DropItem('duplicates href found %s', item)
            self.exporter.export_item(item)
            return(item)

This is the code for exporting to Google Sheets:

    import gspread
    from oauth2client.service_account import ServiceAccountCredentials
    from scrapy.exporters import BaseItemExporter
    
    class GoogleSheetItemExporter(BaseItemExporter):
        def __init__(self):
            scope = ['https://spreadsheets.google.com/feeds',
                     'https://www.googleapis.com/auth/drive']
            credentials = ServiceAccountCredentials.from_json_keyfile_name('pythonupload.json', scope)
            gc = gspread.authorize(credentials)
            self.spreadsheet = gc.open('Community')
            self.worksheet = self.spreadsheet.get_worksheet(1)
    
        def export_item(self, item):
            self.worksheet.append_row([item['push'], item['title'],
                                       item['href'], item['date'], item['author']])

You should modify the process_item function to check for duplicate items and, if one is found, drop it:

    from scrapy.exceptions import DropItem
    ...
    def process_item(self, item, spider):
        if [your duplicate check logic goes here]:
            raise DropItem('Duplicate element found')
        else:
            self.exporter.export_item(item)
            return item

Dropped items are no longer passed to other pipeline components. You can read more about item pipelines here.
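
In the rewritten pipeline above, the check never fires because the href is never added to the set, so `href in self.article` is always False. Below is a minimal sketch of how the corrected pipeline could look, assuming the GoogleSheetItemExporter and item fields shown earlier (the seen_hrefs name is just illustrative):

    from myFirstScrapyProject.exporters import GoogleSheetItemExporter
    from scrapy.exceptions import DropItem
    
    class MyfirstscrapyprojectPipeline(object):
    
        def open_spider(self, spider):
            # hrefs exported during this crawl; used to detect repeats
            self.seen_hrefs = set()
            self.exporter = GoogleSheetItemExporter()
            self.exporter.start_exporting()
    
        def close_spider(self, spider):
            self.exporter.finish_exporting()
    
        def process_item(self, item, spider):
            href = item['href']
            if href in self.seen_hrefs:
                # dropped items are not exported and go no further in the pipeline
                raise DropItem('Duplicate href found: %s' % href)
            # remember the href so any later item with the same link is dropped
            self.seen_hrefs.add(href)
            self.exporter.export_item(item)
            return item

Note that an in-memory set only catches duplicates within a single run of the spider. To skip posts that were already exported on a previous day, the hrefs already stored in the spreadsheet would have to be loaded first, for example in open_spider with gspread's worksheet.col_values() on the column that holds the links, and used to pre-populate the set.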