Python, Scrapy, Pipeline: function "process_item" not getting called

I have a very simple piece of code, shown below. The scraping works fine; I can see all the print statements producing the correct data. In the pipeline, the initialization works correctly. However, the process_item function is never called, because the print statement at the start of that function never executes.

Spider: comosham.py

import scrapy
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from activityadvisor.items import ComoShamLocation
from activityadvisor.items import ComoShamActivity
from activityadvisor.items import ComoShamRates
import re


class ComoSham(Spider):
    name = "comosham"
    allowed_domains = ["www.comoshambhala.com"]
    start_urls = [
        "http://www.comoshambhala.com/singapore/classes/schedules",
        "http://www.comoshambhala.com/singapore/about/location-contact",
        "http://www.comoshambhala.com/singapore/rates-and-offers/rates-classes",
        "http://www.comoshambhala.com/singapore/rates-and-offers/rates-classes/rates-private-classes"
    ]

    def parse(self, response):  
        category = (response.url)[39:44]
        print 'in parse'
        if category == 'class':
            pass
            """self.gen_req_class(response)"""
        elif category == 'about':
            print 'about to call parse_location'
            self.parse_location(response)
        elif category == 'rates':
            pass
            """self.parse_rates(response)"""
        else:
            print 'Cant find appropriate category! check check check!! Am raising Level 5 ALARM - You are a MORON :D'


    def parse_location(self, response):
        print 'in parse_location'       
        item = ComoShamLocation()
        item['category'] = 'location'
        loc = Selector(response).xpath('((//div[@id = "node-2266"]/div/div/div)[1]/div/div/p//text())').extract()
        item['address'] = loc[2]+loc[3]+loc[4]+(loc[5])[1:11]
        item['pin'] = (loc[5])[11:18]
        item['phone'] = (loc[9])[6:20]
        item['fax'] = (loc[10])[6:20]
        item['email'] = loc[12]
        print item['address'],item['pin'],item['phone'],item['fax'],item['email']
        return item

Items file:

import scrapy
from scrapy.item import Item, Field

class ComoShamLocation(Item):
    address = Field()
    pin = Field()
    phone = Field()
    fax = Field()
    email = Field()
    category = Field()

Pipeline file:

import csv

class ComoShamPipeline(object):
    def __init__(self):
        self.locationdump = csv.writer(open('./scraped data/ComoSham/ComoshamLocation.csv','wb'))
        self.locationdump.writerow(['Address','Pin','Phone','Fax','Email'])


    def process_item(self,item,spider):
        print 'processing item now'
        if item['category'] == 'location':
            print item['address'],item['pin'],item['phone'],item['fax'],item['email']
            self.locationdump.writerow([item['address'],item['pin'],item['phone'],item['fax'],item['email']])
        else:
            pass

Your problem is that you never actually yield the item. parse_location returns an item to parse, but parse never yields that item.

The solution is to replace:

self.parse_location(response)

with:

yield self.parse_location(response)
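
For context, a minimal sketch of what the corrected branch looks like inside parse (only the 'about' branch shown; the rest of the spider is assumed unchanged):

def parse(self, response):
    category = response.url[39:44]
    if category == 'about':
        # yield, rather than just call, so the returned item
        # actually reaches the item pipeline
        yield self.parse_location(response)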

More specifically, process_item is never called if no items are yielded.

Use ITEM_PIPELINES in settings.py:

ITEM_PIPELINES = ['project_name.pipelines.pipeline_class']
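For this project that might look like the following (project and class names taken from the code above; note that recent Scrapy versions expect a dict mapping each pipeline to an order value rather than a list):

ITEM_PIPELINES = {
    'activityadvisor.pipelines.ComoShamPipeline': 300,
}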

Adding to the answers above:

1. Remember to add the following line to settings.py! ITEM_PIPELINES = {'[YOUR_PROJECT_NAME].pipelines.[YOUR_PIPELINE_CLASS]': 300}
2. Yield the item when your spider runs! yield my_item

This solved my problem: all items were being dropped before my pipeline was called, so process_item() was never invoked, even though open_spider and close_spider were. My fix was simply to change the order so that this pipeline runs before the other pipelines that drop items.
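
As an illustration of that ordering (the second pipeline name here is hypothetical): lower numbers run earlier, so give the CSV-writing pipeline a smaller value than any pipeline that raises DropItem:

ITEM_PIPELINES = {
    'activityadvisor.pipelines.ComoShamPipeline': 100,        # runs first, writes the CSV
    'activityadvisor.pipelines.DropIncompletePipeline': 200,  # hypothetical pipeline that drops items
}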

Scrapy Pipeline Documentation.

Remember that Scrapy only calls Pipeline.process_item() when there is an item to process!