Instantiate database connection in scrapy middleware and access it in other modules

I have several spiders in one project that share the same database, and I have different item classes so I can process them correctly in the pipelines and send them to the desired destinations. In my first spider, the database connection is instantiated in the pipeline like this:

import psycopg2
from scrapy.exceptions import NotConfigured


class DatabasePipeline(object):  # the class name is illustrative; use your own pipeline class

    def __init__(self, database, user, password, host, port):
        self.database = database
        self.user = user
        self.password = password
        self.host = host
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection parameters from the DB_SETTINGS dict in settings.py
        db_settings = crawler.settings.getdict("DB_SETTINGS")
        if not db_settings:
            raise NotConfigured
        db = db_settings['database']
        user = db_settings['user']
        password = db_settings['password']
        host = db_settings['host']
        port = db_settings['port']
        return cls(db, user, password, host, port)

    def open_spider(self, spider):
        self.connection = psycopg2.connect(database=self.database, user=self.user,
                                           password=self.password,
                                           host=self.host, port=self.port)
        self.cursor = self.connection.cursor()

    def close_spider(self, spider):
        # close() returns None, so there is nothing useful to assign
        self.cursor.close()
        self.connection.close()
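
For context, the actual writes happen in the pipeline's process_item(), which was omitted above. A minimal sketch of what it can look like, assuming a hypothetical companies table with a name column and an item with a name field:

def process_item(self, item, spider):
    # table and column names are assumptions; adapt them to your own schema
    self.cursor.execute("INSERT INTO companies (name) VALUES (%s)",
                        (item.get('name'),))
    self.connection.commit()
    return item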

This works well, but for my second spider I need to read some data from the database in the spider itself before I can start crawling, and then send the items to the pipeline to save them in the database. I could reuse the same code to instantiate the connection in the spider and stop doing it in the pipeline, but with multiple spiders I don't want to repeat that over and over. I would like to know how to instantiate the database connection in a middleware and access it from both the spiders and the pipelines. I think I can use the same code as above to open the connection, but I don't know how to adapt it so that the cursor and the connection are accessible from the spiders and the pipelines.

This is how I got it working. You can do it in a middleware like this:

## Middleware

from scrapy import signals
from scrapy.exceptions import NotConfigured
import psycopg2


class DBMiddleware(object):

    def __init__(self, db_settings):
        self.db_settings = db_settings

    @classmethod
    def from_crawler(cls, crawler):
        db_settings = crawler.settings.getdict("DB_SETTINGS")
        if not db_settings:  # DB config was not defined in settings.py
            raise NotConfigured

        s = cls(db_settings)
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def spider_opened(self, spider):
        # attach the connection to the spider instance so that the spider
        # (and anything that receives the spider, e.g. pipelines) can use it
        spider.connection = psycopg2.connect(database=self.db_settings['database'],
                                             user=self.db_settings['user'],
                                             password=self.db_settings['password'],
                                             host=self.db_settings['host'],
                                             port=self.db_settings['port'])

    def spider_closed(self, spider):
        spider.connection.close()
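
For the middleware to actually run it also has to be enabled in settings.py; the module path below assumes the class lives in ProjectName/middlewares.py:

SPIDER_MIDDLEWARES = {
    'ProjectName.middlewares.DBMiddleware': 543,
}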

Then you can add this to the spider to access the connection that was just created:

## Spider
import scrapy
from scrapy import signals
from scrapy.loader import ItemLoader

from ..items import Item_profile  # illustrative import path; Item_profile is the project's item class


class MainSpider(scrapy.Spider):

    name = 'main_spider'
    start_urls = ['https://www.example.com']  # start URLs need a scheme

    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # call super() so the spider stays bound to the crawler as usual
        s = super(MainSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def spider_opened(self, spider):
        pass

    def parse(self, response):
        # self.connection was attached by the middleware's spider_opened()
        cursor = self.connection.cursor()
        sql = "SELECT * FROM companies"
        cursor.execute(sql)
        result = cursor.fetchall()
        for element in result:
            loader = ItemLoader(item=Item_profile())
            loader.add_value('name', element[0])
            items = loader.load_item()
            yield items

    def spider_closed(self, spider):
        pass

This works fine if you only need the db connection inside the spider's parse() method, but what I needed was for the connection to be open before parse() runs, so that I could retrieve the links to crawl from the database, since they are stored separately. So I needed the connection inside the spider's spider_opened() method, but the methods fire in this order:

1: #Spider __init__()
2: #Spider spider_opened()
3: #Middleware spider_opened() -->> connection is  created here
4: #Spider parse()
5: #Spider spider_closed()
6: #Middleware spider_closed()

This is logical because, according to the documentation, the main function of a middleware is to sit between the engine and the spider. What we need is a component that is instantiated at scrapy startup, and that would be an Extension. So I created a file called extensions.py at the same level as the middlewares, pipelines, etc., and added the same code as in the middleware:

from scrapy import signals
from scrapy.exceptions import NotConfigured
import psycopg2


class DBExtension(object):

    def __init__(self, db_settings):
        self.db_settings = db_settings

    @classmethod
    def from_crawler(cls, crawler):
        db_settings = crawler.settings.getdict("DB_SETTINGS")
        if not db_settings:
            raise NotConfigured
        s = cls(db_settings)
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def spider_opened(self, spider):
        # same trick as in the middleware: hang the connection off the spider
        spider.connection = psycopg2.connect(database=self.db_settings['database'],
                                             user=self.db_settings['user'],
                                             password=self.db_settings['password'],
                                             host=self.db_settings['host'],
                                             port=self.db_settings['port'])

    def spider_closed(self, spider):
        spider.connection.close()

Then I enabled the extension in settings.py:

EXTENSIONS = {
    'ProjectName.extensions.DBExtension': 400
}
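
Since from_crawler() reads the configuration with crawler.settings.getdict("DB_SETTINGS"), that dict also has to be defined in settings.py; the values below are placeholders:

DB_SETTINGS = {
    'database': 'mydb',
    'user': 'db_user',
    'password': 'db_password',
    'host': 'localhost',
    'port': 5432,
}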

Now you can access this connection in the spider's spider_opened() method via self.connection, and load information from the database before the crawl starts. I don't know whether there is a better way to solve this, but for now it works well enough for me.
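
For illustration, this is roughly what the spider's spider_opened() can now do; the links table and url column are hypothetical stand-ins for wherever your start URLs live:

def spider_opened(self, spider):
    # self.connection was attached by DBExtension before this handler fires
    cursor = self.connection.cursor()
    cursor.execute("SELECT url FROM links")  # hypothetical table
    self.start_urls = [row[0] for row in cursor.fetchall()]
    cursor.close()

A pipeline can reach the same connection through the spider argument it receives, e.g. spider.connection.cursor() inside process_item(), so a single connection really is shared between the spider and the pipelines.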