Single Scrapy Project vs. Multiple Projects

I'm in a dilemma about how to store all of my spiders. These spiders will be fed into Apache NiFi, invoked from the command line with the items read from stdin. I also plan to have a subset of these spiders return single-item results via scrapyrt on a separate web server. I will need to create spiders across many different projects with different item models. They will all have similar settings (such as using the same proxy).

My question is: what is the best way to structure my Scrapy projects?

  1. Put all spiders in the same repository. This gives an easy way to create base classes for item loaders and item pipelines.
  2. Group the spiders for each project I am working on into their own repository. This has the advantage of keeping each project focused and stopping any one repository from getting too big. The downside is that common code, settings, spider monitoring (spidermon), and base classes cannot be shared. Despite some duplication, this feels the cleanest.
  3. Package only the spiders I plan to run non-realtime in the NiFi repository, and the realtime ones in another repository. This has the advantage of keeping the spiders with the projects that will actually use them, but it still centralizes/convolutes which spiders are used with which projects.

It feels like the right answer is #2. Spiders tied to a specific program should live in their own Scrapy project, just as when you create a web service for project A, you don't say "oh, I can just throw all of project B's service endpoints into the same service, because that's where all my services will live anyway," even if some settings end up duplicated. Arguably, some shared code/classes could be distributed through a separate package.
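To make that concrete, something like this hypothetical scrapy_common package is what I have in mind; the package name, the BaseSpider class, and the proxy address are all made up for the example:

# scrapy_common/base.py -- sketch of a shared package every project would install as a dependency
import scrapy

class BaseSpider(scrapy.Spider):
    """Holds the settings all my projects share."""
    custom_settings = {
        'DOWNLOAD_DELAY': 2,  # placeholder value
    }

    def proxied_request(self, url, callback):
        # every spider goes through the same proxy, so the helper lives here
        return scrapy.Request(url, callback=callback, meta={'proxy': 'http://myproxy:8080'})

Each project would then depend on that package and subclass BaseSpider instead of scrapy.Spider, so shared settings and base classes stay in one place even though the projects live in separate repositories.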

What do you think? How do you all structure your Scrapy projects to maximize reusability? Where do you draw the line between keeping spiders in the same project and splitting them into separate projects? Do you base it on your item models or on the data source?

First of all, when I write a path like '/path', it's because I'm an Ubuntu user. Adjust it if you are a Windows user; that's just a matter of the file system.

A light example

Let's say you want to scrape 2 (or more) different websites. The first one is a swimsuit retail site. The second one is about the weather. You want to scrape both because you want to observe the link between swimsuit prices and the weather, in order to anticipate a lower purchase price.

Note that in pipelines.py I will use Mongo collections, because that is what I use and I don't need SQL for the moment. If you don't know Mongo, consider that a collection is the equivalent of a table in a relational database.

The Scrapy project could look like this:

spiderswebsites.py, where you can write as many spiders as you want:

import scrapy
from ..items import SwimItem, WeatherItem
# if you have trouble importing from the parent directory, you can do:
# import sys
# sys.path.append('/path/parentDirectory')

class SwimSpider(scrapy.Spider):
    name = "swimsuit"
    start_urls = ['https://www.swimsuit.com']

    def parse(self, response):
        price = response.xpath('//span[@class="price"]/text()').extract()
        model = response.xpath('//span[@class="model"]/text()').extract()
        ... # and so on
        item = SwimItem()  # needs to be instantiated, hence the ()
        item['price'] = price
        item['model'] = model
        ... # and so on
        return item

class WeatherSpider(scrapy.Spider):
    name = "weather"
    start_urls = ['https://www.weather.com']

    def parse(self, response):
        temperature = response.xpath('//span[@class="temp"]/text()').extract()
        cloud = response.xpath('//span[@class="cloud_perc"]/text()').extract()
        ... # and so on
        item = WeatherItem()  # needs to be instantiated, hence the ()
        item['temperature'] = temperature
        item['cloud'] = cloud
        ... # and so on
        return item

items.py, where you can write as many item models as you want:

import scrapy
class SwimItem(scrapy.Item):
    price = scrapy.Field()
    stock = scrapy.Field()
    ...
    model = scrapy.Field()

class WeatherItem(scrapy.Item):
    temperature = scrapy.Field()
    cloud = scrapy.Field()
    ...
    pressure = scrapy.Field()

pipelines.py, where I use Mongo:

import pymongo
from .items import SwimItem, WeatherItem

class ScrapePipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod  # this is a decorator, a powerful tool in Python
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGODB_URL'),
            mongo_db=crawler.settings.get('MONGODB_DB', 'default-test')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # pick the collection from the item type, not from the spider
        if isinstance(item, SwimItem):
            collection_name = 'swimwebsite'
        elif isinstance(item, WeatherItem):
            collection_name = 'weatherwebsite'
        self.db[collection_name].insert_one(dict(item))
        return item

So when you look at my example project, you see that the structure does not depend on the item model at all, since you can use several kinds of items in the same project. In the pattern above, the advantage is that you can keep the same configuration in settings.py if you want to. But don't forget that you can also "customize" the command for your spider: if you want a spider to run with settings slightly different from the defaults, you can pass them on the command line, e.g. scrapy crawl spider -s DOWNLOAD_DELAY=35 instead of the 25 you wrote in settings.py.
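For completeness, here is what a matching settings.py could look like. This is only a sketch: I'm assuming a project module called scrape, and the MONGODB_URL / MONGODB_DB keys simply have to match whatever from_crawler() reads:

# settings.py -- example values only
BOT_NAME = 'scrape'

SPIDER_MODULES = ['scrape.spiders']
NEWSPIDER_MODULE = 'scrape.spiders'

DOWNLOAD_DELAY = 25  # shared by every spider, overridable with -s DOWNLOAD_DELAY=35

ITEM_PIPELINES = {
    'scrape.pipelines.ScrapePipeline': 300,
}

# read by ScrapePipeline.from_crawler()
MONGODB_URL = 'mongodb://localhost:27017'
MONGODB_DB = 'swim-weather'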

Functional programming

Also, my example here is light. In practice you are rarely interested in the raw data: you need a lot of processing, which means a lot of lines of code. To improve the readability of your code you can create functions in modules. But beware of spaghetti code.

functions.py, a custom module:

import re

def cloud_temp(response):  # for WeatherSpider
    """Returns a tuple containing the percentage of clouds and the temperature."""
    temperature = response.xpath('//span[@class="temp"]/text()').extract_first()  # returns a str such as " 12°C"
    cloud = response.xpath('//span[@class="cloud_perc"]/text()').extract_first()  # returns a str such as "30%"
    # treatments: you want to record them as integers
    temperature = int(re.search(r'[0-9]+', temperature).group())  # returns an int such as 12
    cloud = int(re.search(r'[0-9]+', cloud).group())  # returns an int such as 30
    return (cloud, temperature)

Which gives in spiders.py:

import scrapy
from ..items import SwimItem, WeatherItem
from .functions import *
...
class WeatherSpider(scrapy.Spider):
    name = "weather"
    start_urls = ['https://www.weather.com']

    def parse(self, response):
        cloud, temperature = cloud_temp(response)  # this is shorter than the previous version
        ... # and so on
        item = WeatherItem()  # needs to be instantiated, hence the ()
        item['temperature'] = temperature
        item['cloud'] = cloud
        ... # and so on
        return item

Moreover, it brings a considerable improvement to debugging. Let's say I want to open a scrapy shell session.

$ scrapy shell 'https://www.weather.com'
...
# I check in sys.path whether the directory containing my functions.py module is present.
>>> import sys
>>> sys.path  # returns a list of paths
>>> # if the directory is not present:
>>> sys.path.insert(0, '/path/directory')
>>> # now I can import my module in this session and test it in the shell,
>>> # while I modify functions.py itself
>>> from functions import *
>>> cloud_temp(response)  # checking whether it returns what I want

That is more convenient than copy-pasting a piece of code over and over. And since Python is a great language for functional programming, you should benefit from it. That is why I say "more generally, any pattern is valid as long as you limit the number of lines, improve readability, and limit bugs." The more readable your code is, the more you limit bugs. The fewer lines you write (for example by avoiding copy-pasting the same treatment for different variables), the fewer bugs you will have: when you correct a function itself, you correct everything that depends on it.

Now, if you are not very comfortable with functional programming, I can understand that you would make several projects for different item models. You can work with your current skills and improve them, then refine your code over time.

Jakob, in the Google Groups thread "Single Scrapy Project vs. Multiple Projects for Various Sources", recommends:

whether spiders should go into the same project is mainly determined by the type of data they scrape, and not by where the data comes from.

Say you are scraping user profiles from all your target sites, then you may have an item pipeline that cleans and validates user avatars, and one that exports them into your "avatars" database. It makes sense to put all spiders into the same project. After all, they all use the same pipelines because the data always has the same shape no matter where it was scraped from. On the other hand, if you are scraping questions from Stack Overflow, user profiles from Wikipedia, and issues from Github, and you validate/process/export all of these data types differently, it would make more sense to put the spiders into separate projects.

In other words, if your spiders have common dependencies (e.g. they share item definitions/pipelines/middlewares), they probably belong into the same project; if each of them has their own specific dependencies, they probably belong into separate projects.
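To make the first case concrete, here is a minimal sketch of what Jakob describes; the item fields and the pipeline logic are invented for the example. Every spider in the project yields the same UserProfileItem, so one pipeline serves all of them:

import scrapy
from scrapy.exceptions import DropItem

class UserProfileItem(scrapy.Item):
    # same shape no matter which site it was scraped from
    username = scrapy.Field()
    avatar_url = scrapy.Field()

class AvatarPipeline(object):
    """Shared by every spider, because the data always has the same shape."""
    def process_item(self, item, spider):
        if not item.get('avatar_url'):
            raise DropItem('profile without avatar: %r' % item)
        # ... clean/validate the avatar here, then export it to the "avatars" database
        return item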

Pablo Hoffman, one of the developers of Scrapy, replied in another thread, "Scrapy spider vs project":

...recommend to keep all spiders into the same project to improve code reusability (common code, helper functions, etc).

We've used prefixes on spider names at times, like film_spider1, film_spider2, actor_spider1, actor_spider2, etc. And sometimes we also write spiders that scrape multiple item types, as it makes more sense when there is a big overlap on the pages crawled.
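And a short sketch of that last idea, with invented selectors and item classes: one spider yielding two item types, because films and the actors appearing in them live on the same pages:

import scrapy

class FilmItem(scrapy.Item):
    title = scrapy.Field()

class ActorItem(scrapy.Item):
    name = scrapy.Field()

class FilmActorSpider(scrapy.Spider):
    """Scrapes two item types in one pass because the crawled pages overlap heavily."""
    name = "film_actor"
    start_urls = ['https://www.example-films.com']

    def parse(self, response):
        for film in response.xpath('//div[@class="film"]'):
            film_item = FilmItem()
            film_item['title'] = film.xpath('.//h2/text()').extract_first()
            yield film_item
            for actor_name in film.xpath('.//li[@class="actor"]/text()').extract():
                actor_item = ActorItem()
                actor_item['name'] = actor_name
                yield actor_item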