How can I decouple the two components of my Python application?

Question

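I am trying to learn Python development, and I have been reading about architecture patterns and code design because I want to stop hacking things together and start developing properly. I am implementing a web crawler, and I know its structure is problematic, as you will see, but I don't know how to fix it.

The crawlers return a list of actions that insert data into a MongoDB instance.

This is the general structure of my application: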

Spiders

crawlers.py
connections.py
utils.py
__init__.py

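crawlers.py implements a class named Crawler, and each specific crawler inherits from it. Each Crawler has an attribute table_name and a method crawl. In connections.py, I implemented a pymongo driver to connect to the database. Its write method expects a crawler as a parameter. Now comes the tricky part: crawler2 depends on the results of crawler1, so I end up with something like this: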

from pymongo import InsertOne

# Crawler is the base class described above.
class Crawler1(Crawler):
    def __init__(self):
        super().__init__('Crawler 1', 'table_A')

    def crawl(self):
        # Returns a list of InsertOne operations (body omitted).
        ...

class Crawler2(Crawler):
    def __init__(self):
        super().__init__('Crawler 2', 'table_B')

    def crawl(self, list_of_codes):
        # Returns a list of InsertOne operations after crawling the list of codes/links (body omitted).
        ...

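Then, in my connections module, I create a class that needs a crawler.

from pymongo import MongoClient

class MongoDriver:
    def __init__(self):
        self.db = MongoClient(...)

    def write(self, crawler, **kwargs):
        # The driver calls the crawler itself, so it must know each crawler's arguments.
        self.db[crawler.table_name].bulk_write(crawler.crawl(**kwargs))

    def get_list_of_codes(self):
        query = {}
        return [x['field'] for x in self.db.find(query)]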

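So here comes the (biggest) problem (I think there are many others, some of which I can barely grasp and others I am still completely blind to): the implementation of my connection needs context about the crawlers! For example: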

mongo_driver = MongoDriver()
crawler1 = Crawler1()
crawler2 = Crawler2()
mongo_driver.write(crawler1)
mongo_driver.write(crawler2, list_of_codes=mongo_driver.get_list_of_codes())

How can I solve this? Is there anything else particularly worrying in this structure? Thanks for the feedback!

Answer 1

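Problem 1: MongoDriver knows too much about your crawlers. You should separate the driver from crawler1 and crawler2. I'm not sure what your crawl function returns, but I assume it is a list of objects of type A.

You could use an object such as CrawlerService to manage the dependency between MongoDriver and Crawler. This separates the driver's responsibility for writing from the crawler's responsibility for crawling. The service also manages the order of operations, which in some cases may be considered good enough.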

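from typing import List

class Repository:

    def write(self, for_table: str, objects: 'List[A]'):
        self.db[for_table].bulk_write(objects)

class CrawlerService:

    def __init__(self, repository: Repository, crawlers: List[Crawler]):
        # Keep the collaborators so that crawl() can use them.
        self.repository = repository
        self.crawlers = crawlers

    def crawl(self):
        crawler1, crawler2 = self.crawlers
        # The service, not the driver, calls the crawler and hands the result to the repository.
        result = crawler1.crawl()
        self.repository.write(crawler1.table_name, result)
        ... # work with crawler2 and result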
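To make the ordering role of the service concrete, here is a minimal runnable sketch. It is not the asker's real code: InMemoryRepository, its find_codes method, the "code" field, and the placeholder crawl results are all assumptions made for illustration, and plain dicts stand in for pymongo InsertOne operations.

```python
from typing import Dict, List


class InMemoryRepository:
    """Hypothetical stand-in for a MongoDB-backed repository."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[dict]] = {}

    def write(self, for_table: str, objects: List[dict]) -> None:
        # Append the documents to the named table, creating it if needed.
        self.tables.setdefault(for_table, []).extend(objects)

    def find_codes(self, for_table: str) -> List[int]:
        # Assumed query: read back the "code" field of every stored document.
        return [row["code"] for row in self.tables.get(for_table, [])]


class Crawler1:
    table_name = "table_A"

    def crawl(self) -> List[dict]:
        # Placeholder data instead of real crawling.
        return [{"code": 1}, {"code": 2}]


class Crawler2:
    table_name = "table_B"

    def crawl(self, list_of_codes: List[int]) -> List[dict]:
        # Placeholder: one document per code produced by the first crawler.
        return [{"code": c, "detail": f"page for {c}"} for c in list_of_codes]


class CrawlerService:
    """Owns the order of operations; the repository never sees a crawler."""

    def __init__(self, repository, crawler1, crawler2) -> None:
        self.repository = repository
        self.crawler1 = crawler1
        self.crawler2 = crawler2

    def crawl(self) -> None:
        first = self.crawler1.crawl()
        self.repository.write(self.crawler1.table_name, first)
        codes = self.repository.find_codes(self.crawler1.table_name)
        second = self.crawler2.crawl(list_of_codes=codes)
        self.repository.write(self.crawler2.table_name, second)


repository = InMemoryRepository()
CrawlerService(repository, Crawler1(), Crawler2()).crawl()
```

With this split, the repository only receives a table name and a list of documents, so swapping the in-memory version for a real database wrapper should not require touching the crawlers.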

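Problem 2: Crawler1 and Crawler2 are almost the same; they differ only in what happens when we call the crawl function. With the DRY principle in mind, you can separate the crawling algorithm into objects such as strategies and have a single Crawler depend on one of them (composition).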

from abc import ABC, abstractmethod
from typing import List

class CrawlStrategy(ABC):

    @abstractmethod
    def crawl(self) -> 'List[A]':
        pass

class CrawlStrategyA(CrawlStrategy):

    def crawl(self) -> 'List[A]':
        ...

class CrawlStrategyB(CrawlStrategy):

    def __init__(self, codes: List[int]):
        self.__codes = codes

    def crawl(self) -> 'List[A]':
        ...


class Crawler(ABC):

    def __init__(self, name: str, strategy: 'CrawlStrategy'):
        self.__name = name
        self.__strategy = strategy

    def crawl(self) -> 'List[A]':
        # Delegate the actual crawling to the injected strategy.
        return self.__strategy.crawl()

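By doing this, the structure of the Crawler (e.g. the table name) lives in only one place, and you can extend it later.

Problem 3: From here, you have several ways to improve the overall design. You can remove the CrawlerService by creating new strategies that depend on your database connection. To express that one strategy depends on another (e.g. crawler1 produces results for crawler2), you can compose two strategies, for example: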
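A small usage sketch of this composition follows. The concrete strategy names (LinkListStrategy, DetailPageStrategy), the table_name parameter on Crawler, and the hard-coded documents are assumptions made for illustration only; real strategies would perform the actual crawling.

```python
from abc import ABC, abstractmethod
from typing import List


class CrawlStrategy(ABC):
    @abstractmethod
    def crawl(self) -> List[dict]:
        ...


class LinkListStrategy(CrawlStrategy):
    """Hypothetical first step: produces the codes to follow."""

    def crawl(self) -> List[dict]:
        return [{"code": 10}, {"code": 20}]  # placeholder data


class DetailPageStrategy(CrawlStrategy):
    """Hypothetical second step: needs the codes from the first one."""

    def __init__(self, codes: List[int]) -> None:
        self._codes = codes

    def crawl(self) -> List[dict]:
        return [{"code": c, "detail": f"detail-{c}"} for c in self._codes]


class Crawler:
    """Single crawler class; only the injected strategy varies."""

    def __init__(self, name: str, table_name: str, strategy: CrawlStrategy) -> None:
        self.name = name
        self.table_name = table_name
        self._strategy = strategy

    def crawl(self) -> List[dict]:
        return self._strategy.crawl()


crawler1 = Crawler("Crawler 1", "table_A", LinkListStrategy())
codes = [doc["code"] for doc in crawler1.crawl()]
crawler2 = Crawler("Crawler 2", "table_B", DetailPageStrategy(codes))
```

Note that both crawlers now expose the same crawl() signature with no arguments, because the dependency on the codes moved into the strategy's constructor.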


class StrategyA(CrawlStrategy):

    def __init__(self, other: CrawlStrategy, database: 'DB'):
        self.__other = other
        self.__db = database

    def crawl(self) -> 'List[A]':
        # Run the strategy this one depends on and persist its result first.
        result = self.__other.crawl()
        self.__db.write(result)
        xs = self.__db.find(...)
        # do something with xs
        ...

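Of course, this is a simplified example, but it removes the need for a single mediator between the database connection and the crawlers, and it gives you more flexibility. The overall design is also easier to test, because all you have to do is unit test the strategy objects (and, thanks to dependency injection, you can easily mock the database connection).

From this point, the next steps for improving the overall design depend heavily on the complexity of the implementation and on how much flexibility you need in general.

PS: Besides the Strategy pattern, you can also try other alternatives. Depending on how many crawlers you have and their general structure, you may need to use the Decorator pattern.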
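As a rough illustration of the testing point, the sketch below unit tests a composed strategy with a mocked database using unittest.mock from the standard library. FixedStrategy, DependentStrategy, the find_codes method, and the document shape are all assumptions invented for this example.

```python
from typing import List
from unittest.mock import Mock


class FixedStrategy:
    """Hypothetical strategy returning canned documents."""

    def __init__(self, documents: List[dict]) -> None:
        self._documents = documents

    def crawl(self) -> List[dict]:
        return list(self._documents)


class DependentStrategy:
    """Runs another strategy first, stores its result, then builds on it."""

    def __init__(self, other, database) -> None:
        self._other = other
        self._db = database

    def crawl(self) -> List[dict]:
        result = self._other.crawl()
        self._db.write(result)
        codes = self._db.find_codes()  # assumed query method on the database object
        return [{"code": c, "detail": f"detail-{c}"} for c in codes]


def test_dependent_strategy_uses_injected_database() -> None:
    database = Mock()  # no real MongoDB connection is needed
    database.find_codes.return_value = [1, 2]
    first = FixedStrategy([{"code": 1}, {"code": 2}])

    documents = DependentStrategy(first, database).crawl()

    # The first strategy's result must have been persisted exactly once.
    database.write.assert_called_once_with([{"code": 1}, {"code": 2}])
    assert documents == [
        {"code": 1, "detail": "detail-1"},
        {"code": 2, "detail": "detail-2"},
    ]


test_dependent_strategy_uses_injected_database()
```

Because the database is passed in through the constructor, the test never has to patch a module-level connection.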
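The answer does not spell out how the Decorator pattern would apply here, so the following is only one possible reading: wrappers that share the strategy interface and add behavior around an inner strategy. All class names, the "code" field, and the deduplication and logging behaviors are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List


class CrawlStrategy(ABC):
    @abstractmethod
    def crawl(self) -> List[dict]:
        ...


class StaticStrategy(CrawlStrategy):
    """Hypothetical strategy returning canned documents."""

    def __init__(self, documents: List[dict]) -> None:
        self._documents = documents

    def crawl(self) -> List[dict]:
        return list(self._documents)


class DeduplicatingStrategy(CrawlStrategy):
    """Decorator: drops documents whose "code" was already seen."""

    def __init__(self, wrapped: CrawlStrategy) -> None:
        self._wrapped = wrapped

    def crawl(self) -> List[dict]:
        seen = set()
        unique = []
        for doc in self._wrapped.crawl():
            if doc["code"] not in seen:
                seen.add(doc["code"])
                unique.append(doc)
        return unique


class LoggingStrategy(CrawlStrategy):
    """Decorator: records how many documents the wrapped strategy produced."""

    def __init__(self, wrapped: CrawlStrategy, log: List[str]) -> None:
        self._wrapped = wrapped
        self._log = log

    def crawl(self) -> List[dict]:
        documents = self._wrapped.crawl()
        self._log.append(f"crawled {len(documents)} documents")
        return documents


log: List[str] = []
inner = StaticStrategy([{"code": 1}, {"code": 1}, {"code": 2}])
strategy = LoggingStrategy(DeduplicatingStrategy(inner), log)
documents = strategy.crawl()
```

Since every wrapper is itself a CrawlStrategy, a Crawler that accepts a strategy would not need to know whether it received a plain or a decorated one.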

python

architecture

design-patterns

web-crawler

python-3.x