Scrapy: custom loggers appear to be overriding Scrapy loggers and silencing Scrapy logger output

I'm trying to integrate my own loggers into my Scrapy project. The desired outcome is for logs from both my custom loggers and the Scrapy loggers to be emitted to stderr at the configured log levels. Here's what I've observed:

My project structure:

.
└── scraper
    ├── logging.cfg
    ├── main.py
    ├── scraper
    │   ├── __init__.py
    │   ├── forums.py
    │   ├── items.py
    │   ├── pipelines.py
    │   ├── runner.py
    │   ├── settings.py
    │   ├── spiders
    │   │   ├── __init__.py
    │   │   ├── forum.py
    │   │   └── someforum.py
    │   └── utils.py
    └── scrapy.cfg

The program is designed to be invoked from main.py.

Contents of main.py:

import os

from scraper import runner, utils
from scrapy.utils.project import get_project_settings

logger = utils.get_logger("launcher")


def main():
    """Entrypoint to the web scraper."""

    # Guarantee that we have the correct reference to the settings file
    os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "scraper.settings")

    logger.info("Initializing spiders")
    runner.run_spiders(get_project_settings())


if __name__ == "__main__":
    main()

Contents of runner.py:

from scraper import forums, utils
from scraper.spiders.someforum import SomeForum
from scrapy.crawler import CrawlerProcess

logger = utils.get_logger("launcher")


def run_spiders(project_settings):

    # install_root_handler=False stops Scrapy from installing its own root
    # log handler, so output stays under the control of logging.cfg
    process = CrawlerProcess(project_settings, install_root_handler=False)
    logger.info(
        f"Initializing spider for {forums.someforum.get('forum_attrs').get('name')}"
    )
    process.crawl(
        SomeForum,
        **forums.someforum.get("forum_attrs"),
        post_attrs=forums.someforum.get("post_attrs"),
    )

    process.start()

Contents of logging.cfg:

[loggers]
keys=root,launcher,forum,item,pipeline

[logger_root]
level=DEBUG
handlers=basic

[logger_launcher]
level=DEBUG
handlers=basic
qualname=launcher
propagate=0

[logger_forum]
...

[logger_item]
...

[logger_pipeline]
...

# --

[logger_scrapy]
level=DEBUG

# -- 

[handlers]
keys=basic

[formatters]
keys=basic

[handler_basic]
class=StreamHandler
level=DEBUG
formatter=basic

[formatter_basic]
format=%(asctime)s - [%(name)s] - %(levelname)s - %(message)s

Note that I want the ability to configure module-level logging detail externally to the Python files. That's why there's a dummy logger section for Scrapy in logging.cfg: I can go fetch its level in the settings.py file and pass it down to the underlying CrawlSpider.

Contents of settings.py:

import configparser
import os

# LOG_CONF points to `logging.cfg` in the project root directory
LOG_CONF = os.path.join(os.path.dirname(os.path.dirname(__file__)), "logging.cfg")

config = configparser.RawConfigParser()
config.read(LOG_CONF)

LOG_LEVEL = config["logger_scrapy"].get("level")
LOG_FORMAT = config["formatter_basic"].get("format")

BOT_NAME = "scraper"
SPIDER_MODULES = ["scraper.spiders"]
NEWSPIDER_MODULE = "scraper.spiders"
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 8
ITEM_PIPELINES = {
    "scraper.pipelines.PostPipeline": 300,
}
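Since LOG_LEVEL and LOG_FORMAT are recognized Scrapy setting names, the values parsed out of logging.cfg become part of the project settings handed to CrawlerProcess. As a quick sanity check (a sketch, assuming SCRAPY_SETTINGS_MODULE is set as in main.py):

from scrapy.utils.project import get_project_settings

settings = get_project_settings()

# Both values originate from logging.cfg via settings.py
print(settings.get("LOG_LEVEL"))   # DEBUG, from [logger_scrapy]
print(settings.get("LOG_FORMAT"))  # %(asctime)s - [%(name)s] - %(levelname)s - %(message)s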

The utils.get_logger() function:

import logging

from settings import LOG_CONF
from logging import config

def get_logger(logger_name):
    """Returns logger"""

    # LOG_CONF just points to `logging.cfg` in root directory
    config.fileConfig(LOG_CONF)
    return logging.getLogger(logger_name)

If I run main.py with my launcher logger, the log output contains something like this:

2021-11-12 16:37:23,994 - [launcher] - INFO - Initializing spiders
2021-11-12 16:37:24,016 - [launcher] - INFO - Initializing spider for someforum
2021-11-12 16:37:24,045 - [scrapy.extensions.telnet] - INFO - Telnet Password: 62df42034f3a7f09
2021-11-12 16:37:24,070 - [scraper.spiders.forums] - INFO - Creating scrape time log directory for forum: 'someforum'
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Fetching post title
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Fetching post currency
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Fetching post price
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Searching post title for tags
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Fetching post hash key

If I run main.py but remove my custom loggers from main.py and runner.py, the log output contains something like this:

2021-11-12 16:39:37,734 - [scrapy.utils.log] - INFO - Scrapy 2.5.1 started (bot: scraper)
2021-11-12 16:39:37,735 - [scrapy.utils.log] - INFO - Versions: lxml 4.6.4.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.2 (default, Dec 21 2020, 15:06:04) - [Clang 12.0.0 (clang-1200.0.32.29)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 35.0.0, Platform macOS-10.15.7-x86_64-i386-64bit
2021-11-12 16:39:37,735 - [scrapy.utils.log] - DEBUG - Using reactor: twisted.internet.selectreactor.SelectReactor
2021-11-12 16:39:37,748 - [scrapy.crawler] - INFO - Overridden settings:
{'BOT_NAME': 'scraper',
'CONCURRENT_REQUESTS': 8,
'LOG_FORMAT': '%(asctime)s - [%(name)s] - %(levelname)s - %(message)s',
'NEWSPIDER_MODULE': 'scraper.spiders',
'SPIDER_MODULES': ['scraper.spiders']}
2021-11-12 16:39:37,769 - [scrapy.extensions.telnet] - INFO - Telnet Password: 2baaa967d9d68933
2021-11-12 16:39:37,792 - [scrapy.middleware] - INFO - Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']

The Scrapy logging from my call to CrawlerProcess.crawl() is emitted to stderr again, which is great. I just don't want my logs and Scrapy's logs to be mutually exclusive.

My gut tells me that I've messed with the root logger, or that my own loggers are overriding the Scrapy loggers. I can't figure out how to fix this, so any thoughts/suggestions would be appreciated. Thanks in advance!

I finally figured it out. TL;DR: by default, calling fileConfig() disables all existing loggers, and that's exactly how I was instantiating my logger objects in the get_logger() function. Changing the call to fileConfig(conf, disable_existing_loggers=False) fixed the problem, and I can now see logging from all loggers.


I decided to dig into the Python and Scrapy source code, and I noticed that every logger object instantiated by the Scrapy source had disabled=True, which clarified why nothing was being logged by Scrapy.
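You can check this yourself with a small debugging snippet (a sketch; logging.root.manager.loggerDict is the internal registry of all logger names, holding both Logger and PlaceHolder objects):

import logging

# Print every registered logger that has been switched off
for name, obj in logging.root.manager.loggerDict.items():
    if isinstance(obj, logging.Logger) and obj.disabled:
        print(name)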

The next question was, "Why do all the Scrapy loggers have disabled=True?" Google came to the rescue and pointed me to a thread where someone noted that calling fileConfig() disables all loggers that exist at the time of the call.

I had originally assumed that the disable_existing_loggers parameter defaulted to False. According to the Python docs, it turns out I had it backwards.
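A minimal sketch of the behavior, assuming the logging.cfg shown above is in the working directory:

import logging
from logging import config

# A logger created *before* fileConfig() runs, just as Scrapy's are at import time
pre_existing = logging.getLogger("scrapy.crawler")

# Default behavior: disable_existing_loggers=True silences it
config.fileConfig("logging.cfg")
print(pre_existing.disabled)  # True

# Passing disable_existing_loggers=False leaves existing loggers enabled
config.fileConfig("logging.cfg", disable_existing_loggers=False)
print(pre_existing.disabled)  # False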

So now I've updated the get_logger() function in utils.py to:

import logging 

from settings import LOG_CONF
from logging import config

def get_logger(logger_name):
    """Returns logger"""

    config.fileConfig(LOG_CONF, disable_existing_loggers=False)
    return logging.getLogger(logger_name)

And everything is running smoothly.