scrapy passing custom_settings to spider from script using CrawlerProcess.crawl()
I am trying to call a spider programmatically from a script. I am unable to override the settings through the constructor using CrawlerProcess. Let me illustrate this with the default spider for scraping quotes from the official scrapy site (last code snippet at official scrapy quotes example spider).
from scrapy import Spider, Request

class QuotesSpider(Spider):
    name = "quotes"

    def __init__(self, somestring, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.somestring = somestring
        self.custom_settings = kwargs

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
Here is the script through which I try to run the quotes spider:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings

def main():
    proc = CrawlerProcess(get_project_settings())

    custom_settings_spider = {
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log'
    }
    proc.crawl('quotes', 'dummyinput', **custom_settings_spider)
    proc.start()
Scrapy Settings are a bit like Python dicts, so you can update the settings object before passing it to CrawlerProcess:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def main():
    s = get_project_settings()
    s.update({
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log'
    })
    proc = CrawlerProcess(s)

    # the overrides now live in the settings object, so there is no need
    # to pass them again as spider keyword arguments
    proc.crawl('quotes', 'dummyinput')
    proc.start()
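A small variation of the same idea, in case you want to be explicit about the overrides outranking values from settings.py: Settings.set() accepts a priority argument, and 'cmdline' is the highest of the built-in levels. This is only a sketch of that option, not something the answer above relies on:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def main():
    s = get_project_settings()
    # 'cmdline' priority outranks project settings, so these values win
    # over anything defined in settings.py
    s.set('FEED_URI', 'quotes.csv', priority='cmdline')
    s.set('LOG_FILE', 'quotes.log', priority='cmdline')

    proc = CrawlerProcess(s)
    proc.crawl('quotes', 'dummyinput')
    proc.start()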
Edit following OP's comments:
Here's a variation using CrawlerRunner, with a new CrawlerRunner per crawl and re-configuring logging at each iteration so that it writes to a different file each time:
import logging
from twisted.internet import reactor, defer

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging, _get_handler
from scrapy.utils.project import get_project_settings


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        page = getattr(self, 'page', 1)
        yield scrapy.Request('http://quotes.toscrape.com/page/{}/'.format(page),
                             self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }


@defer.inlineCallbacks
def crawl():
    s = get_project_settings()
    for i in range(1, 4):
        s.update({
            'FEED_URI': 'quotes%03d.csv' % i,
            'LOG_FILE': 'quotes%03d.log' % i
        })

        # manually configure logging for LOG_FILE
        configure_logging(settings=s, install_root_handler=False)
        logging.root.setLevel(logging.NOTSET)
        handler = _get_handler(s)
        logging.root.addHandler(handler)

        runner = CrawlerRunner(s)
        yield runner.crawl(QuotesSpider, page=i)

        # reset root handler
        logging.root.removeHandler(handler)
    reactor.stop()


crawl()
reactor.run()  # the script will block here until the last crawl call is finished
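Note that _get_handler is a private Scrapy helper. A rough alternative sketch using only the standard library, under the assumption that a plain file handler per iteration is enough (attach_log_file is a hypothetical helper and the format string is just an assumption):

import logging

def attach_log_file(path):
    # add a root-level file handler for this iteration and return it,
    # so it can be removed again after the crawl finishes
    handler = logging.FileHandler(path, encoding='utf-8')
    handler.setFormatter(logging.Formatter(
        '%(asctime)s [%(name)s] %(levelname)s: %(message)s'))
    logging.root.addHandler(handler)
    return handler

# inside the loop above:
#     handler = attach_log_file('quotes%03d.log' % i)
#     yield runner.crawl(QuotesSpider, page=i)
#     logging.root.removeHandler(handler)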
I think you can't override the custom_settings variable of a Spider class when calling it from a script, basically because the settings are loaded before the spider is instantiated.
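For reference, custom_settings normally has to be declared as a class attribute; Scrapy reads it from the class, before __init__ ever runs, which is why assigning self.custom_settings in the constructor has no effect. A minimal sketch with placeholder values:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    # read at class level, before the spider is instantiated, so it cannot
    # be set from inside __init__
    custom_settings = {
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log',
    }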
Now, I don't really see the point of changing the custom_settings variable specifically, as it is only a way to override your default settings, and that is exactly what CrawlerProcess offers too. The following works as expected:
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = 'simple'
    start_urls = ['http://httpbin.org/headers']

    def parse(self, response):
        for k, v in self.settings.items():
            print('{}: {}'.format(k, v))
        yield {
            'headers': response.body
        }


process = CrawlerProcess({
    'USER_AGENT': 'my custom user agent',
    'ANYKEY': 'any value',
})

process.crawl(MySpider)
process.start()
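If you also need to pass spider inputs (like the OP's 'dummyinput'), crawl() forwards any extra positional and keyword arguments to the spider constructor, so inputs and setting overrides can be kept separate. A small sketch reusing MySpider from above ('somestring' is just an illustrative name):

process = CrawlerProcess({
    'USER_AGENT': 'my custom user agent',
})
# extra kwargs are handed to the spider's __init__; the base Spider stores
# them as attributes, so self.somestring is available inside the spider
process.crawl(MySpider, somestring='dummyinput')
process.start()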
It seems you want to have a custom log for each spider. You need to activate logging like this:
from scrapy.utils.log import configure_logging


class MySpider(scrapy.Spider):
    # ... rest of the spider omitted

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        configure_logging({'LOG_FILE': "logs/mylog.log"})
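configure_logging() also honours the other LOG_* settings if you pass them in the same dict, so level and format can be customised per spider the same way; for instance (the values here are only placeholders):

        configure_logging({
            'LOG_FILE': "logs/mylog.log",
            'LOG_LEVEL': 'INFO',
            'LOG_FORMAT': '%(asctime)s [%(name)s] %(levelname)s: %(message)s',
        })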
You can override a setting from the command line:
https://doc.scrapy.org/en/latest/topics/settings.html#command-line-options
For example: scrapy crawl myspider -s LOG_FILE=scrapy.log