Scrapy queue same spider with different variable
I have a list of URL endings in a .csv file that I want to scrape, like this:
run
123
124
125
I want to run them all with the same spider, in an ordered queue: run MySpider for 123, and once it finishes, run MySpider for 124, and so on.
Something like:
from csv import DictReader
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
with open('run.csv') as rows:
    for row in DictReader(rows):
        process.crawl(numbers(row['run']))
process.start()
but running one after the other. I need to pass the variable row['run'] from the .csv file into the spider so it can use it.
Here is an example spider:
import scrapy

class MySpider(scrapy.Spider):
    # row['run'] is the value from the .csv that somehow has to reach the spider
    low = row['run']
    high = row['run'] + 1000
    start_urls = ['http://www.canada411.ca/res/%s/' % page for page in xrange(low, high)]

    def parse(self, response):
        yield {
            'Number': row['run'],
            'Name': SCRAPPED  # stands in for the value scraped from the page
        }

process = CrawlerProcess()
with open('run.csv') as rows:
    for row in DictReader(rows):
        process.crawl(numbers)
process.start()
Here is an example:
https://doc.scrapy.org/en/latest/topics/practices.html
from csv import DictReader

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    with open('run.csv') as rows:
        for row in DictReader(rows):
            # runner.crawl() returns a Deferred; yielding it waits until the
            # current spider finishes before the next row is started
            yield runner.crawl(numbers, areacode=row['area code'])
    reactor.stop()

crawl()
reactor.run()  # the script blocks here until the last crawl is finished
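For this to work, the spider has to accept areacode as an argument: the keyword arguments given to runner.crawl() are passed through to the spider's constructor. Below is a minimal sketch of what the spider (the numbers / MySpider class from the snippets above) could look like; the response.css('title::text') extraction is only a stand-in for the question's SCRAPPED placeholder:

import scrapy

class MySpider(scrapy.Spider):
    name = 'numbers'

    def __init__(self, areacode=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # areacode arrives as a string from runner.crawl(..., areacode=...)
        self.areacode = areacode
        low = int(areacode)
        high = low + 1000
        self.start_urls = ['http://www.canada411.ca/res/%s/' % page
                           for page in xrange(low, high)]

    def parse(self, response):
        yield {
            'Number': self.areacode,
            'Name': response.css('title::text').extract_first()  # stand-in extraction
        }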
I used run_spider() from the scrapydo package to achieve this:
https://pypi.python.org/pypi/scrapydo/0.2.2
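For completeness, a rough sketch of how that might look, under the assumption (per the scrapydo page) that scrapydo.setup() has to be called once before running spiders, that run_spider() blocks until the spider has finished, and that extra keyword arguments are forwarded to the spider:

from csv import DictReader

import scrapydo

scrapydo.setup()  # assumed: initializes the background reactor once

with open('run.csv') as rows:
    for row in DictReader(rows):
        # assumed: run_spider() blocks, so the rows are crawled one after another
        scrapydo.run_spider(MySpider, areacode=row['area code'])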