Is there a way to run code after reactor.run() in scrapy?
I'm developing a Scrapy API. One of the problems I ran into is that the Twisted reactor can't be restarted; I fixed that by using CrawlerRunner instead of CrawlerProcess. My spider extracts links from a website and validates them. My problem is that if I add the validation code after reactor.run(), it doesn't work. Here is my code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
from urllib.parse import quote, urlparse

# Renamed from `list` / `list_validate` to avoid shadowing the built-in list.
links = set()
links_to_validate = set()

runner = CrawlerRunner()

class Crawler(CrawlSpider):
    name = "Crawler"
    start_urls = ['https://www.example.com']   # original had the malformed 'https:www.example.com'
    allowed_domains = ['www.example.com']
    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

    def parse_links(self, response):
        # `url` was undefined in the original; derive the base URL from the response instead.
        parts = urlparse(response.url)
        base_url = f'{parts.scheme}://{parts.netloc}'
        href = response.xpath('//a/@href').getall()
        links.add(quote(response.url, safe=':/'))
        for link in href:
            if base_url not in link:
                links.add(quote(response.urljoin(link), safe=':/'))
        for link in links:
            if base_url in link:
                links_to_validate.add(link)

runner.crawl(Crawler)
reactor.run()
If I add the link-validation code after reactor.run(), it never executes. And if I put it before reactor.run(), nothing happens, because the spider hasn't finished crawling all the links yet. What should I do? The validation code itself is fine; I've used it before and it works.
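For context, reactor.run() blocks the calling thread until something calls reactor.stop(); statements after it only execute once the reactor has shut down. A minimal Twisted-only sketch of that behavior, independent of Scrapy:

from twisted.internet import reactor

def task():
    print("runs inside the reactor loop")
    reactor.stop()   # without this, reactor.run() would never return

reactor.callWhenRunning(task)
reactor.run()        # blocks here until reactor.stop() is called
print("runs only after the reactor has shut down")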
For your scraper API you can use Klein.
Klein is a micro-framework for developing production-ready web services with Python. It is 'micro' in that it has an incredibly small API similar to Bottle and Flask.
...
import scrapy
from scrapy.crawler import CrawlerRunner
from klein import Klein

app = Klein()

@app.route('/')
async def hello(request):
    status = list()

    class TestSpider(scrapy.Spider):
        name = 'test'
        start_urls = [
            'https://quotes.toscrape.com/',
            'https://quotes.toscrape.com/page/2/',
            'https://quotes.toscrape.com/page/3/',
            'https://quotes.toscrape.com/page/4/'
        ]

        def parse(self, response):
            """Record the HTTP status of each crawled page."""
            status.append(response.status)

    runner = CrawlerRunner()
    d = await runner.crawl(TestSpider)   # await the Deferred returned by crawl()
    content = str(status)
    return content

@app.route('/h')
def index(request):
    return 'Index Page'

app.run('localhost', 8080)
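Assuming the quotes.toscrape.com pages respond normally, requesting the root route while the script is running returns the collected status codes. A hypothetical client-side check, run from a separate process:

import urllib.request

# Query the Klein endpoint started by app.run() above.
with urllib.request.urlopen('http://localhost:8080/') as resp:
    print(resp.read().decode())   # expected: something like [200, 200, 200, 200]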
We can use d.addCallback(<callback_function>) and d.addErrback(<errback_function>) to run code once the Deferred returned by runner.crawl() fires:
...
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner

runner = CrawlerRunner()
d = runner.crawl(MySpider)   # MySpider: whatever spider class you are running

def finished(result):
    print("finished :D")

def spider_error(err):
    print("Spider error :/")

d.addCallback(finished)
d.addErrback(spider_error)
d.addBoth(lambda _: reactor.stop())   # stop the reactor so reactor.run() returns
reactor.run()
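Applied to the question's Crawler, the same pattern runs the link validation once the crawl has finished and then stops the reactor so the script can exit. validate_links below is a hypothetical stand-in for the asker's existing validation code, which isn't shown:

def validate_links(_result):
    # Stand-in for the actual validation logic over the collected links.
    for link in links_to_validate:
        print('validating:', link)

d = runner.crawl(Crawler)
d.addCallback(validate_links)
d.addErrback(lambda failure: print('Spider error:', failure))
d.addBoth(lambda _: reactor.stop())   # lets reactor.run() return afterwards
reactor.run()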