Python Scrapy: How do you run your spider from a separate file?
I have created a spider in Scrapy which now successfully locates all the text I want.
How do you execute this spider from another Python file? I want to be able to pass it new URLs and store the data it finds in a dictionary and then a DataFrame.
At the moment I can only run it with the terminal command 'scrapy crawl SpiderName'.
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest


class SpiderName(Spider):
    name = 'SpiderName'
    Page = 'https://www.urlname.com'

    def start_requests(self):
        yield SplashRequest(url=self.Page, callback=self.parse,
                            endpoint='render.html',
                            args={'wait': 0.5},
                            )

    def parse(self, response):
        for x in response.css("div.row.list"):
            yield {
                'Entry': x.css("span[data-bind]::text").getall()
            }
Thanks
In the Scrapy documentation, under Common Practices, you can see Run Scrapy from a script:
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    # ... Your spider definition ...

# ... run it ...
process = CrawlerProcess(settings={ ... })
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
If you add your own __init__
class MySpider(scrapy.Spider):

    def __init__(self, urls, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = urls
then you can run it with urls as an argument:
process.crawl(MySpider, urls=['http://books.toscrape.com/', 'http://quotes.toscrape.com/'])
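To also cover the second part of the question (collecting the scraped data into a dictionary and then a DataFrame), here is a minimal sketch. The spider, the run_and_collect helper, and the placeholder parse are made up for illustration; items are gathered with Scrapy's item_scraped signal and then converted with pandas. If your real spider uses scrapy-splash, the Splash-related settings would also need to be included in the settings dict passed to CrawlerProcess.

import pandas as pd
import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = 'MySpider'

    def __init__(self, urls, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = urls

    def parse(self, response):
        # placeholder parse: yield one item per page (adapt to your selectors)
        yield {'Entry': response.css('title::text').getall()}


def run_and_collect(urls):
    items = []  # scraped items are collected here

    def item_scraped(item, response, spider):
        # called once for every item the spider yields
        items.append(dict(item))

    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
    crawler = process.create_crawler(MySpider)
    crawler.signals.connect(item_scraped, signal=signals.item_scraped)
    process.crawl(crawler, urls=urls)
    process.start()  # blocks until the crawl is finished

    return pd.DataFrame(items)


if __name__ == '__main__':
    df = run_and_collect(['http://books.toscrape.com/', 'http://quotes.toscrape.com/'])
    print(df.head())

Note that process.start() runs Twisted's reactor, which cannot be restarted, so this is best called only once per Python process.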