Running Splash server and Scrapy spiders on the same EC2 instance
I'm deploying a web scraping application made up of Scrapy spiders that scrape content from websites as well as screenshot webpages with the Splash JavaScript rendering service. I want to deploy the whole application to a single EC2 instance. But for the application to work, I have to run the Splash server from its Docker image at the same time that I run my spiders. How do I run multiple processes on one EC2 instance? Any advice on best practices would be greatly appreciated.
Total noob question. I found that the best way to run the Splash server and Scrapy spiders on an EC2 instance with this setup is a bash script scheduled to run as a cronjob. Here's the bash script I came up with:
#!/bin/bash
# Change to proper directory to run Scrapy spiders.
cd /home/ec2-user/project_spider/project_spider
# Activate my virtual environment.
source /home/ec2-user/venv/python36/bin/activate
# Create a shell variable to store the date at runtime.
LOGDATE=$(date +%Y%m%dT%H%M%S)
# Spin up a Splash instance from the Docker image.
sudo docker run -d -p 8050:8050 -p 5023:5023 scrapinghub/splash --max-timeout 3600
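# Optional addition (not part of the original script): Splash can take a few
# seconds to start accepting connections, so a crawl launched immediately
# after `docker run` may fail. A minimal sketch that polls Splash's _ping
# endpoint until it responds before moving on:
until curl -sf http://localhost:8050/_ping > /dev/null; do
    sleep 1
done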
# Scrape first site and store dated log file in logs directory.
scrapy crawl anhui --logfile /home/ec2-user/project_spider/project_spider/logs/anhui_spider/anhui_spider_$LOGDATE.log
...
# Spin down the Splash instance: stop and remove the container.
sudo docker rm $(sudo docker stop $(sudo docker ps -a -q --filter ancestor=scrapinghub/splash --format="{{.ID}}"))
# Exit virtual environment.
deactivate
# Send an email to confirm the cronjob ran successfully.
# Note that sending email from EC2 is difficult and you cannot use 'MAILTO'
# in your cronjob without setting up something like Postfix or Sendmail.
# Using Mailgun is an easy way around that.
curl -s --user 'api:<YOURAPIHERE>' \
https://api.mailgun.net/v3/<YOURDOMAINHERE>/messages \
-F from='<YOURDOMAINADDRESS>' \
-F to=<RECIPIENT> \
-F subject='Cronjob Run Successfully' \
-F text='Cronjob completed.'
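For completeness, here is a sketch of the crontab entry that schedules the script. The script path, schedule, and log path below are placeholders rather than my actual configuration; this example runs the job daily at 02:00 and appends all output to a log file:

# m h dom mon dow command
0 2 * * * /bin/bash /home/ec2-user/run_spiders.sh >> /home/ec2-user/cron.log 2>&1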