Passing domain into Scrapy Web crawler

Question

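I am building a web crawler where the user first enters a URL into a script they run, and that script then runs the crawler with the domain that was entered. I still have some cleanup to do, but I need to get a prototype running. I have written the code, and what happens is that the crawler script keeps asking for the URL. I tried passing it in through the terminal command, but I don't think my code is compatible with that. Is there a better way to pass in the domain that the end user enters from another script?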

# First script
import os

def userInput():
    user_input = raw_input("Please enter URL. Please do not include http://: ")
    os.system("scrapy runspider crawler_prod.py")

# Crawler Script

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider

from run_first import userInput

userInput()

class InputSpider(CrawlSpider):
        name = "Input"
        user_input = ""
        allowed_domains = [user_input]
        start_urls = ["http://" + user_input + "/"]

        # allow=() is used to match all links
        rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')
        ]

        def parse_item(self, response):
            x = HtmlXPathSelector(response)
            filename = "output.txt"
            open(filename, 'ab').write(response.url + "\n")

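I run it simply by running the first script from the terminal. Some help figuring out how to pass the domain in as a variable would be great.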

Answer 1

Use the start_requests method instead of start_urls:

from scrapy.http import Request

def start_requests(self):
    yield Request(url=self.user_input)  # user_input is set from the -a argument

...

Also remove the allowed_domains class variable, so that the spider can allow all the domains it needs.
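If dropping allowed_domains entirely turns out to be too broad for a CrawlSpider that follows every link, one possible alternative is to derive the domain from the value the user typed. The sketch below is only an illustration in Python 3: crawl_targets is a hypothetical helper name, not part of Scrapy, and it assumes the input is either a bare domain or a full URL. Its two return values could be assigned to self.allowed_domains and self.start_urls in the spider's __init__.

```python
from urllib.parse import urlparse


def crawl_targets(user_input):
    """Return (allowed_domains, start_urls) for a bare domain or a full URL.

    Hypothetical helper for illustration; not a Scrapy API.
    """
    user_input = user_input.strip()
    # Accept both "example.com" and "http://example.com/path".
    if "://" not in user_input:
        user_input = "http://" + user_input
    # hostname is lowercased and excludes any port number.
    host = urlparse(user_input).hostname
    if not host:
        raise ValueError("no host found in %r" % user_input)
    return [host], [user_input]
```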

With these changes you can call the spider with scrapy crawl myspider -a user_input="http://example.com"
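To tie this back to the two-script setup in the question, here is a minimal sketch of how the first script could forward the value with -a instead of having the crawler import it. It is written for Python 3 and makes some assumptions: build_crawl_command and run_crawl are hypothetical names, the spider file is the crawler_prod.py from the question, and the scrapy executable is on the PATH.

```python
import subprocess


def build_crawl_command(domain, spider_file="crawler_prod.py"):
    """Build the argument list that passes the domain to the spider via -a."""
    # The question asks users to omit "http://", so it is added here.
    url = "http://" + domain.strip().strip("/") + "/"
    return ["scrapy", "runspider", spider_file, "-a", "user_input=" + url]


def run_crawl(domain):
    """Launch Scrapy as a child process and return its exit code."""
    # Passing a list avoids shell quoting problems with the user's input.
    return subprocess.call(build_crawl_command(domain))
```

The first script would then call something like run_crawl(input("Please enter URL. Please do not include http://: ")). The crawler script should no longer import run_first or call userInput(), since that call at import time is what makes every Scrapy run prompt for the URL again.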

python

scrapy

scrapy-spider