Giving a list as an argument to a Scrapy scraper

I would like to be able to give a list of urls as an argument to my scrapy scraper, so that I can iterate over it periodically and avoid 403 errors. At the moment I don't think Scrapy allows me to do that.

scrapy crawl nosetime -o results.jl ['/pinpai/10036120-yuguoboshi-hugo-boss.html', '/pinpai/10094164-kedi-coty.html', '/pinpai/10021965-gaotiye-jean-paul-gaultier.html', '/pinpai/10088596-laerfu-laolun-ralph-lauren.html']

Or, alternatively, a file of urls.

For now the urls are hardcoded in my spider:

import scrapy
from ..pipelines import NosetimeScraperPipeline
import time

headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; TencentTraveler 4.0; Trident/4.0; SLCC1; Media Center PC 5.0; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30618)'}
base_url = 'https://www.nosetime.com'

class NosetimeScraper(scrapy.Spider):
    name = "nosetime"

    urls = ['/pinpai/10036120-yuguoboshi-hugo-boss.html', # I want to get rid of this
            '/pinpai/10094164-kedi-coty.html',            # unless I can use something like time.sleep(12*60*60)
            '/pinpai/10021965-gaotiye-jean-paul-gaultier.html', # for each before being taken as argument
            '/pinpai/10088596-laerfu-laolun-ralph-lauren.html']

    start_urls = ['https://www.nosetime.com' + url for url in urls]
    base_url = 'https://www.nosetime.com'

    def parse(self, response):
        # proceed to other pages of the listings
        urls = response.css('a.imgborder::attr(href)').getall()
        for url in urls:
            print("url: ", url)
            yield scrapy.Request(url=base_url + url, callback=self.parse)

        # now that we have the urls we need to know if the dire are the things we can scrape
        pipeline = NosetimeScraperPipeline()
        perfume = pipeline.process_response(response)
        try:
            if perfume['enname']:
                print("Finally are going to store: ", perfume['enname'])
                pipeline.save_in_mongo(perfume)
        except KeyError:
            pass

The Scrapy documentation has a very simple example that you can adapt to take the name of a file containing your list of URLs:

scrapy crawl myspider -a urls_file=URLs.txt

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, urls_file=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.urls_file = urls_file

    def start_requests(self):
        # read the file and send each URL to your existing parse() callback
        with open(self.urls_file) as f:
            for line in f:
                yield scrapy.Request(url=line.strip(), callback=self.parse)
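
If you would rather pass the list itself on the command line instead of going through a file, a single -a argument can carry the URLs as one comma-separated string, since Scrapy hands -a values to the spider as plain strings. The snippet below is only a rough sketch of that idea, using an argument name of my own choosing (urls) together with the base_url from your spider:

scrapy crawl nosetime -a urls="/pinpai/10036120-yuguoboshi-hugo-boss.html,/pinpai/10094164-kedi-coty.html"

and in the spider:

import scrapy

class NosetimeScraper(scrapy.Spider):
    name = "nosetime"
    base_url = 'https://www.nosetime.com'

    def __init__(self, urls=None, *args, **kwargs):
        super(NosetimeScraper, self).__init__(*args, **kwargs)
        # -a values arrive as plain strings, so split the comma-separated list
        # and prepend the site prefix to build start_urls
        self.start_urls = [self.base_url + u for u in urls.split(',')] if urls else []

Quoting the value keeps the shell from splitting it at special characters.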