How to run a spider on multiple URLs while concatenating a string to the URLs
I want a spider to run on multiple URLs, but I want to take input from the user, concatenate it to my original URLs, and then have the spider crawl them. This is what I am doing for one of the URLs:
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "gaming"

    def start_requests(self):
        product = input("Enter the item you are looking for")
        yield scrapy.Request(
            url=f'https://www.czone.com.pk/search.aspx?kw={product}',
            callback=self.parse
        )

    def parse(self, response):
        ...  # parsing logic omitted in the question
The above code works perfectly for a single URL. One way to have multiple URLs would be to put a list in start_urls, but in that case the spider returns the error:
"[scrapy.core.engine] ERROR: Error while obtaining start requests ... ValueError: Missing scheme in request url: h"
Please help!
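For context, "Missing scheme in request url: h" usually means Scrapy ended up iterating over a single string character by character rather than over a list of URLs, so the first "URL" it saw was just the letter h. A minimal sketch of the difference, reusing the search URL from the question with a hardcoded example keyword:

import scrapy

class ProductsSpider(scrapy.Spider):
    name = "gaming"

    # Wrong: a bare string is iterated per character by the default
    # start_requests(), which raises "Missing scheme in request url: h"
    # start_urls = 'https://www.czone.com.pk/search.aspx?kw=laptop'

    # Right: start_urls should be a list of complete URLs
    start_urls = ['https://www.czone.com.pk/search.aspx?kw=laptop']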
Based on your question, here is a solution.
My code:
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "games"
    # Each input() call prompts the user for a product name when the module loads
    product = input("laptop")
    product2 = input("desktop")
    product3 = input("cameras")

    def start_requests(self):
        urls = [
            f'https://www.czone.com.pk/search.aspx?kw={self.product}',
            f'https://www.czone.com.pk/search.aspx?kw={self.product2}',
            f'https://www.czone.com.pk/search.aspx?kw={self.product3}',
        ]
        for url in urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        pass
And the same with an alternative.
Code:
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "games2"
    # Note: the list here is only the prompt text; input() still reads a single line
    product = input(["laptop","desktop","cameras"])

    def start_requests(self):
        yield scrapy.Request(
            url=f'https://www.czone.com.pk/search.aspx?kw={self.product}',
            callback=self.parse
        )

    def parse(self, response):
        pass
Output:
laptop
desktop
cameras
['laptop', 'desktop', 'cameras']
2021-08-12 16:53:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.czone.com.pk/search.aspx?kw=> (referer: None)
2021-08-12 16:53:39 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-12 16:53:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 312,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 19982,
'downloader/response_count': 1,
'downloader/response_status_count/200
Check this code:
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "gaming"

    def __init__(self, product='', **kwargs):
        self.start_urls = [
            f'https://www.czone.com.pk/search.aspx?kw={product}',
            f'https://pcfanatics.pk/search?type=product&q={product}',
            f'https://gtstore.pk/searchresults.php?inputString={product}',
        ]
        super().__init__(**kwargs)

    def start_requests(self):
        for s_url in self.start_urls:
            yield scrapy.Request(
                url=s_url,
                callback=self.parse,
            )

    def parse(self, response):
        print(self.name)
        # ... do parse things ...
The correct way to get input into a scrapy spider is to use the -a option at run time. For example, to run this spider you would use:
scrapy crawl gaming -a product='foo'
or
scrapy runspider <spider_filename> -a product='foo'
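If you need several products in a single run with this approach, one possibility (not part of the original answer, just a sketch under that assumption) is to pass a comma-separated value to -a and split it in __init__, letting Scrapy's default start_requests iterate the resulting start_urls:

import scrapy

class ProductsSpider(scrapy.Spider):
    name = "gaming"

    def __init__(self, product='', **kwargs):
        # hypothetical extension: -a product='laptop,desktop,cameras'
        products = [p.strip() for p in product.split(',') if p.strip()]
        self.start_urls = [
            f'https://www.czone.com.pk/search.aspx?kw={p}' for p in products
        ]
        super().__init__(**kwargs)

    def parse(self, response):
        print(self.name)

This would be run as: scrapy crawl gaming -a product='laptop,desktop,cameras'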
Your URL error is probably due to bad formatting. Using
f'https://www.czone.com.pk/search.aspx?kw={product}',
f'https://pcfanatics.pk/search?type=product&q={product}',
f'https://gtstore.pk/searchresults.php?inputString={product}',
I did not encounter any problems.
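As a side note (not from the answers above), if the user-supplied product name can contain spaces or other special characters, it is safer to URL-encode it before interpolating it into the query string; a minimal sketch using the standard library:

from urllib.parse import quote_plus

product = 'gaming laptop'
url = f'https://www.czone.com.pk/search.aspx?kw={quote_plus(product)}'
print(url)  # https://www.czone.com.pk/search.aspx?kw=gaming+laptop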