How to run a spider on multiple URLs while concatenating a string to the URLs
I want a spider to run on multiple URLs, but I want to take input from the user, concatenate it to my original URLs, and then have the spider crawl them. This is what I am doing for one of the URLs:
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "gaming"

    def start_requests(self):
        product = input("Enter the item you are looking for")
        yield scrapy.Request(
            url=f'https://www.czone.com.pk/search.aspx?kw={product}',
            callback=self.parse
        )

    def parse(self, response):
        ...  # parsing logic omitted in the question
The above code works perfectly for a single URL. One way to have multiple URLs would be to put a list in start_urls, but in that case the spider returns the error:
"[scrapy.core.engine] ERROR: Error while obtaining start requests ... ValueError: Missing scheme in request url: h"
Please help!
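For context, "Missing scheme in request url: h" usually means Scrapy ended up iterating over a single string character by character rather than over a list of URLs, so the first "URL" it saw was just the letter h. A minimal sketch of the difference, reusing the search URL from the question with a hardcoded example keyword:

import scrapy

class ProductsSpider(scrapy.Spider):
    name = "gaming"

    # Wrong: a bare string is iterated per character by the default
    # start_requests(), which raises "Missing scheme in request url: h"
    # start_urls = 'https://www.czone.com.pk/search.aspx?kw=laptop'

    # Right: start_urls should be a list of complete URLs
    start_urls = ['https://www.czone.com.pk/search.aspx?kw=laptop']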
Based on your question, here is a solution.
My code:
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "games"
    # Each input() call prompts the user for a product name when the module loads
    product = input("laptop")
    product2 = input("desktop")
    product3 = input("cameras")

    def start_requests(self):
        urls = [
            f'https://www.czone.com.pk/search.aspx?kw={self.product}',
            f'https://www.czone.com.pk/search.aspx?kw={self.product2}',
            f'https://www.czone.com.pk/search.aspx?kw={self.product3}',
        ]
        for url in urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        pass
And the same with an alternative.
Code:
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "games2"
    # Note: the list here is only the prompt text; input() still reads a single line
    product = input(["laptop","desktop","cameras"])

    def start_requests(self):
        yield scrapy.Request(
            url=f'https://www.czone.com.pk/search.aspx?kw={self.product}',
            callback=self.parse
        )

    def parse(self, response):
        pass
Output:
laptop
desktop
cameras
['laptop', 'desktop', 'cameras']
2021-08-12 16:53:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.czone.com.pk/search.aspx?kw=> (referer: None)
2021-08-12 16:53:39 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-12 16:53:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 312,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 19982,
'downloader/response_count': 1,
'downloader/response_status_count/200
Check this code:
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "gaming"

    def __init__(self, product='', **kwargs):
        self.start_urls = [
            f'https://www.czone.com.pk/search.aspx?kw={product}',
            f'https://pcfanatics.pk/search?type=product&q={product}',
            f'https://gtstore.pk/searchresults.php?inputString={product}',
        ]
        super().__init__(**kwargs)

    def start_requests(self):
        for s_url in self.start_urls:
            yield scrapy.Request(
                url=s_url,
                callback=self.parse,
            )

    def parse(self, response):
        print(self.name)
        # ... do parse things ...
The correct way to get input into a scrapy spider is to use the -a option at run time. For example, to run this spider you would use:
scrapy crawl gaming -a product='foo'
or
scrapy runspider <spider_filename> -a product='foo'
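If you need several products in a single run with this approach, one possibility (not part of the original answer, just a sketch under that assumption) is to pass a comma-separated value to -a and split it in __init__, letting Scrapy's default start_requests iterate the resulting start_urls:

import scrapy

class ProductsSpider(scrapy.Spider):
    name = "gaming"

    def __init__(self, product='', **kwargs):
        # hypothetical extension: -a product='laptop,desktop,cameras'
        products = [p.strip() for p in product.split(',') if p.strip()]
        self.start_urls = [
            f'https://www.czone.com.pk/search.aspx?kw={p}' for p in products
        ]
        super().__init__(**kwargs)

    def parse(self, response):
        print(self.name)

This would be run as: scrapy crawl gaming -a product='laptop,desktop,cameras'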
Your URL error is probably due to bad formatting. Using
f'https://www.czone.com.pk/search.aspx?kw={product}',
f'https://pcfanatics.pk/search?type=product&q={product}',
f'https://gtstore.pk/searchresults.php?inputString={product}',
I did not encounter any problems.
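As a side note (not from the answers above), if the user-supplied product name can contain spaces or other special characters, it is safer to URL-encode it before interpolating it into the query string; a minimal sketch using the standard library:

from urllib.parse import quote_plus

product = 'gaming laptop'
url = f'https://www.czone.com.pk/search.aspx?kw={quote_plus(product)}'
print(url)  # https://www.czone.com.pk/search.aspx?kw=gaming+laptop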