Unable to understand the ValueError: invalid literal for int() with base 10: 'تومان'
Unable to understand the ValueError: invalid literal for int() with base 10: 'تومان'
我的抓取工具工作不正常,我找不到解决方法。
这是我的蜘蛛的相关部分:
def parse(self, response):
original_price=0
discounted_price=0
star=0
discounted_percent=0
try:
for product in response.xpath("//ul[@class='c-listing__items js-plp-products-list']/li"):
title= product.xpath(".//div/div[2]/div/div/a/text()").get()
if product.xpath(".//div/div[2]/div[2]/div[1]/text()"):
star= float(str(product.xpath(".//div/div[2]/div[2]/div[1]/text()").get()))
if product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()"):
discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
if product.xpath(".//div/div[2]/div[3]/div/div/div/text()"):
discounted_price= int(str(product.xpath(".//div/div[2]/div[3]/div/div/div/text()").get().strip()).replace(',', ''))
if product.xpath(".//div/div[2]/div[3]/div/div/del/text()"):
original_price= int(str(product.xpath(".//div/div[2]/div[3]/div/div/del/text()").get().strip()).replace(',', ''))
discounted_amount= original_price-discounted_price
else:
original_price= print("not available")
discounted_amount= print("not available")
url= response.urljoin(product.xpath(".//div/div[2]/div/div/a/@href").get())
这是我的日志:
2020-10-21 16:49:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.digikala.com/search/category-book/> from <GET https://www.digikala.com/search/category-book>
2020-10-21 16:49:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.digikala.com/search/category-book/> (referer: None)
2020-10-21 16:49:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.digikala.com/search/category-book/> (referer: None)
Traceback (most recent call last):
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\projects\digi_allbooks\digi_allbooks\spiders\allbooks.py", line 31, in parse
discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
ValueError: invalid literal for int() with base 10: 'تومان'
2020-10-21 16:49:57 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-21 16:49:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 939,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 90506,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 21, 13, 19, 57, 630044),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2020, 10, 21, 13, 19, 55, 914304)}
2020-10-21 16:49:57 [scrapy.core.engine] INFO: Spider closed (finished)
我猜它说在 int() 函数中有一个字符串 returns ValueError 但我使用的 XPath 目标是一个数字,而不是一个字符串。
我无法正确获取错误,所以我没有找到解决方案。有人可以帮帮我吗?
在至少一次迭代中,这条线正在抓取 تومان
而不是整数
discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
根据 google 搜索,这似乎是一个货币单位。你需要处理你的 XPaths,或者让蜘蛛忽略这个 return 因为这个项目没有折扣。
看起来这个 XPath 可能是更符合您的意图的选择:(不过我还没有检查所有项目)
product.xpath(".//div[@class="c-price__discount-oval"]/span/text()").get()
我的抓取工具工作不正常,我找不到解决方法。
这是我的蜘蛛的相关部分:
def parse(self, response):
original_price=0
discounted_price=0
star=0
discounted_percent=0
try:
for product in response.xpath("//ul[@class='c-listing__items js-plp-products-list']/li"):
title= product.xpath(".//div/div[2]/div/div/a/text()").get()
if product.xpath(".//div/div[2]/div[2]/div[1]/text()"):
star= float(str(product.xpath(".//div/div[2]/div[2]/div[1]/text()").get()))
if product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()"):
discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
if product.xpath(".//div/div[2]/div[3]/div/div/div/text()"):
discounted_price= int(str(product.xpath(".//div/div[2]/div[3]/div/div/div/text()").get().strip()).replace(',', ''))
if product.xpath(".//div/div[2]/div[3]/div/div/del/text()"):
original_price= int(str(product.xpath(".//div/div[2]/div[3]/div/div/del/text()").get().strip()).replace(',', ''))
discounted_amount= original_price-discounted_price
else:
original_price= print("not available")
discounted_amount= print("not available")
url= response.urljoin(product.xpath(".//div/div[2]/div/div/a/@href").get())
这是我的日志:
2020-10-21 16:49:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.digikala.com/search/category-book/> from <GET https://www.digikala.com/search/category-book>
2020-10-21 16:49:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.digikala.com/search/category-book/> (referer: None)
2020-10-21 16:49:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.digikala.com/search/category-book/> (referer: None)
Traceback (most recent call last):
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\projects\digi_allbooks\digi_allbooks\spiders\allbooks.py", line 31, in parse
discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
ValueError: invalid literal for int() with base 10: 'تومان'
2020-10-21 16:49:57 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-21 16:49:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 939,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 90506,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 21, 13, 19, 57, 630044),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2020, 10, 21, 13, 19, 55, 914304)}
2020-10-21 16:49:57 [scrapy.core.engine] INFO: Spider closed (finished)
我猜它说在 int() 函数中有一个字符串 returns ValueError 但我使用的 XPath 目标是一个数字,而不是一个字符串。 我无法正确获取错误,所以我没有找到解决方案。有人可以帮帮我吗?
在至少一次迭代中,这条线正在抓取 تومان
而不是整数
discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
根据 google 搜索,这似乎是一个货币单位。你需要处理你的 XPaths,或者让蜘蛛忽略这个 return 因为这个项目没有折扣。
看起来这个 XPath 可能是更符合您的意图的选择:(不过我还没有检查所有项目)
product.xpath(".//div[@class="c-price__discount-oval"]/span/text()").get()