Scrapy KeyError: 0 parse next page
Setup
I am using Scrapy with Python 3.7 to scrape this website for its dwelling ads.
For each ad I extract the dwelling's characteristics, such as size, price, etc.
The site shows 10 ads per page, so I need to iterate over the pages.
My code
class DwellingSpider(scrapy.Spider):
    name = 'dwelling'
    start_urls = list(df['spitogatos_url'])

    def parse(self, response):
        # obtain the ad list element
        result_list = response.css('#searchDetailsListings')

        # select each ad from the list
        for ad in result_list.xpath('div'):
            # (code that extracts the characteristics per ad and yields them)
            ...

        # obtain the next page URL
        next_page = response.css('#pagination > ul > li.next > a').xpath('@href').extract_first()

        # send the next page URL back to parse
        if len(next_page) > 0:
            yield scrapy.Request(str(next_page), callback=self.parse)
where list(df['spitogatos_url']) is a list containing the X URLs I want to scrape, which looks like
['https://en.spitogatos.gr/search/results/residential/sale/r194/m194m?ref=refinedSearchSR',
'https://en.spitogatos.gr/search/results/residential/sale/r153/m153m187m?ref=refinedSearchSR']
Problem
Obtaining the dwelling characteristics per ad works. The problem is with correctly GETting the next page:
[scrapy.core.scraper] ERROR: Spider error processing <GET https://en.spitogatos.gr/search/results/residential/sale/r194/m194m/offset_10> (referer: https://en.spitogatos.gr/search/results/residential/sale/r194/m194m?ref=refinedSearchSR)
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4730, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas/_libs/index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 992, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
I am not sure what is causing the KeyError: 0, nor how to fix it.
Any ideas?
Edit
I have found that if I use a next-page URL as the starting point, i.e.

start_urls = ['https://en.spitogatos.gr/search/results/residential/sale/r177/m177m183m/offset_10']

I immediately get the same error:
ERROR: Spider error processing <GET https://en.spitogatos.gr/search/results/residential/sale/r177/m177m183m/offset_10> (referer: None)
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4730, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas/_libs/index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 992, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
Try this:
next_page = response.css('[rel="next"] ::attr(href)').extract_first()
if next_page:
    yield scrapy.Request(str(next_page), callback=self.parse)
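One detail worth checking alongside the selector: scrapy.Request expects an absolute URL, while the href extracted from a pagination link may be relative. Scrapy's response.follow handles the join for you; the resolution step itself can be sketched with the standard library (the URLs below are illustrative, taken from the log output above):

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (would be response.url in the spider)
page_url = 'https://en.spitogatos.gr/search/results/residential/sale/r194/m194m?ref=refinedSearchSR'

# The extracted href may be absolute or relative; urljoin handles both cases.
relative_next = '/search/results/residential/sale/r194/m194m/offset_10'
absolute_next = urljoin(page_url, relative_next)
print(absolute_next)
# https://en.spitogatos.gr/search/results/residential/sale/r194/m194m/offset_10
```

Inside the spider, `yield response.follow(next_page, callback=self.parse)` does the equivalent join automatically.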
It is probably the array in start_urls. I ran into the same problem; I just tried it without the array, and so far the error has not occurred.
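This fits the traceback, whose frames are all in pandas index lookup code rather than Scrapy. A pandas Series with an integer index that does not contain the label 0 (e.g. after filtering a DataFrame) raises exactly "KeyError: 0" when accessed with [0], because pandas treats that as a label lookup, not a positional one. A minimal, hypothetical reproduction (the URLs and index values are made up):

```python
import pandas as pd

# A Series whose index does not start at 0 -- e.g. the result of
# filtering a DataFrame before taking df['spitogatos_url'].
urls = pd.Series(
    ['https://example.com/a', 'https://example.com/b'],
    index=[5, 6],
)

# Label-based lookup: urls[0] looks for the *label* 0, which is missing.
try:
    urls[0]
except KeyError as exc:
    print(f'KeyError: {exc}')  # KeyError: 0

# Converting to a plain list of strings sidesteps pandas indexing entirely.
start_urls = [str(u) for u in urls.tolist()]
print(start_urls[0])  # positional access always works on a plain list
```

So building start_urls as a plain list of str, rather than passing anything pandas-backed into the spider, should remove the KeyError regardless of what the DataFrame's index looks like.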