Scrapy Spider scraping wrong data
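I'm building my first spider with Scrapy. It is meant to scrape a betting site and return the names of the teams playing and their odds. I use a for loop to iterate over the class that contains all the required data, but the code returns the first game's data 9 times (there are 9 games). What am I doing wrong?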
import scrapy


class SportsBetSpider(scrapy.Spider):
    name = "odds"

    def start_requests(self):
        urls = [
            "https://www.sportsbet.com.au/betting/australian-rules/afl/round-3"
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for post in response.css('li.cardOuterItem_fn8ai8t'):
            yield {
                'Team 1': post.xpath('//span[@class="size12_fq5j3k2 normal_fgzdi7m caption_f4zed5e"]/text()')[0].get(),
                'Odds 1': post.xpath('//span[@class="size14_f7opyze medium_f1wf24vo priceTextSize_frw9zm9"]/text()')[0].get(),
                'Team 2': post.xpath('//span[@class="size12_fq5j3k2 normal_fgzdi7m caption_f4zed5e"]/text()')[1].get(),
                'Odds 2': post.xpath('//span[@class="size14_f7opyze medium_f1wf24vo priceTextSize_frw9zm9"]/text()')[1].get()
            }
The output is:
2020-06-14 20:41:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.sportsbet.com.au/betting/australian-rules/afl/round-3> (referer: None)
2020-06-14 20:41:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sportsbet.com.au/betting/australian-rules/afl/round-3>
{'Team 1': 'Richmond', 'Odds 1': '1.36', 'Team 2': 'Hawthorn', 'Odds 2': '3.08'}
2020-06-14 20:41:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sportsbet.com.au/betting/australian-rules/afl/round-3>
{'Team 1': 'Richmond', 'Odds 1': '1.36', 'Team 2': 'Hawthorn', 'Odds 2': '3.08'}
2020-06-14 20:41:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sportsbet.com.au/betting/australian-rules/afl/round-3>
{'Team 1': 'Richmond', 'Odds 1': '1.36', 'Team 2': 'Hawthorn', 'Odds 2': '3.08'}
2020-06-14 20:41:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sportsbet.com.au/betting/australian-rules/afl/round-3>
{'Team 1': 'Richmond', 'Odds 1': '1.36', 'Team 2': 'Hawthorn', 'Odds 2': '3.08'}
2020-06-14 20:41:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sportsbet.com.au/betting/australian-rules/afl/round-3>
{'Team 1': 'Richmond', 'Odds 1': '1.36', 'Team 2': 'Hawthorn', 'Odds 2': '3.08'}
2020-06-14 20:41:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sportsbet.com.au/betting/australian-rules/afl/round-3>
{'Team 1': 'Richmond', 'Odds 1': '1.36', 'Team 2': 'Hawthorn', 'Odds 2': '3.08'}
2020-06-14 20:41:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sportsbet.com.au/betting/australian-rules/afl/round-3>
{'Team 1': 'Richmond', 'Odds 1': '1.36', 'Team 2': 'Hawthorn', 'Odds 2': '3.08'}
2020-06-14 20:41:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sportsbet.com.au/betting/australian-rules/afl/round-3>
{'Team 1': 'Richmond', 'Odds 1': '1.36', 'Team 2': 'Hawthorn', 'Odds 2': '3.08'}
2020-06-14 20:41:38 [scrapy.core.scraper] DEBUG: Scraped from <200 #Only 8 links allowed but same link as before>
{'Team 1': 'Richmond', 'Odds 1': '1.36', 'Team 2': 'Hawthorn', 'Odds 2': '3.08'}
2020-06-14 20:41:38 [scrapy.core.engine] INFO: Closing spider (finished)
There are 9 fixtures, but the output only shows the scraped data for the first one. How can this be fixed?
I found the problem. I was using an absolute XPath instead of a relative XPath. I fixed it as follows:
yield {
    'Team 1': post.xpath('.//span[@class="size12_fq5j3k2 normal_fgzdi7m caption_f4zed5e"]/text()')[0].get(),
    'Odds 1': post.xpath('.//span[@class="size14_f7opyze medium_f1wf24vo priceTextSize_frw9zm9"]/text()')[0].get(),
    'Team 2': post.xpath('.//span[@class="size12_fq5j3k2 normal_fgzdi7m caption_f4zed5e"]/text()')[1].get(),
    'Odds 2': post.xpath('.//span[@class="size14_f7opyze medium_f1wf24vo priceTextSize_frw9zm9"]/text()')[1].get()
}
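Using .// instead of // solved the problem. A leading // searches the whole document no matter which node it is called on, so every iteration matched the spans of the first fixture. A leading .// searches only below the current node.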
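For anyone who wants to see the difference without hitting the site, here is a minimal sketch. It uses only the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the markup, the class names (card, team, odds) and the second fixture (Team C, Team D) are made up for illustration. ElementTree does not accept a leading // on an element, so the document-wide search is imitated by querying the root on every iteration, which is roughly what // ends up doing in the spider above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two fixtures, each in its own <li>.
# The class names and the second fixture are invented for this sketch.
HTML = """
<ul>
  <li class="card">
    <span class="team">Richmond</span><span class="odds">1.36</span>
    <span class="team">Hawthorn</span><span class="odds">3.08</span>
  </li>
  <li class="card">
    <span class="team">Team C</span><span class="odds">1.90</span>
    <span class="team">Team D</span><span class="odds">1.90</span>
  </li>
</ul>
""".strip()

root = ET.fromstring(HTML)
cards = root.findall(".//li[@class='card']")


def first_two_teams(node):
    """Return the text of the first two team spans found below `node`."""
    spans = node.findall(".//span[@class='team']")
    return [span.text for span in spans[:2]]


# Search from the document root on every iteration and ignore `card`.
# This imitates the leading // in the question: each pass sees the same spans.
document_wide = [first_two_teams(root) for card in cards]

# Search below `card` itself. This is what the leading .// does.
per_card = [first_two_teams(card) for card in cards]

print(document_wide)
print(per_card)
```

The first list repeats the first fixture once per card, the second list has one distinct pair per card. In a real spider you would keep post.xpath('.//span[...]') as in the fix above; the sketch only isolates the scoping behaviour.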