scrapy xpath returns empty
I'm trying to use Scrapy to crawl the list of Olympic Games. I'm fairly sure my XPath is correct, but it always ends up returning an empty list. Any suggestions are welcome. Thanks.
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Selector
from eventSpider.items import EventspiderItem

class EventsSpider(scrapy.Spider):
    name = 'eventsSpider'

    def start_requests(self):
        start_urls = [
            'https://olympics.com/en/olympic-games'
        ]
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse_items)

    def parse_items(self, response):
        eventsUrls = response.xpath("//div[@id='olympic-all-games']/div[1]/a/@href").extract()
        print("event url is: {}".format(eventsUrls))
        return eventsUrls
Expected output:
Links to the individual Games (e.g. links to Tokyo 2020 and Rio 2016)
Edit:
As you can see in the picture, div[@id='olympic-all-games'] is right there. It helps limit the number of links we get, since //*[@data-cy="next-link"] would return links of different kinds. But somehow Scrapy doesn't recognize div[@id='olympic-all-games'].
Your selector is incorrect. Try the code below. Note that I simplified your code by removing unused imports and a function that isn't needed (at least for the snippet you shared).
import scrapy

class EventsSpider(scrapy.Spider):
    name = 'eventsSpider'
    start_urls = ['https://olympics.com/en/olympic-games']

    def parse(self, response):
        for item in response.xpath("//*[@data-cy='next-link']"):
            yield {
                'name': item.xpath("./text()").get(),
                'link': item.xpath("./@href").get()
            }
If I save the code above in a file named olympics.py and run the spider with scrapy runspider olympics.py, I get the following output.
2021-12-15 05:18:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://olympics.com/en/olympic-games>
{'name': 'Paris 2024', 'link': '/en/olympic-games/paris-2024'}
2021-12-15 05:18:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://olympics.com/en/olympic-games>
{'name': 'Milano Cortina 2026', 'link': '/en/olympic-games/milano-cortina-2026'}
2021-12-15 05:18:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://olympics.com/en/olympic-games>
{'name': 'LA 2028', 'link': '/en/olympic-games/los-angeles-2028'}
2021-12-15 05:18:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://olympics.com/en/olympic-games>
{'name': 'Brisbane 2032', 'link': '/en/olympic-games/brisbane-2032'}
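Note that the scraped links are relative paths. If you need absolute URLs, Scrapy's response.urljoin resolves them against the page URL; it is a thin wrapper over the standard library's urljoin, so the behaviour can be sketched like this (the example URL is taken from the output above):

```python
from urllib.parse import urljoin

# Resolve a relative href (as yielded by the spider) against the page URL.
base = 'https://olympics.com/en/olympic-games'
link = '/en/olympic-games/paris-2024'

absolute = urljoin(base, link)
print(absolute)  # https://olympics.com/en/olympic-games/paris-2024
```

Inside a callback you would simply write response.urljoin(item.xpath("./@href").get()).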
Here is a solution using json and Scrapy:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Selector
from eventSpider.items import EventspiderItem
import json

class EventsSpider(scrapy.Spider):
    name = 'eventsSpider'

    def start_requests(self):
        start_urls = [
            'https://olympics.com/en/olympic-games'
        ]
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse_items)

    def parse_items(self, response):
        data = response.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
        json_data = json.loads(data)
        eventsUrls = []
        for game in json_data['props']['pageProps']['olympicGamesNoYog']:  # all the games from 2020 back to 1896
            eventsUrls.append(game['meta']['url'])
        print(f"event url is: {eventsUrls}")
        return {'eventsUrls': eventsUrls[:10]}  # first ten entries, i.e. the ten most recent Games
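The JSON traversal above can be checked in isolation on a minimal mocked payload. The key names ('props', 'pageProps', 'olympicGamesNoYog', 'meta', 'url') are assumed to match what the page embeds in its __NEXT_DATA__ script tag, as the answer describes; the real payload holds many more games and fields.

```python
import json

# Tiny mock of the structure assumed to live in the page's __NEXT_DATA__
# script tag (field names as used in the spider above).
next_data = json.dumps({
    "props": {
        "pageProps": {
            "olympicGamesNoYog": [
                {"meta": {"url": "/en/olympic-games/tokyo-2020"}},
                {"meta": {"url": "/en/olympic-games/rio-2016"}},
            ]
        }
    }
})

json_data = json.loads(next_data)
events_urls = [game['meta']['url']
               for game in json_data['props']['pageProps']['olympicGamesNoYog']]
print(events_urls)  # ['/en/olympic-games/tokyo-2020', '/en/olympic-games/rio-2016']
```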