Scrapy beginner: unable to get text data from a CSS selector, got an empty array
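I am new to Scrapy. I am trying to scrape this football data: related website
I want to get the position of each player in the table. There are 25 players, but I get 25 empty lists.
Here is my CSS selector: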
for data in response.css('table.items>tbody>tr'):
    print(data.css('table.items>tbody>tr>td:nth-of-type(2)>table.inline-table:nth-of-type(1)>tbody>tr:nth-of-type(2)>td::text').extract())
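When I check the selector in the browser, it locates exactly the data I want, but I cannot get the same data in the Scrapy shell.
Is there a solution? I have been stuck on this for hours.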
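Your CSS selector uses too many elements. You should use something simpler, because some elements may exist in the browser's DOM tree (which is what DevTools shows) without existing in the HTML received from the server. For example, browsers add tbody when they build the DOM for a table, but it is usually absent from the raw HTML, so a selector that requires it matches nothing in Scrapy.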
These simpler selectors give me results:
all_rows = response.css('table.items tr.odd, table.items tr.even')
print('len(all_rows):', len(all_rows))
for row in all_rows:
    info = row.css('td a::text').extract()
    print('info:', info)
    position = row.css('table.inline-table td::text').extract()
    print('position:', position[4])
Like this:
len(all_rows): 25
info: ['Neymar', '17/18', 'Barcelona', 'LaLiga', 'Paris SG', 'Ligue 1', '€222.00m']
position: Left Winger
info: ['Kylian Mbappé', '18/19', 'Monaco', 'Ligue 1', 'Paris SG', 'Ligue 1', '€145.00m']
position: Centre-Forward
info: ['Philippe Coutinho', '17/18', 'Liverpool', 'Premier League', 'Barcelona', 'LaLiga', '€135.00m']
position: Attacking Midfield
info: ['Ousmane Dembélé', '17/18', 'Bor. Dortmund', 'Bundesliga', 'Barcelona', 'LaLiga', '€135.00m']
position: Right Winger
info: ['João Félix', '19/20', 'Benfica', 'Liga NOS', 'Atlético Madrid', 'LaLiga', '€127.20m']
position: Second Striker
info: ['Antoine Griezmann', '19/20', 'Atlético Madrid', 'LaLiga', 'Barcelona', 'LaLiga', '€120.00m']
position: Second Striker
info: ['Cristiano Ronaldo', '18/19', 'Real Madrid', 'LaLiga', 'Juventus', 'Serie A', '€117.00m']
position: Centre-Forward
info: ['Eden Hazard', '19/20', 'Chelsea', 'Premier League', 'Real Madrid', 'LaLiga', '€115.00m']
position: Left Winger
info: ['Paul Pogba', '16/17', 'Juventus', 'Serie A', 'Man Utd', 'Premier League', '€105.00m']
position: Central Midfield
info: ['Gareth Bale', '13/14', 'Spurs', 'Premier League', 'Real Madrid', 'LaLiga', '€101.00m']
position: Right Winger
info: ['Cristiano Ronaldo', '09/10', 'Man Utd', 'Premier League', 'Real Madrid', 'LaLiga', '€94.00m']
position: Centre-Forward
info: ['Gonzalo Higuaín', '16/17', 'SSC Napoli', 'Serie A', 'Juventus', 'Serie A', '€90.00m']
position: Centre-Forward
info: ['Neymar', '13/14', 'Santos FC', 'Série A', 'Barcelona', 'LaLiga', '€88.20m']
position: Left Winger
info: ['Harry Maguire', '19/20', 'Leicester', 'Premier League', 'Man Utd', 'Premier League', '€87.00m']
position: Centre-Back
info: ['Frenkie de Jong', '19/20', 'Ajax', 'Eredivisie', 'Barcelona', 'LaLiga', '€86.00m']
position: Central Midfield
info: ['Matthijs de Ligt', '19/20', 'Ajax', 'Eredivisie', 'Juventus', 'Serie A', '€85.50m']
position: Centre-Back
info: ['Romelu Lukaku', '17/18', 'Everton', 'Premier League', 'Man Utd', 'Premier League', '€84.70m']
position: Centre-Forward
info: ['Virgil van Dijk', '17/18', 'Southampton', 'Premier League', 'Liverpool', 'Premier League', '€84.65m']
position: Centre-Back
info: ['Luis Suárez', '14/15', 'Liverpool', 'Premier League', 'Barcelona', 'LaLiga', '€81.72m']
position: Centre-Forward
info: ['Kai Havertz', '20/21', 'Bay. Leverkusen', 'Bundesliga', 'Chelsea', 'Premier League', '€80.00m']
position: Attacking Midfield
info: ['Lucas Hernández', '19/20', 'Atlético Madrid', 'LaLiga', 'FC Bayern ', 'Bundesliga', '€80.00m']
position: Left-Back
info: ['Nicolas Pépé', '19/20', 'LOSC Lille', 'Ligue 1', 'Arsenal', 'Premier League', '€80.00m']
position: Right Winger
info: ['Kepa', '18/19', 'Athletic', 'LaLiga', 'Chelsea', 'Premier League', '€80.00m']
position: Goalkeeper
info: ['Zinédine Zidane', '01/02', 'Juventus', 'Serie A', 'Real Madrid', 'LaLiga', '€77.50m']
position: Attacking Midfield
info: ['Kevin De Bruyne', '15/16', 'VfL Wolfsburg', 'Bundesliga', 'Man City', 'Premier League', '€76.00m']
position: Attacking Midfield
Here is the full working code. Anyone can copy it into a single file, e.g. script.py, and run it without creating a project: python script.py
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = ['https://www.transfermarkt.com/transfers/transferrekorde/statistik?saison_id=alle&land_id=0&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&w_s=&plus=1']

    def parse(self, response):
        print('url:', response.url)

        # match rows by class instead of going through `tbody`
        all_rows = response.css('table.items tr.odd, table.items tr.even')
        print('len(all_rows):', len(all_rows))

        for row in all_rows:
            # text of all links in the row: name, season, teams, leagues, fee
            info = row.css('td a::text').extract()
            print('info:', info)

            # text of all cells in the nested tables; item 4 is the position
            position = row.css('table.inline-table td::text').extract()
            print('position:', position[4])

            # send to file `output.csv`
            yield {
                'name': info[0],
                'season': info[1],
                'left team': info[2],
                'left league': info[3],
                'joined team': info[4],
                'joined league': info[5],
                'value': info[6],
                'position': position[4],
            }


# --- run without project and save in `output.csv` ---

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # 'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',

    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(MySpider)
c.start()
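The script above indexes info[0] to info[6] and position[4] directly, so a row with fewer cells than expected would raise IndexError and stop that page's parsing. Below is a minimal sketch of a more defensive lookup. The helper name pick is hypothetical and not part of Scrapy, and the sample lists are copied or simulated rather than scraped.

```python
def pick(values, index, default=None):
    """Return values[index] with surrounding whitespace removed, or default if missing."""
    try:
        return values[index].strip()
    except IndexError:
        return default


# One row copied from the output above; note the trailing space in 'FC Bayern '.
info = ['Lucas Hernández', '19/20', 'Atlético Madrid', 'LaLiga',
        'FC Bayern ', 'Bundesliga', '€80.00m']

# Simulates a row where the nested table selector matched nothing.
position = []

item = {
    'name': pick(info, 0),
    'joined team': pick(info, 4),
    'position': pick(position, 4, default=''),
}
print(item)
```

Inside parse, the same helper could replace the direct indexing in the yielded dict, so that an unexpected row produces an item with empty fields instead of an exception.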