Scrapy beginner: not able to get data in text form from CSS selector, got empty array

Question

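I am new to Scrapy. I am trying to scrape this football data: website

I want to get the player position for each player. The table has 25 players, but I get 25 empty lists.

Here is my CSS selector: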

    for data in response.css('table.items>tbody>tr'):
        print(data.css('table.items>tbody>tr>td:nth-of-type(2)>table.inline-table:nth-of-type(1)>tbody>tr:nth-of-type(2)>td::text').extract())

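When I test the selector in the browser, it locates exactly the data I want, but I can't get the same data in the scrapy shell. Is there a solution? I have been stuck for hours.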

Answer 1

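You are using too many elements in your CSS selector. You should use something simpler, because some elements may exist in the browser's DOM tree (which is what DevTools displays) but not in the actual HTML (which you get from the server). For example, tbody usually doesn't exist in the HTML.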
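One way to check which tags the server really sends is to count them in the raw HTML instead of trusting DevTools. The following is a minimal sketch using only the standard library; `TagCounter`, `count_tags`, and `sample_html` are hypothetical names and data made up for illustration, and in a spider you would pass `response.text` instead of the sample string.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags exactly as they appear in the raw HTML (no DOM fix-ups)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical snippet standing in for the server response;
# in a spider you would pass `response.text` instead.
sample_html = """
<table class="items">
  <tr class="odd"><td>Neymar</td></tr>
  <tr class="even"><td>Kylian Mbappé</td></tr>
</table>
"""

counts = count_tags(sample_html)
print('tbody:', counts['tbody'])  # 0 here: a selector containing `tbody` matches nothing
print('tr:', counts['tr'])        # 2 here: the rows are still present
```

If the count for `tbody` is 0, any selector that requires `tbody` returns an empty list, even though the same selector works in the browser, where the DOM has the element added.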

A simpler selector like this gives me the result:

    all_rows = response.css('table.items tr.odd, table.items tr.even')
    print('len(all_rows):', len(all_rows))
    
    for row in all_rows:
        info = row.css('td a::text').extract()
        print('info:', info)
        position = row.css('table.inline-table td::text').extract()
        print('position:', position[4])

The output looks like this:

len(all_rows): 25

info: ['Neymar', '17/18', 'Barcelona', 'LaLiga', 'Paris SG', 'Ligue 1', '€222.00m']
position: Left Winger
info: ['Kylian Mbappé', '18/19', 'Monaco', 'Ligue 1', 'Paris SG', 'Ligue 1', '€145.00m']
position: Centre-Forward
info: ['Philippe Coutinho', '17/18', 'Liverpool', 'Premier League', 'Barcelona', 'LaLiga', '€135.00m']
position: Attacking Midfield
info: ['Ousmane Dembélé', '17/18', 'Bor. Dortmund', 'Bundesliga', 'Barcelona', 'LaLiga', '€135.00m']
position: Right Winger
info: ['João Félix', '19/20', 'Benfica', 'Liga NOS', 'Atlético Madrid', 'LaLiga', '€127.20m']
position: Second Striker
info: ['Antoine Griezmann', '19/20', 'Atlético Madrid', 'LaLiga', 'Barcelona', 'LaLiga', '€120.00m']
position: Second Striker
info: ['Cristiano Ronaldo', '18/19', 'Real Madrid', 'LaLiga', 'Juventus', 'Serie A', '€117.00m']
position: Centre-Forward
info: ['Eden Hazard', '19/20', 'Chelsea', 'Premier League', 'Real Madrid', 'LaLiga', '€115.00m']
position: Left Winger
info: ['Paul Pogba', '16/17', 'Juventus', 'Serie A', 'Man Utd', 'Premier League', '€105.00m']
position: Central Midfield
info: ['Gareth Bale', '13/14', 'Spurs', 'Premier League', 'Real Madrid', 'LaLiga', '€101.00m']
position: Right Winger
info: ['Cristiano Ronaldo', '09/10', 'Man Utd', 'Premier League', 'Real Madrid', 'LaLiga', '€94.00m']
position: Centre-Forward
info: ['Gonzalo Higuaín', '16/17', 'SSC Napoli', 'Serie A', 'Juventus', 'Serie A', '€90.00m']
position: Centre-Forward
info: ['Neymar', '13/14', 'Santos FC', 'Série A', 'Barcelona', 'LaLiga', '€88.20m']
position: Left Winger
info: ['Harry Maguire', '19/20', 'Leicester', 'Premier League', 'Man Utd', 'Premier League', '€87.00m']
position: Centre-Back
info: ['Frenkie de Jong', '19/20', 'Ajax', 'Eredivisie', 'Barcelona', 'LaLiga', '€86.00m']
position: Central Midfield
info: ['Matthijs de Ligt', '19/20', 'Ajax', 'Eredivisie', 'Juventus', 'Serie A', '€85.50m']
position: Centre-Back
info: ['Romelu Lukaku', '17/18', 'Everton', 'Premier League', 'Man Utd', 'Premier League', '€84.70m']
position: Centre-Forward
info: ['Virgil van Dijk', '17/18', 'Southampton', 'Premier League', 'Liverpool', 'Premier League', '€84.65m']
position: Centre-Back
info: ['Luis Suárez', '14/15', 'Liverpool', 'Premier League', 'Barcelona', 'LaLiga', '€81.72m']
position: Centre-Forward
info: ['Kai Havertz', '20/21', 'Bay. Leverkusen', 'Bundesliga', 'Chelsea', 'Premier League', '€80.00m']
position: Attacking Midfield
info: ['Lucas Hernández', '19/20', 'Atlético Madrid', 'LaLiga', 'FC Bayern ', 'Bundesliga', '€80.00m']
position: Left-Back
info: ['Nicolas Pépé', '19/20', 'LOSC Lille', 'Ligue 1', 'Arsenal', 'Premier League', '€80.00m']
position: Right Winger
info: ['Kepa', '18/19', 'Athletic', 'LaLiga', 'Chelsea', 'Premier League', '€80.00m']
position: Goalkeeper
info: ['Zinédine Zidane', '01/02', 'Juventus', 'Serie A', 'Real Madrid', 'LaLiga', '€77.50m']
position: Attacking Midfield
info: ['Kevin De Bruyne', '15/16', 'VfL Wolfsburg', 'Bundesliga', 'Man City', 'Premier League', '€76.00m']
position: Attacking Midfield

Here is full working code that anyone can copy into a single file (e.g. script.py) and run without creating a project: python script.py

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = ['https://www.transfermarkt.com/transfers/transferrekorde/statistik?saison_id=alle&land_id=0&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&w_s=&plus=1']

    def parse(self, response):
        print('url:', response.url)

        all_rows = response.css('table.items tr.odd, table.items tr.even')
        print('len(all_rows):', len(all_rows))
        
        for row in all_rows:
            info = row.css('td a::text').extract()
            print('info:', info)
            position = row.css('table.inline-table td::text').extract()
            print('position:', position[4])
            
            # send to file `output.csv`
            yield {
                'name': info[0],
                'season': info[1],
                'left team': info[2],
                'left league': info[3],
                'joined team': info[4],
                'joined league': info[5],
                'value': info[6],
                'position': position[4]
            }
            
# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
#    'USER_AGENT': 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(MySpider)
c.start()
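The code above reads fixed indexes (`info[0]` to `info[6]` and `position[4]`), which raises `IndexError` if a row ever has fewer text nodes than expected. The following is a hedged sketch of a guard, not part of the original answer: `row_to_item`, `FIELDS`, and `POSITION_INDEX` are hypothetical names, and `sample_position` is invented data that only imitates the shape of the extracted list.

```python
# Field names in the order the answer reads them from `info`.
FIELDS = ('name', 'season', 'left team', 'left league',
          'joined team', 'joined league', 'value')
POSITION_INDEX = 4  # index used in the answer; it depends on the page layout


def row_to_item(info, position):
    """Return a dict for one row, or None if the row has fewer values than expected."""
    if len(info) < len(FIELDS) or len(position) <= POSITION_INDEX:
        return None
    item = dict(zip(FIELDS, info))
    # strip() only removes surrounding whitespace from the position text
    item['position'] = position[POSITION_INDEX].strip()
    return item


# `sample_info` is copied from the output above; `sample_position` is made up.
sample_info = ['Neymar', '17/18', 'Barcelona', 'LaLiga',
               'Paris SG', 'Ligue 1', '€222.00m']
sample_position = ['\n', '\n', '\n', '\n', 'Left Winger']

print(row_to_item(sample_info, sample_position))
print(row_to_item([], []))  # None: incomplete rows are skipped instead of crashing
```

Inside `parse`, this could be used as `item = row_to_item(info, position)` followed by `if item: yield item`, so that an unexpected row is skipped rather than stopping the spider.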
