Grab specific element in HTML using Python: BeautifulSoup4

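I'm scraping a website with BeautifulSoup. I'm getting the hang of grabbing content that is displayed on the page, but the HTML also embeds a unique identifier, with no title of its own, that I would like to grab. For example: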

<tbody><tr ><th scope="row" class="right " data-stat="ranker" csk="1" >1</th><td class="left " data-stat="pos" csk="1" ><strong>C</strong></td><td class="left " data-append-csv="mccanja02" data-stat="player" csk="McCann,James" ><strong><a href="/players/m/mccanja02.shtml">James McCann</a></strong></td><td class="right " data-stat="age" >32</td><td class="right " data-stat="G" >13</td><td class="right " data-stat="PA" >42</td><td class="right " data-stat="AB" >36</td><td class="right " data-stat="R" >5</td><td class="right " data-stat="H" >7</td><td class="right " data-stat="2B" >2</td><td class="right iz" data-stat="3B" >0</td><td class="right " data-stat="HR" >1</td><td class="right " data-stat="RBI" >5</td><td class="right " data-stat="SB" >1</td><td class="right iz" data-stat="CS" >0</td><td class="right " data-stat="BB" >2</td><td class="right " data-stat="SO" >7</td><td class="right " data-stat="batting_avg" >.194</td><td class="right " data-stat="onbase_perc" >.286</td><td class="right " data-stat="slugging_perc" >.333</td><td class="right " data-stat="onbase_plus_slugging" >.619</td><td class="right " data-stat="onbase_plus_slugging_plus" >87</td><td class="right " data-stat="TB" >12</td><td class="right " data-stat="GIDP" >1</td><td class="right " data-stat="HBP" >3</td><td class="right iz" data-stat="SH" >0</td><td class="right " data-stat="SF" >1</td><td class="right iz" data-stat="IBB" >0</td></tr>

I only want to get "mccanja02", because it can be appended to a URL to point to the player's page. So far I have tried something like this:

# grab players UID
rowsUID = tableTeamBatting.find_all('tr')
for rowUID in rowsUID:
    playerUID = rowUID.find('td', {'data-append-csv'})
    if playerUID:
        playerUID = playerUID.text
        print(playerUID)

But there is no title to associate with it, the way there is when I want to grab the player's name:

# grab players name
rows = tableTeamBatting.find_all('tr')
for row in rows:
    players = []
    player = row.find('td', {'data-stat' : 'player'})
    if player:
        player = player.text
        print(player)

I couldn't get @F.Hoque's solution to output exactly what I wanted, so I made this monstrosity:

# grab players UID
rowsUID = tableTeamBatting.find_all('tr')
for rowUID in rowsUID:
    playerUID = rowUID.select('a[href]')
    playerUID = playerUID if playerUID else None
    if playerUID == None:
        continue
    else:
        pUID = str(playerUID)
        pUID = pUID.split('/')
        for p in range(len(pUID)):
            if '.shtml' in pUID[p]:
                stor = pUID[p].split('.shtml')
                print(stor[0])

This gives me the pUID I'm looking for. The reason I couldn't use the code from the comments is that it would return this:

<td class="left" csk="McCann,James" data-append-csv="mccanja02" data-stat="player"><strong><a href="/players/m/mccanja02.shtml">James McCann</a></strong></td>
<td class="left" csk="Alonso,Pete" data-append-csv="alonspe01" data-stat="player"><strong><a href="/players/a/alonspe01.shtml">Pete Alonso</a></strong></td>
<td class="left" csk="McNeil,Jeff" data-append-csv="mcneije01" data-stat="player"><strong><a href="/players/m/mcneije01.shtml">Jeff McNeil</a>*</strong></td>
<td class="left" csk="Lindor,Francisco" data-append-csv="lindofr01" data-stat="player"><strong><a href="/players/l/lindofr01.shtml">Francisco Lindor</a>#</strong></td>...

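I'm only looking for data-append-csv=pUID. I do appreciate the help, though; I dug into the documentation a bit and found a few things. I'm open to any suggestions on how to improve this.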
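
For what it's worth, the string splitting above can probably be shortened by reading the href attribute directly and taking the file name without its extension. This is a minimal sketch, assuming every player link has the form /players/<letter>/<id>.shtml as in the rows shown; the sample markup is trimmed from that output and the name uid_from_href is made up for illustration.

```python
from pathlib import PurePosixPath

from bs4 import BeautifulSoup

# Sample rows trimmed from the output above (player cell only),
# plus one row without a link to show the skip path
sample = '''
<table><tbody>
<tr><td data-stat="player"><strong><a href="/players/m/mccanja02.shtml">James McCann</a></strong></td></tr>
<tr><td data-stat="player"><strong><a href="/players/a/alonspe01.shtml">Pete Alonso</a></strong></td></tr>
<tr><td data-stat="age">32</td></tr>
</tbody></table>
'''


def uid_from_href(href):
    """Hypothetical helper: '/players/m/mccanja02.shtml' -> 'mccanja02'."""
    return PurePosixPath(href).stem


soup = BeautifulSoup(sample, 'html.parser')
uids = []
for row in soup.find_all('tr'):
    link = row.select_one('a[href]')
    if link is None:
        continue  # rows without a link carry no player id
    uids.append(uid_from_href(link['href']))

print(uids)
```

Reading link['href'] avoids converting the whole tag list to a string, so the result does not depend on how BeautifulSoup happens to serialize the markup.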

mccanja02 is the value of the data-append-csv attribute, so you can't grab it by calling .text. You can get it with a CSS selector, as follows:

html='''
<html>
 <body>
  <tbody>
   <tr>
    <th class="right" csk="1" data-stat="ranker" scope="row">
     1
    </th>
    <td class="left" csk="1" data-stat="pos">
     <strong>
      C
     </strong>
    </td>
    <td class="left" csk="McCann,James" data-append-csv="mccanja02" data-stat="player">       
     <strong>
      <a href="/players/m/mccanja02.shtml">
       James McCann
      </a>
     </strong>
    </td>
    <td class="right" data-stat="age">
     32
    </td>
    <td class="right" data-stat="G">
     13
    </td>
    <td class="right" data-stat="PA">
     42
    </td>
    <td class="right" data-stat="AB">
     36
    </td>
    <td class="right" data-stat="R">
     5
    </td>
    <td class="right" data-stat="H">
     7
    </td>
    <td class="right" data-stat="2B">
     2
    </td>
    <td class="right iz" data-stat="3B">
     0
    </td>
    <td class="right" data-stat="HR">
     1
    </td>
    <td class="right" data-stat="RBI">
     5
    </td>
    <td class="right" data-stat="SB">
     1
    </td>
    <td class="right iz" data-stat="CS">
     0
    </td>
    <td class="right" data-stat="BB">
     2
    </td>
    <td class="right" data-stat="SO">
     7
    </td>
    <td class="right" data-stat="batting_avg">
     .194
    </td>
    <td class="right" data-stat="onbase_perc">
     .286
    </td>
    <td class="right" data-stat="slugging_perc">
     .333
    </td>
    <td class="right" data-stat="onbase_plus_slugging">
     .619
    </td>
    <td class="right" data-stat="onbase_plus_slugging_plus">
     87
    </td>
    <td class="right" data-stat="TB">
     12
    </td>
    <td class="right" data-stat="GIDP">
     1
    </td>
    <td class="right" data-stat="HBP">
     3
    </td>
    <td class="right iz" data-stat="SH">
     0
    </td>
    <td class="right" data-stat="SF">
     1
    </td>
    <td class="right iz" data-stat="IBB">
     0
    </td>
   </tr>
  </tbody>
 </body>
</html>
'''

from bs4 import BeautifulSoup

tableTeamBatting = BeautifulSoup(html, 'lxml')
# print(tableTeamBatting.prettify())

rowsUID = tableTeamBatting.select('tr')
for rowUID in rowsUID:
    # Select the cell that carries the attribute, then read the attribute's value
    playerUID = rowUID.select_one('td[data-append-csv]')
    playerUID = playerUID.get('data-append-csv') if playerUID else None

    print(playerUID)

Output:

mccanja02
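
Since the goal is to append the id to a URL, here is a rough sketch that collects the ids from several rows and builds a player link for each. The /players/<first letter>/<id>.shtml pattern is inferred from the href values in the question and is an assumption, not a documented URL scheme. player_url is a hypothetical helper, and the last sample row is a made-up cell without an id.

```python
from bs4 import BeautifulSoup

BASE = 'https://www.baseball-reference.com'

# Trimmed stand-in for the real table; the third row is invented
# to show that cells without the attribute are ignored
rows_html = '''
<table><tbody>
<tr><td data-append-csv="mccanja02" data-stat="player">James McCann</td></tr>
<tr><td data-append-csv="alonspe01" data-stat="player">Pete Alonso</td></tr>
<tr><td data-stat="player">cell without an id</td></tr>
</tbody></table>
'''


def player_url(uid):
    # Assumed pattern, taken from the hrefs shown in the question
    return f'{BASE}/players/{uid[0]}/{uid}.shtml'


soup = BeautifulSoup(rows_html, 'html.parser')
# attrs={'data-append-csv': True} matches cells that have the attribute at all
cells = soup.find_all('td', attrs={'data-append-csv': True})
urls = {td['data-append-csv']: player_url(td['data-append-csv']) for td in cells}

for uid, url in urls.items():
    print(uid, url)
```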

I think you can use the data-stat attribute to select the children of tbody, anchored by the id of the batting table.

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.baseball-reference.com/teams/NYM/2022.shtml', headers = {'User-Agent':'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
players = {i['csk']:i['data-append-csv'] for i in soup.select('#div_team_batting tbody td[data-stat="player"]')}
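
One thing to watch with the comprehension above: i['data-append-csv'] raises a KeyError if a matched cell lacks that attribute. Below is a small offline sketch of one way to guard against that, assuming the table sits inside an element with id div_team_batting as the selector implies. The markup is a trimmed stand-in rather than the live page, and the last row is invented.

```python
from bs4 import BeautifulSoup

# Stand-in markup; only the attributes used by the selector are kept
page = '''
<div id="div_team_batting"><table>
<thead><tr><th data-stat="player">Name</th></tr></thead>
<tbody>
<tr><td csk="McCann,James" data-append-csv="mccanja02" data-stat="player">James McCann</td></tr>
<tr><td csk="Alonso,Pete" data-append-csv="alonspe01" data-stat="player">Pete Alonso</td></tr>
<tr><td data-stat="player">no id on this cell</td></tr>
</tbody></table></div>
'''

soup = BeautifulSoup(page, 'html.parser')
# Requiring [data-append-csv] and [csk] in the selector skips cells that lack them
selector = '#div_team_batting tbody td[data-stat="player"][data-append-csv][csk]'
players = {td['csk']: td['data-append-csv'] for td in soup.select(selector)}
print(players)
```

Putting the attribute requirements into the selector keeps the comprehension on one line while avoiding a try/except around each lookup.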

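A better way is to use the player's id and go through the register page (which redirects to the player's URL). You can get all of the ids in a single request from sup_players_search_list.csv.

The URL then just follows the format https://www.baseball-reference.com/register/player.fcgi?id= with the id appended.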

For example, the playerId for James McCann ('mccanja02') is 'mccann002jam'.

Code:

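import io

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Load the full player list; the first column holds the register id
player_df = pd.read_csv('https://www.baseball-reference.com/short/inc/sup_players_search_list.csv', header=None)
player_df = player_df.rename(columns={0:'id'})

playerId = 'mccann002jam'
baseUrl = 'https://www.baseball-reference.com/register/player.fcgi?id='

url = f'{baseUrl}{playerId}'

# Get the data; strip HTML comment markers so tables inside comments are parsed too
response = requests.get(url)
html = response.text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, 'html.parser')

tables_dict = {}
tables = soup.find_all('table')
for table in tables:
    caption = table.find('caption')
    if caption is None:
        continue  # skip tables that have no caption to use as a key
    stat_type = caption.text.strip()
    # Wrap the markup in StringIO; newer pandas versions warn on literal HTML strings
    df = pd.read_html(io.StringIO(str(table)))[0]

    tables_dict[stat_type] = df

for tableName, table in tables_dict.items():
    print(f'\n\n*** {tableName} ***')
    print(table)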

Output:

csv:

print(player_df)
                 id                1          2  3   4   5   6   7        8
0      aardsm001dav    David Aardsma  2004-2015  0 NaN NaN NaN NaN  2025.78
1      aaron-001hen      Henry Aaron  1954-1976  0 NaN NaN NaN NaN  2142.01
2      aaron-001tom     Tommie Aaron  1962-1971  0 NaN NaN NaN NaN  1975.23
3      aase--001don         Don Aase  1977-1990  0 NaN NaN NaN NaN  2018.02
4      abad--001fau        Andy Abad  2001-2006  0 NaN NaN NaN NaN  2008.65
            ...              ...        ... ..  ..  ..  ..  ..      ...
22671  zupo--001fra       Frank Zupo  1957-1961  0 NaN NaN NaN NaN  1963.93
22672  zuvell001pau     Paul Zuvella  1982-1991  0 NaN NaN NaN NaN  1997.64
22673  zuveri001geo  George Zuverink  1951-1959  0 NaN NaN NaN NaN  1975.83
22674  zwilli001edw   Dutch Zwilling  1910-1916  0 NaN NaN NaN NaN  1928.49
22675  zych--001ton        Tony Zych  2015-2017  0 NaN NaN NaN NaN  2021.59

[22676 rows x 9 columns]
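
To go from a player's name to a register id, the list can be searched by its second column. This is a minimal sketch using only the standard library, with a few rows typed in from the frame printed above. The exact layout of the real CSV (quoting, trailing columns) is an assumption, and find_ids is a hypothetical helper.

```python
import csv
import io

BASE_URL = 'https://www.baseball-reference.com/register/player.fcgi?id='

# Stand-in for the downloaded file: id, name, years (remaining columns omitted)
sample_csv = '''aardsm001dav,David Aardsma,2004-2015
aaron-001hen,Henry Aaron,1954-1976
aaron-001tom,Tommie Aaron,1962-1971
'''


def find_ids(text, name):
    """Return the ids of all rows whose name column equals `name`."""
    reader = csv.reader(io.StringIO(text))
    return [row[0] for row in reader if len(row) > 1 and row[1] == name]


ids = find_ids(sample_csv, 'Henry Aaron')
urls = [BASE_URL + player_id for player_id in ids]
print(urls)
```

The helper returns a list rather than a single id because two players can share a name.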

Data for James McCann:

*** Futures Game ***
                   0     1
0  2013 Futures Game  U.S.


*** Register Batting ***
                       Year                      Age  ...   SF  IBB
0                      2010                       20  ...    0    0
1                      2011                       21  ...    2  NaN
2                      2011                       21  ...    1    0
3                      2011                       21  ...    1    0
4                      2011                       21  ...    0    0
5                      2012                       22  ...    3    1
6                      2012                       22  ...    1    0
7                      2012                       22  ...    2    1
8                      2012                       22  ...    0    0
9                      2013                       23  ...    7    1
10                  2013-14                       23  ...    0    0
11                     2014                       24  ...    7    0
12                     2014                       24  ...    0    0
13                     2015                       25  ...    1    0
14                     2016                       26  ...    0    0
15                     2016                       26  ...    3    0
16                     2017                       27  ...    0    0
17                     2017                       27  ...    3    0
18                     2018                       28  ...    2    0
19                     2019                       29  ...    0    1
20                     2020                       30  ...    2    0
21                     2021                       31  ...    3    1
22                     2022                       32  ...    2    0
23                     Year                      Age  ...   SF  IBB
24               (1 season)               (1 season)  ...    0    0
25       Majors (9 seasons)       Majors (9 seasons)  ...   16    2
26       Minors (6 seasons)       Minors (6 seasons)  ...   18    2
27       Foreign (1 season)       Foreign (1 season)  ...    0    0
28       College (1 season)       College (1 season)  ...    2  NaN
29         Other (1 season)         Other (1 season)  ...    0    0
30  All Levels (13 Seasons)  All Levels (13 Seasons)  ...   36    4
31                      NaN                      NaN  ...  NaN  NaN
32          AAA (3 seasons)          AAA (3 seasons)  ...    7    0
33           AA (2 seasons)           AA (2 seasons)  ...    8    1
34            A+ (1 season)            A+ (1 season)  ...    2    1
35             A (1 season)             A (1 season)  ...    1    0
36            Rk (1 season)            Rk (1 season)  ...    0    0

[37 rows x 30 columns]


*** Register Fielding ***
                       Year                      Age  ...  lgCS% PO.1
0                      2010                       20  ...    NaN  NaN
1                      2011                       21  ...    NaN  NaN
2                      2011                       21  ...    NaN  NaN
3                      2011                       21  ...    NaN  NaN
4                      2012                       22  ...    NaN  NaN
5                      2012                       22  ...    NaN  NaN
6                      2012                       22  ...    NaN  NaN
7                      2012                       22  ...    NaN  NaN
8                      2013                       23  ...    NaN  NaN
9                   2013-14                       23  ...    NaN  NaN
10                     2014                       24  ...    NaN  NaN
11                     2014                       24  ...    NaN  NaN
12                     2014                       24  ...    NaN  NaN
13                     2015                       25  ...    NaN  NaN
14                     2016                       26  ...    NaN  NaN
15                     2016                       26  ...    NaN  NaN
16                     2017                       27  ...    NaN  NaN
17                     2017                       27  ...    NaN  NaN
18                     2018                       28  ...    NaN  NaN
19                     2019                       29  ...    NaN  NaN
20                     2020                       30  ...    NaN  NaN
21                     2021                       31  ...    NaN  NaN
22                     2021                       31  ...    NaN  NaN
23                     2022                       32  ...    NaN  NaN
24                     Year                      Age  ...  lgCS%   PO
25               (1 season)               (1 season)  ...    NaN  NaN
26        Majors (1 season)        Majors (1 season)  ...    NaN  NaN
27       Majors (9 seasons)       Majors (9 seasons)  ...    NaN  NaN
28        Minors (1 season)        Minors (1 season)  ...    NaN  NaN
29       Minors (6 seasons)       Minors (6 seasons)  ...    NaN  NaN
30       Foreign (1 season)       Foreign (1 season)  ...    NaN  NaN
31         Other (1 season)         Other (1 season)  ...    NaN  NaN
32    All Levels (1 Season)    All Levels (1 Season)  ...    NaN  NaN
33    All Levels (1 Season)    All Levels (1 Season)  ...    NaN  NaN
34  All Levels (13 Seasons)  All Levels (13 Seasons)  ...    NaN  NaN

[35 rows x 26 columns]


*** Teams Played For ***
    Year  Age                       Tm  ... Stint        From          To
0   2010   20         Cotuit Kettleers  ...   NaN  2010-06-26  2010-08-04
1   2011   21      Arkansas Razorbacks  ...   NaN         NaN         NaN
2   2011   21               GCL Tigers  ...   NaN  2011-08-13  2011-08-19
3   2011   21  West Michigan Whitecaps  ...   NaN  2011-08-21  2011-09-03
4   2012   22   Lakeland Flying Tigers  ...   NaN  2012-04-05  2012-06-04
5   2012   22           Erie SeaWolves  ...   NaN  2012-06-07  2012-09-03
6   2012   22           Mesa Solar Sox  ...   NaN  2012-10-10  2012-11-15
7   2013   23           Erie SeaWolves  ...   NaN  2013-04-04  2013-09-02
8   2013   23      Leones del Escogido  ...   NaN  2013-10-19  2013-11-16
9   2014   24          Toledo Mud Hens  ...   NaN  2014-04-04  2014-08-29
10  2014   24           Detroit Tigers  ...   1.0  2014-09-01  2014-09-27
11  2015   25           Detroit Tigers  ...   1.0  2015-04-08  2015-10-04
12  2016   26           Detroit Tigers  ...   1.0  2016-04-05  2016-10-02
13  2016   26          Toledo Mud Hens  ...   NaN  2016-04-26  2016-05-01
14  2017   27           Detroit Tigers  ...   1.0  2017-04-04  2017-09-30
15  2017   27          Toledo Mud Hens  ...   NaN  2017-06-06  2017-06-07
16  2018   28           Detroit Tigers  ...   1.0  2018-03-30  2018-09-30
17  2019   29        Chicago White Sox  ...   1.0  2019-03-28  2019-09-28
18  2020   30        Chicago White Sox  ...   1.0  2020-07-25  2020-09-26
19  2021   31            New York Mets  ...   1.0  2021-04-05  2021-10-02
20  2022   32            New York Mets  ...   1.0  2022-04-07  2022-05-10

[21 rows x 9 columns]