Web scraping with Python - table with multiple tbody elements

Question

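I am trying to use Python and BeautifulSoup to scrape data from the top table on this page ("2021-2022 Regular Season Player Stats"). The page shows statistics for 100 NHL players, one player per row. The code below works, but it only pulls the first ten rows into the dataframe. This is because each group of ten rows sits in a separate <tbody>, so the loop only iterates over the rows in the first <tbody>. How can I make it continue through the remaining <tbody> elements on the page?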

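Another question: the table has about 1000 rows in total, and each page shows at most 100 of them. Is there a way to rewrite the code below so that it goes through the entire table at once instead of only the 100 rows shown on the page?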

    from bs4 import BeautifulSoup
    import requests
    import pandas as pd
    
    url = 'https://www.eliteprospects.com/league/nhl/stats/2021-2022'

    source = requests.get(url).text
    soup = BeautifulSoup(source,'html.parser')

    table = soup.find('table', class_='table table-striped table-sortable player-stats highlight-stats season')

    df = pd.DataFrame(columns=['Player', 'Team', 'GamesPlayed', 'Goals', 'Assists', 'TotalPoints', 'PointsPerGame', 'PIM', 'PM'])

    for row in table.tbody.find_all('tr'):
        columns = row.find_all('td')

        Player = columns[1].text.strip()
        Team = columns[2].text.strip()
        GamesPlayed = columns[3].text.strip()
        Goals = columns[4].text.strip()
        Assists = columns[5].text.strip()
        TotalPoints = columns[6].text.strip()
        PointsPerGame = columns[7].text.strip()
        PIM = columns[8].text.strip()
        PM = columns[9].text.strip()

        df = df.append({"Player": Player, "Team": Team, "GamesPlayed": GamesPlayed, "Goals": Goals, "Assists": Assists, "TotalPoints": TotalPoints, "PointsPerGame": PointsPerGame, "PIM": PIM, "PM": PM}, ignore_index=True)

Answer 1

To load all player stats into a dataframe and save it to CSV, you can use this example:

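    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    dfs = []
    # The stats are paginated (up to 100 rows per page); pages 1-10 are requested here.
    for page in range(1, 11):
        url = f"https://www.eliteprospects.com/league/nhl/stats/2021-2022?sort=tp&page={page}"
        print(f"Loading {url=}")
        soup = BeautifulSoup(requests.get(url).content, "html.parser")

        # read_html parses the whole table, including every <tbody>.
        # StringIO avoids the literal-HTML deprecation warning in recent pandas versions.
        df = (
            pd.read_html(StringIO(str(soup.select_one(".player-stats"))))[0]
            .dropna(how="all")  # drop rows in which every value is NaN
            .reset_index(drop=True)
        )
        dfs.append(df)

    # Combine all pages into one dataframe and save it.
    df_final = pd.concat(dfs).reset_index(drop=True)
    print(df_final)
    df_final.to_csv("data.csv", index=False)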

Prints:

...

1132  973.0             Austin Poganski (RW)          Winnipeg Jets    16     0     0     0  0.00      7  -3.0
1133  974.0             Mikhail Maltsev (LW)     Colorado Avalanche    18     0     0     0  0.00      2  -5.0
1134  975.0            Mason Geertsen (D/LW)      New Jersey Devils    23     0     0     0  0.00     62  -4.0
1135  976.0                  Jack McBain (C)        Arizona Coyotes     -     -     -     -     -      -   NaN
1136  977.0                Jordan Harris (D)     Montréal Canadiens     -     -     -     -     -      -   NaN
1137  978.0              Nikolai Knyzhov (D)        San Jose Sharks     -     -     -     -     -      -   NaN
1138  979.0              Marc McLaughlin (C)          Boston Bruins     -     -     -     -     -      -   NaN
1139  980.0                Carson Meyer (RW)  Columbus Blue Jackets     -     -     -     -     -      -   NaN
1140  981.0                 Leon Gawanke (D)          Winnipeg Jets     -     -     -     -     -      -   NaN
1141  982.0                 Brady Keeper (D)      Vancouver Canucks     -     -     -     -     -      -   NaN
1142  983.0                  Miles Wood (LW)      New Jersey Devils     -     -     -     -     -      -   NaN
1143  984.0              Samuel Morin (D/LW)    Philadelphia Flyers     -     -     -     -     -      -   NaN
1144  985.0               Connor Carrick (D)         Seattle Kraken     -     -     -     -     -      -   NaN
1145  986.0          Micheal Ferland (LW/RW)      Vancouver Canucks     -     -     -     -     -      -   NaN
1146  987.0                Jake Gardiner (D)    Carolina Hurricanes     -     -     -     -     -      -   NaN
1147  988.0                Oscar Klefbom (D)        Edmonton Oilers     -     -     -     -     -      -   NaN
1148  989.0                   Shea Weber (D)     Montréal Canadiens     -     -     -     -     -      -   NaN
1149  990.0            Brandon Sutter (C/RW)      Vancouver Canucks     -     -     -     -     -      -   NaN
1150  991.0               Brent Seabrook (D)    Tampa Bay Lightning     -     -     -     -     -      -   NaN

and saves data.csv.
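
If you would rather keep the BeautifulSoup loop from the question, the underlying issue is that `table.tbody` returns only the first <tbody>. A minimal sketch of one way around that is to select the rows of every <tbody> with a CSS selector. The HTML below is a made-up miniature of the page's structure (the player and team names are placeholders, and `extract_rows` is a hypothetical helper), so the real column positions may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: one table whose rows are split
# across several <tbody> elements, with an empty spacer row in between.
SAMPLE_HTML = """
<table class="player-stats">
  <thead><tr><th>#</th><th>Player</th><th>Team</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Player A</td><td>Team X</td></tr>
    <tr><td>2</td><td>Player B</td><td>Team Y</td></tr>
  </tbody>
  <tbody>
    <tr><td colspan="3"></td></tr>
    <tr><td>3</td><td>Player C</td><td>Team Z</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return the cell texts of every body row, across all <tbody> elements."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.player-stats")
    rows = []
    # "tbody tr" matches the rows of every <tbody>;
    # table.tbody would give only the first one.
    for tr in table.select("tbody tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if any(cells):  # skip rows with no cells or only empty cells
            rows.append(cells)
    return rows


rows = extract_rows(SAMPLE_HTML)
print(len(rows))
for row in rows:
    print(row)
```

In the question's code this would amount to replacing `table.tbody.find_all('tr')` with `table.select('tbody tr')` and skipping rows that have too few cells before indexing into `columns`.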


python

beautifulsoup

web-scraping

pandas