Web scraping with Python - table with multiple tbody elements
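I'm trying to use Python and BeautifulSoup to scrape data from the top table on this page ("2021-2022 Regular Season Player Stats"). The page shows stats for 100 NHL players, one player per row. The code below works, but it only pulls the first ten rows into the DataFrame. That is because every ten rows sit in a separate <tbody>, so the loop only iterates over the rows in the first <tbody>. How can I make it continue through the remaining <tbody> elements on the page?
Another question: the table has about 1000 rows in total, and each page shows at most 100. Is there a way to rewrite the code below so that it goes through the whole table at once rather than only the 100 rows shown on the page?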
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://www.eliteprospects.com/league/nhl/stats/2021-2022'
source = requests.get(url).text
soup = BeautifulSoup(source,'html.parser')
table = soup.find('table', class_='table table-striped table-sortable player-stats highlight-stats season')
df = pd.DataFrame(columns=['Player', 'Team', 'GamesPlayed', 'Goals', 'Assists', 'TotalPoints', 'PointsPerGame', 'PIM', 'PM'])
for row in table.tbody.find_all('tr'):
    columns = row.find_all('td')
    Player = columns[1].text.strip()
    Team = columns[2].text.strip()
    GamesPlayed = columns[3].text.strip()
    Goals = columns[4].text.strip()
    Assists = columns[5].text.strip()
    TotalPoints = columns[6].text.strip()
    PointsPerGame = columns[7].text.strip()
    PIM = columns[8].text.strip()
    PM = columns[9].text.strip()
    df = df.append({"Player": Player, "Team": Team, "GamesPlayed": GamesPlayed, "Goals": Goals, "Assists": Assists, "TotalPoints": TotalPoints, "PointsPerGame": PointsPerGame, "PIM": PIM, "PM": PM}, ignore_index=True)
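On the first question: `table.tbody` is shorthand for `table.find('tbody')`, so it returns only the first <tbody> and the loop never reaches the others. The following is a minimal sketch of one way around that, using a small inline HTML snippet as a stand-in for the real page (the markup, class name, and player names are assumptions made for illustration). Searching from the <table> itself collects rows from every <tbody>:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: one table, rows split across two <tbody> tags.
html = """
<table class="player-stats">
  <thead><tr><th>#</th><th>Player</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Player A</td></tr>
    <tr><td>2</td><td>Player B</td></tr>
  </tbody>
  <tbody>
    <tr><td>3</td><td>Player C</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.player-stats")

# table.tbody only reaches the first <tbody>, so this misses the third row.
first_only = [
    tr.find_all("td")[1].get_text(strip=True)
    for tr in table.tbody.find_all("tr")
]

# Selecting "tbody tr" from the table visits the rows of every <tbody>.
all_rows = [
    tr.find_all("td")[1].get_text(strip=True)
    for tr in table.select("tbody tr")
]

print(first_only)
print(all_rows)
```

On the real page some rows may be empty separators, so a guard that skips rows with too few <td> cells would probably be needed before indexing into them.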
To load all player stats into a DataFrame and save them to CSV, you can use the following example:
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

dfs = []
for page in range(1, 11):
    url = f"https://www.eliteprospects.com/league/nhl/stats/2021-2022?sort=tp&page={page}"
    print(f"Loading {url=}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # read_html parses the whole <table>, so rows from every <tbody> are included.
    # The HTML is wrapped in StringIO because newer pandas versions deprecate
    # passing a literal HTML string.
    df = (
        pd.read_html(StringIO(str(soup.select_one(".player-stats"))))[0]
        .dropna(how="all")
        .reset_index(drop=True)
    )
    dfs.append(df)

df_final = pd.concat(dfs).reset_index(drop=True)
print(df_final)
df_final.to_csv("data.csv", index=False)
Prints:
...
1132 973.0 Austin Poganski (RW) Winnipeg Jets 16 0 0 0 0.00 7 -3.0
1133 974.0 Mikhail Maltsev (LW) Colorado Avalanche 18 0 0 0 0.00 2 -5.0
1134 975.0 Mason Geertsen (D/LW) New Jersey Devils 23 0 0 0 0.00 62 -4.0
1135 976.0 Jack McBain (C) Arizona Coyotes - - - - - - NaN
1136 977.0 Jordan Harris (D) Montréal Canadiens - - - - - - NaN
1137 978.0 Nikolai Knyzhov (D) San Jose Sharks - - - - - - NaN
1138 979.0 Marc McLaughlin (C) Boston Bruins - - - - - - NaN
1139 980.0 Carson Meyer (RW) Columbus Blue Jackets - - - - - - NaN
1140 981.0 Leon Gawanke (D) Winnipeg Jets - - - - - - NaN
1141 982.0 Brady Keeper (D) Vancouver Canucks - - - - - - NaN
1142 983.0 Miles Wood (LW) New Jersey Devils - - - - - - NaN
1143 984.0 Samuel Morin (D/LW) Philadelphia Flyers - - - - - - NaN
1144 985.0 Connor Carrick (D) Seattle Kraken - - - - - - NaN
1145 986.0 Micheal Ferland (LW/RW) Vancouver Canucks - - - - - - NaN
1146 987.0 Jake Gardiner (D) Carolina Hurricanes - - - - - - NaN
1147 988.0 Oscar Klefbom (D) Edmonton Oilers - - - - - - NaN
1148 989.0 Shea Weber (D) Montréal Canadiens - - - - - - NaN
1149 990.0 Brandon Sutter (C/RW) Vancouver Canucks - - - - - - NaN
1150 991.0 Brent Seabrook (D) Tampa Bay Lightning - - - - - - NaN
and saves data.csv (screenshot from LibreOffice):
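The loop in the answer hard-codes `range(1, 11)`. If the number of pages is not known in advance, one option is to keep requesting pages until one comes back with no rows. The sketch below shows only that stopping logic. `fetch_page` is a hypothetical callable standing in for the request-and-parse step, and the fake data is invented for illustration; whether the site really returns an empty table past the last page is an assumption that has not been checked here.

```python
from itertools import count
from typing import Callable, Iterator, List


def iter_all_rows(
    fetch_page: Callable[[int], List[dict]], max_pages: int = 50
) -> Iterator[dict]:
    """Yield rows page by page until a page is empty or max_pages is reached."""
    for page in count(1):
        if page > max_pages:  # safety cap so a misbehaving site cannot loop forever
            break
        rows = fetch_page(page)
        if not rows:  # an empty page is taken to mean we ran past the last one
            break
        yield from rows


# Fake two-page data source (hypothetical), used instead of real HTTP requests.
FAKE_PAGES = {
    1: [{"Player": "A"}, {"Player": "B"}],
    2: [{"Player": "C"}],
}


def fake_fetch(page: int) -> List[dict]:
    """Return the rows for a page, or an empty list past the last page."""
    return FAKE_PAGES.get(page, [])


rows = list(iter_all_rows(fake_fetch))
print(len(rows))
```

In the real scraper, `fetch_page` would wrap the `requests.get` and `pd.read_html` step and return that page's rows. The `max_pages` cap keeps the loop bounded if the site never returns an empty page.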