Appending elements of a list into a multi-dimensional list

Hello, I'm doing web scraping in Python with NBA data from this page. Some elements of basketball-reference are easy to scrape, but this one is giving me trouble because of my limited Python knowledge.

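I'm able to get the data and column headers I want, but I end up with 2 lists of data that I need to combine by index (I think?), so that index 0 of player_injury_info lines up with index 0 of player_names, and so on. I don't know how to do that.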

Below I've pasted some code you can follow along with.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime, timezone, timedelta

url = "https://www.basketball-reference.com/friv/injuries.fcgi"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

# this correctly gives me the 4 column headers i want (Player, Team, Update, Description)
headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]

# 2 lists - player_injury_info and player_names.  they need to be combined.
rows = soup.findAll('tr')
player_injury_info = [[td.getText() for td in rows[i].findAll('td')]
            for i in range(len(rows))]
player_injury_info = player_injury_info[1:] # removing first element bc dont need it

player_names = [[th.getText() for th in rows[i].findAll('th')]
            for i in range(len(rows))]
player_names = player_names[1:]             # removing first element bc dont need it

### joining the lists in the correct order- the part i dont know how to do
player_list = player_names.append(player_injury_info)

### this should give me the data frame i want if i can get player_injury_info into the right format.
injury_data = pd.DataFrame(player_injury_info, columns = headers)

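Is there perhaps an easier way to scrape the data into a single list/data frame? Or is joining the 2 lists together, as I'm trying to do, fine? Either way, if anyone can follow along and offer a solution, I'd appreciate it!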

I think you want this (a list of tuples), using zip:

players = ["joe", "bill"]
injuries = ["tooth-ache", "mental break"]
list(zip(players, injuries))

Result:

[('joe', 'tooth-ache'), ('bill', 'mental break')]
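For the lists in the question, each element is itself a list (one inner list per table row), so the pairs produced by zip can be concatenated into single rows. Note that `player_names.append(player_injury_info)` returns None and adds the whole second list as one extra element, which is why that line does not give the combined list. A minimal sketch, assuming both lists have one entry per row in the same order; the sample values below are made up for illustration:

```python
# Hypothetical sample data shaped like the two lists in the question:
# one inner list per table row.
player_names = [["Joe"], ["Bill"]]
player_injury_info = [
    ["AAA", "Mon", "Out (Tooth)"],
    ["BBB", "Tue", "Out (Knee)"],
]

# zip pairs the inner lists by position; + joins each pair into one row.
player_list = [name + info for name, info in zip(player_names, player_injury_info)]
print(player_list)
```

zip stops at the end of the shorter list, so rows without a partner would be dropped silently if the two lists ever differed in length.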

Let pandas parse the table for you.

import pandas as pd

url = "https://www.basketball-reference.com/friv/injuries.fcgi"
injury_data = pd.read_html(url)[0]

Output:

print(injury_data)
              Player  ...                                        Description
0     Onyeka Okongwu  ...  Out (Shoulder) - The Hawks announced that Okon...
1       Jaylen Brown  ...  Out (Wrist) - The Celtics announced that Brown...
2         Coby White  ...  Out (Shoulder) - The Bulls announced that Whit...
3     Taurean Prince  ...  Out (Ankle) - The Cavaliers announced F Taurea...
4       Jamal Murray  ...  Out (Knee) - Murray is recovering from a torn ...
5      Klay Thompson  ...  Out (Right Achilles) - Thompson is on track to...
6      James Wiseman  ...  Out (Knee) - Wiseman is on track to be ready b...
7        T.J. Warren  ...  Out (Foot) - Warren underwent foot surgery and...
8        Serge Ibaka  ...  Out (Back) - The Clippers announced Serge Ibak...
9      Kawhi Leonard  ...  Out (Knee) - The Clippers announced Kawhi Leon...
10    Victor Oladipo  ...  Out (Knee) - Oladipo could be cleared for full...
11  Donte DiVincenzo  ...  Out (Foot) - DiVincenzo suffered a tendon inju...
12    Jarrett Culver  ...  Out (Ankle) - The Timberwolves announced Culve...
13    Markelle Fultz  ...  Out (Knee) - Fultz will miss the rest of the s...
14    Jonathan Isaac  ...  Out (Knee) - Isaac is making progress with his...
15       Dario Šarić  ...  Out (Knee) - The Suns announced that Sario has...
16      Zach Collins  ...  Out (Ankle) - The Blazers announced that Colli...
17     Pascal Siakam  ...  Out (Shoulder) - The Raptors announced Pascal ...
18       Deni Avdija  ...  Out (Leg) - The Wizards announced that Avdija ...
19     Thomas Bryant  ...  Out (Left knee) - The Wizards announced that B...

[20 rows x 4 columns]

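But if you want to iterate over it yourself, I would simply grab the rows (<tr> tags), then get the player name from the <a> tag and combine it with that row's <td> tags. Then create your data frame from those lists: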

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.basketball-reference.com/friv/injuries.fcgi"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# the first row holds the 4 column headers (Player, Team, Update, Description)
headers = [th.get_text() for th in soup.find_all('tr', limit=2)[0].find_all('th')]

trs = soup.find_all('tr')[1:]  # skip the header row
rows = []
for tr in trs:
    link = tr.find('a')
    if link is None:  # skip any row without a player link
        continue
    # player name first, then the remaining <td> cells of the same row
    data = [link.text] + [x.text for x in tr.find_all('td')]
    rows.append(data)

injury_data = pd.DataFrame(rows, columns=headers)
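To try the row-by-row idea without hitting the site, here is a rough offline sketch. It swaps BeautifulSoup for the standard library's html.parser, and the SAMPLE_HTML markup and the RowCollector class are made up for illustration; they only imitate the layout described above (name in a <th> wrapped in <a>, other fields in <td> cells), not the real page.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, not the real page.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>Team</th><th>Update</th><th>Description</th></tr>
  <tr><th><a href="#">Joe</a></th><td>AAA</td><td>Mon</td><td>Out (Knee)</td></tr>
  <tr><th><a href="#">Bill</a></th><td>BBB</td><td>Tue</td><td>Out (Foot)</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text pieces of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell (including nested <a> text).
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = RowCollector()
parser.feed(SAMPLE_HTML)

# First row is the header row; the rest are complete data rows.
headers, *rows = parser.rows
print(headers)
print(rows)
```

Because the name cell and the data cells are gathered in the same pass over each row, there are never two separate lists to line up afterwards; `rows` and `headers` could be passed to pd.DataFrame in the same way as above.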