Faster way to make pandas Multiindex dataframe than append

I am looking for a faster way to load data from my JSON object into a MultiIndex DataFrame.

My JSON looks like this:

    {
        "1990-1991": {
            "Cleveland": {
                "salary": ",403,000",
                "players": {
                    "Hot Rod Williams": ",785,000",
                    "Danny Ferry": ",640,000",
                    "Mark Price": ",400,000",
                    "Brad Daugherty": ",320,000",
                    "Larry Nance": ",260,000",
                    "Chucky Brown": "0,000",
                    "Steve Kerr": "8,000",
                    "Derrick Chievous": "5,000",
                    "Winston Bennett": "5,000",
                    "John Morton": "0,000",
                    "Milos Babic": "0,000",
                    "Gerald Paddio": "0,000",
                    "Darnell Valentine": "0,000",
                    "Henry James": ",000"
                },
                "url": "https://hoopshype.com/salaries/cleveland_cavaliers/1990-1991/"
            },

This is how I am building the DataFrame:

    df = pd.DataFrame(columns=["year", "team", "player", "salary"])
    
    for year in nbaSalaryData.keys():
        for team in nbaSalaryData[year]:
            for player in nbaSalaryData[year][team]['players']:
                df = df.append({
                        "year": year,
                        "team": team,
                        "player": player,
                        "salary": nbaSalaryData[year][team]['players'][player]
                    }, ignore_index=True)
    
    df = df.set_index(['year', 'team', 'player']).sort_index()
    df

This results in:

                                              salary 
    year       team     player
    1990-1991  Atlanta  Doc Rivers          5,000
                        Dominique Wilkins   ,065,000
                        Gary Leonard        0,000
                        John Battle         0,000
                        Kevin Willis        5,000
    ... ... ... ...
    2020-2021   Washington  Robin Lopez     ,300,000
                        Rui Hachimura       ,692,840
                        Russell Westbrook   ,358,814
                        Thomas Bryant       ,333,333
                        Troy Brown          ,372,840

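This is the form I want: year, team, and player as the index, and salary as a column. I know that using append is slow, but I can't figure out an alternative. I tried to do it with tuples (in a slightly different configuration, without players and salaries), but I could not get it to work.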

    tuples = []
    index = None

    for year in nbaSalaryData.keys():
        for team in nbaSalaryData[year]:
            t = nbaSalaryData[year][team]
            tuples.append((year, team))

    index = pd.MultiIndex.from_tuples(tuples, names=["year", "team"])
    df = index.to_frame()
    df

Output:

                             year   team
    year    team        
    1990-1991   Cleveland   1990-1991   Cleveland
                New York    1990-1991   New York
                Detroit     1990-1991   Detroit
                LA Lakers   1990-1991   LA Lakers
                Atlanta     1990-1991   Atlanta  

I'm not very familiar with pandas, but I realize there must be a faster way than append().

You can adapt the answer to a very similar question, as follows:

    import json
    import pandas as pd

    # json_data is the raw JSON text from the question
    z = json.loads(json_data)

    # Flatten the nested dict into {(year, team, player): salary}
    out = pd.Series({
        (i,j,m): z[i][j][k][m]
        for i in z
        for j in z[i]
        for k in ['players']
        for m in z[i][j][k]
    }).to_frame('salary').rename_axis('year team player'.split())

    # out:

                                               salary
    year      team      player                       
    1990-1991 Cleveland Hot Rod Williams   ,785,000
                        Danny Ferry        ,640,000
                        Mark Price         ,400,000
                        Brad Daugherty     ,320,000
                        Larry Nance        ,260,000
                        Chucky Brown         0,000
                        Steve Kerr           8,000
                        Derrick Chievous     5,000
                        Winston Bennett      5,000
                        John Morton          0,000
                        Milos Babic          0,000
                        Gerald Paddio        0,000
                        Darnell Valentine    0,000
                        Henry James           ,000

Also, if you intend to do some numerical analysis on those salaries, you probably want them as numbers, not strings. If so, also consider:

    out['salary'] = pd.to_numeric(out['salary'].str.replace(r'\D', '', regex=True))
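
A minimal sketch of that conversion on its own, assuming the salary strings look like "$1,785,000". The three sample values below are made up for illustration and are not taken from the data set:

```python
import pandas as pd

# Hypothetical salary strings; the real ones come from the JSON in the question.
sample = pd.DataFrame({"salary": ["$1,785,000", "$640,000", "$5,000"]})

# Remove every non-digit character, then parse the remaining digits as numbers.
sample["salary"] = pd.to_numeric(
    sample["salary"].str.replace(r"\D", "", regex=True)
)

print(sample["salary"].tolist())  # [1785000, 640000, 5000]
```

Passing `regex=True` explicitly matters here: recent pandas versions treat the pattern as a literal string by default, in which case `\D` would match nothing and `pd.to_numeric` would fail on the leftover punctuation.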

PS: Explanation:

The `for` lines are just one big comprehension that flattens the nested dict. To see how it works, first try:

    [
        (i,j)
        for i in z
        for j in z[i]
    ]

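A third `for` over `z[i][j]` would list all of its keys, that is `['salary', 'players', 'url']`, but we are only interested in `'players'`, so we say so by iterating over `['players']` instead.

The final bit is that we want a dict instead of a list. Try the expression without the surrounding `pd.Series()` and you will see exactly what is going on.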
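
To make that concrete, here is a small sketch on a trimmed-down stand-in for the JSON. The `z` below, its placeholder salary strings, and the example.com URLs are invented for illustration. It shows the plain dict the comprehension produces before it is handed to `pd.Series`:

```python
# Hypothetical miniature of the nested structure: year -> team -> fields.
z = {
    "1990-1991": {
        "Cleveland": {
            "salary": "total",
            "players": {"Mark Price": "a", "Steve Kerr": "b"},
            "url": "https://example.com/",
        },
        "Atlanta": {
            "salary": "total",
            "players": {"Doc Rivers": "c"},
            "url": "https://example.com/",
        },
    }
}

# Same comprehension as in the answer, minus pd.Series():
# each key is a (year, team, player) tuple, each value is that player's salary.
flat = {
    (i, j, m): z[i][j][k][m]
    for i in z
    for j in z[i]
    for k in ["players"]
    for m in z[i][j][k]
}

for key, value in flat.items():
    print(key, value)
```

Because every key is a 3-tuple, `pd.Series(flat)` builds a three-level MultiIndex from the keys, which is what `rename_axis` then labels as year, team, and player.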

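We can use a for loop to build the DataFrames and collect them in a list, then concatenate once at the end. Deferring the concatenation until the end is much better than appending DataFrames inside the loop.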

    import pandas as pd

    box = []
    # data refers to the shared json in the question
    for year, value in data.items():
        for team, players in value.items():
            # one small frame per (year, team), indexed by player
            content = players["players"]
            content = pd.DataFrame.from_dict(
                content, orient="index", columns=["salary"]
            ).rename_axis(index="player")
            content = content.assign(year=year, team=team)
            box.append(content)

    box

    [                       salary       year       team
     player                                             
     Hot Rod Williams   ,785,000  1990-1991  Cleveland
     Danny Ferry        ,640,000  1990-1991  Cleveland
     Mark Price         ,400,000  1990-1991  Cleveland
     Brad Daugherty     ,320,000  1990-1991  Cleveland
     Larry Nance        ,260,000  1990-1991  Cleveland
     Chucky Brown         0,000  1990-1991  Cleveland
     Steve Kerr           8,000  1990-1991  Cleveland
     Derrick Chievous     5,000  1990-1991  Cleveland
     Winston Bennett      5,000  1990-1991  Cleveland
     John Morton          0,000  1990-1991  Cleveland
     Milos Babic          0,000  1990-1991  Cleveland
     Gerald Paddio        0,000  1990-1991  Cleveland
     Darnell Valentine    0,000  1990-1991  Cleveland
     Henry James           ,000  1990-1991  Cleveland]

Concatenate and reorder the index levels:

    (
        pd.concat(box)
        .set_index(["year", "team"], append=True)
        .reorder_levels(["year", "team", "player"])
    )
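
For reference, the two steps can be run end to end on a tiny stand-in for the JSON. This is only a sketch: the `data` dict and its placeholder salaries are invented, and the trailing `sort_index()` is an addition (not part of the code above) to get the sorted layout shown in the question.

```python
import pandas as pd

# Hypothetical miniature of the JSON from the question (placeholder salaries).
data = {
    "1990-1991": {
        "Cleveland": {"players": {"Mark Price": "$1", "Steve Kerr": "$2"}},
        "Atlanta": {"players": {"Doc Rivers": "$3"}},
    },
    "1991-1992": {
        "Cleveland": {"players": {"Mark Price": "$4"}},
    },
}

# Build one small frame per (year, team) and collect them in a list.
box = []
for year, teams in data.items():
    for team, info in teams.items():
        frame = pd.DataFrame.from_dict(
            info["players"], orient="index", columns=["salary"]
        ).rename_axis(index="player")
        box.append(frame.assign(year=year, team=team))

# Concatenate once, move year/team into the index, and sort.
result = (
    pd.concat(box)
    .set_index(["year", "team"], append=True)
    .reorder_levels(["year", "team", "player"])
    .sort_index()
)

print(result)
```

The loop itself only appends to a Python list, which is cheap; the single `pd.concat` call at the end is where the rows are actually combined.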