Faster way to make pandas MultiIndex dataframe than append
I'm looking for a faster way to load data from my JSON object into a MultiIndex dataframe.
My JSON looks like this:
{
    "1990-1991": {
        "Cleveland": {
            "salary": ",403,000",
            "players": {
                "Hot Rod Williams": ",785,000",
                "Danny Ferry": ",640,000",
                "Mark Price": ",400,000",
                "Brad Daugherty": ",320,000",
                "Larry Nance": ",260,000",
                "Chucky Brown": "0,000",
                "Steve Kerr": "8,000",
                "Derrick Chievous": "5,000",
                "Winston Bennett": "5,000",
                "John Morton": "0,000",
                "Milos Babic": "0,000",
                "Gerald Paddio": "0,000",
                "Darnell Valentine": "0,000",
                "Henry James": ",000"
            },
            "url": "https://hoopshype.com/salaries/cleveland_cavaliers/1990-1991/"
        },
I'm building the dataframe like this:
import pandas as pd

df = pd.DataFrame(columns=["year", "team", "player", "salary"])

for year in nbaSalaryData.keys():
    for team in nbaSalaryData[year]:
        for player in nbaSalaryData[year][team]['players']:
            # One new DataFrame per row: this is the slow part.
            # (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.)
            df = df.append({
                "year": year,
                "team": team,
                "player": player,
                "salary": nbaSalaryData[year][team]['players'][player]
            }, ignore_index=True)

df = df.set_index(['year', 'team', 'player']).sort_index()
df
This results in:
                                        salary
year      team       player
1990-1991 Atlanta    Doc Rivers         5,000
                     Dominique Wilkins  ,065,000
                     Gary Leonard       0,000
                     John Battle        0,000
                     Kevin Willis       5,000
...       ...        ...                ...
2020-2021 Washington Robin Lopez        ,300,000
                     Rui Hachimura      ,692,840
                     Russell Westbrook  ,358,814
                     Thomas Bryant      ,333,333
                     Troy Brown         ,372,840
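This is the form I want: year, team, and player as the index, and salary as a column. I know that using append is slow, but I can't figure out an alternative. I tried building it from tuples (with a slightly different configuration, without players and salaries), but I couldn't get it to work.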
tuples = []
index = None

for year in nbaSalaryData.keys():
    for team in nbaSalaryData[year]:
        t = nbaSalaryData[year][team]
        tuples.append((year, team))

index = pd.MultiIndex.from_tuples(tuples, names=["year", "team"])
df = index.to_frame()
df
Output:
                          year       team
year      team
1990-1991 Cleveland  1990-1991  Cleveland
          New York   1990-1991   New York
          Detroit    1990-1991    Detroit
          LA Lakers  1990-1991  LA Lakers
          Atlanta    1990-1991    Atlanta
I'm not very familiar with pandas, but I realize there must be a faster way than append().
You can adapt the answer to a very similar question as follows:
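import json

import pandas as pd

z = json.loads(json_data)

# Flatten the nested dict into {(year, team, player): salary};
# pd.Series turns the tuple keys into a MultiIndex.
out = pd.Series({
    (i, j, m): z[i][j][k][m]
    for i in z
    for j in z[i]
    for k in ['players']
    for m in z[i][j][k]
}).to_frame('salary').rename_axis('year team player'.split())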
# out:
                                       salary
year      team      player
1990-1991 Cleveland Hot Rod Williams   ,785,000
                    Danny Ferry        ,640,000
                    Mark Price         ,400,000
                    Brad Daugherty     ,320,000
                    Larry Nance        ,260,000
                    Chucky Brown       0,000
                    Steve Kerr         8,000
                    Derrick Chievous   5,000
                    Winston Bennett    5,000
                    John Morton        0,000
                    Milos Babic        0,000
                    Gerald Paddio      0,000
                    Darnell Valentine  0,000
                    Henry James        ,000
Also, if you plan to do any numerical analysis on these salaries, you probably want them as numbers rather than strings. If so, also consider:
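# regex=True is required in pandas 2.0+, where str.replace matches literally by default
out['salary'] = pd.to_numeric(out['salary'].str.replace(r'\D', '', regex=True))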
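As a small illustration of that conversion, here is a sketch on a made-up Series. The sample strings (with a leading dollar sign and thousands separators) are an assumption about how the raw values look, not data taken from the question.

```python
import pandas as pd

# Hypothetical sample values; the real salary strings may be formatted differently.
sample = pd.Series(["$1,000,000", "$250,000", "$75,000"], name="salary")

# Remove every non-digit character, then convert the remaining digits to integers.
numeric = pd.to_numeric(sample.str.replace(r"\D", "", regex=True))

print(numeric.tolist())
print(numeric.dtype)
```

Once the column is numeric, aggregations such as a per-team sum or mean behave as expected instead of operating on strings.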
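PS: Explanation:
The for lines are just one big comprehension that flattens the nested dict. To see how it works, first try:
[
    (i, j)
    for i in z
    for j in z[i]
]
A third for over z[i][j] would list all of its keys, i.e. ['salary', 'players', 'url'], but we are only interested in 'players', so we say so.
The final point is that we want a dict rather than a list. Try the expression without the surrounding pd.Series() and you will see exactly what is going on.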
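To make those steps concrete, here is a minimal sketch that runs the comprehension on a tiny nested dict. The team and player names and the values are hypothetical placeholders that only mimic the year / team / players layout of the question's JSON.

```python
import pandas as pd

# Hypothetical miniature of the nested JSON: year -> team -> {salary, players, url}.
z = {
    "1990-1991": {
        "TeamA": {"salary": "30", "players": {"Ann": "10", "Bob": "20"}, "url": "n/a"},
        "TeamB": {"salary": "5", "players": {"Cy": "5"}, "url": "n/a"},
    }
}

# Step 1: two nested for clauses give a list of (year, team) pairs.
pairs = [(i, j) for i in z for j in z[i]]

# Step 2: a dict comprehension keyed by (year, team, player) tuples.
flat = {
    (i, j, m): z[i][j]["players"][m]
    for i in z
    for j in z[i]
    for m in z[i][j]["players"]
}

# Step 3: pd.Series turns the tuple keys into a three-level MultiIndex.
out = pd.Series(flat).to_frame("salary").rename_axis(["year", "team", "player"])

print(pairs)
print(flat)
print(out)
```

Printing flat on its own shows the plain dict that pd.Series() receives, which is the intermediate result the explanation above suggests inspecting.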
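We can build the dataframes in a for loop, collect them in a list, and concatenate once at the end. Deferring the concatenation to the end is much better than appending dataframes inside the loop.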
box = []
# data refers to the shared json in the question
for year, value in data.items():
    for team, players in value.items():
        content = players["players"]
        content = pd.DataFrame.from_dict(
            content, orient="index", columns=["salary"]
        ).rename_axis(index="player")
        content = content.assign(year=year, team=team)
        box.append(content)

box
[                     salary       year       team
 player
 Hot Rod Williams   ,785,000  1990-1991  Cleveland
 Danny Ferry        ,640,000  1990-1991  Cleveland
 Mark Price         ,400,000  1990-1991  Cleveland
 Brad Daugherty     ,320,000  1990-1991  Cleveland
 Larry Nance        ,260,000  1990-1991  Cleveland
 Chucky Brown          0,000  1990-1991  Cleveland
 Steve Kerr            8,000  1990-1991  Cleveland
 Derrick Chievous      5,000  1990-1991  Cleveland
 Winston Bennett       5,000  1990-1991  Cleveland
 John Morton           0,000  1990-1991  Cleveland
 Milos Babic           0,000  1990-1991  Cleveland
 Gerald Paddio         0,000  1990-1991  Cleveland
 Darnell Valentine     0,000  1990-1991  Cleveland
 Henry James            ,000  1990-1991  Cleveland]
Concatenate and reorder the index levels:
(
    pd.concat(box)
    .set_index(["year", "team"], append=True)
    .reorder_levels(["year", "team", "player"])
)
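For anyone who wants to run the whole approach end to end, here is a self-contained sketch on a small made-up dataset. The dict contents are hypothetical stand-ins for the question's JSON, and the final sort_index() is an optional addition to match the sorted result shown in the question.

```python
import pandas as pd

# Hypothetical stand-in for the question's JSON (two seasons, toy values).
data = {
    "1990-1991": {
        "TeamA": {"salary": "30", "players": {"Ann": "10", "Bob": "20"}, "url": "n/a"},
    },
    "1991-1992": {
        "TeamA": {"salary": "15", "players": {"Ann": "15"}, "url": "n/a"},
    },
}

frames = []
for year, teams in data.items():
    for team, info in teams.items():
        # One small frame per team: players as the index, salary as the only column.
        frame = pd.DataFrame.from_dict(
            info["players"], orient="index", columns=["salary"]
        ).rename_axis(index="player")
        frames.append(frame.assign(year=year, team=team))

# A single concat at the end, then move year and team into the index.
result = (
    pd.concat(frames)
    .set_index(["year", "team"], append=True)
    .reorder_levels(["year", "team", "player"])
    .sort_index()
)

print(result)
```

The loop only builds small per-team frames and appends them to a Python list, which is cheap; the one expensive copy happens in the single pd.concat call.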