Transform a Dictionary of Tuples in a List into a Pandas Dataframe
I have a dictionary whose values are lists of tuples, and I'm having trouble converting it into a pandas DataFrame.
My data looks like this:
{0: [('A1', 0.0037505763997138838),
('A2', 0.0036963076240675245),
('A3', 0.0035451257931104485),
('A4', 0.003501467316849233),
('A5', 0.00343229837150675),
('A6', 0.0033731723637910062),
('A7', 0.0033713118048861465),
('A8', 0.003325231288305062),
('A9', 0.002885164987475754),
('A10', 0.0028834984584371797)],
1: [('B1', 0.011094831353420088),
('B2', 0.009526049091086916),
('B3', 0.007002935827927014),
('B4', 0.00511673700015512),
('B5', 0.004870300921667765),
('B6', 0.004496108376557714),
('B7', 0.004230892962061271),
('B8', 0.004137434850455194),
('B9', 0.003958335393193675),
('B10', 0.0038285145788315993)]}
I would like to convert it into the following in Pandas:
num label probs
0 A1 0.0037505763997138838
0 A2 0.0036963076240675245
0 A3 0.0035451257931104485
0 A4 0.003501467316849233
0 A5 0.00343229837150675
0 A6 0.0033731723637910062
0 A7 0.0033713118048861465
0 A8 0.003325231288305062
0 A9 0.002885164987475754
0 A10 0.0028834984584371797
1 B1 0.011094831353420088
1 B2 0.009526049091086916
1 B3 0.007002935827927014
1 B4 0.00511673700015512
1 B5 0.004870300921667765
1 B6 0.004496108376557714
1 B7 0.004230892962061271
1 B8 0.004137434850455194
1 B9 0.003958335393193675
1 B10 0.0038285145788315993
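Your dictionary needs a bit of reshaping first. Here I use itertools.chain to combine the values into one flat list, and np.repeat to repeat each key once per tuple (d is the dictionary):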
from itertools import chain
import pandas as pd
import numpy as np
df = (pd.DataFrame(list(chain(*d.values())),
                   columns=['label', 'probs'],
                   index=np.repeat(list(d), list(map(len, d.values()))))
        .rename_axis('num')
        .reset_index()
     )
Output:
num label probs
0 0 A1 0.003751
1 0 A2 0.003696
2 0 A3 0.003545
3 0 A4 0.003501
4 0 A5 0.003432
...
17 1 B8 0.004137
18 1 B9 0.003958
19 1 B10 0.003829
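To see why the index lines up with the rows, it may help to look at the two pieces separately. This is only a minimal sketch: the dictionary below is a shortened, made-up sample with the same shape as the data in the question, and the names rows and nums are just illustrative.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Shortened, hypothetical sample with the same shape as the question's data
d = {0: [('A1', 0.0037), ('A2', 0.0036)],
     1: [('B1', 0.0110), ('B2', 0.0095), ('B3', 0.0070)]}

# chain(*d.values()) flattens the per-key lists into one list of tuples
rows = list(chain(*d.values()))

# np.repeat repeats each key once for every tuple in its list,
# so nums has exactly one entry per row
nums = np.repeat(list(d), [len(v) for v in d.values()])

df = pd.DataFrame(rows, columns=['label', 'probs'], index=nums)
df = df.rename_axis('num').reset_index()
print(df)
```

Because the lengths are taken per key, this should also work when the lists are not all the same size, as in the sample above.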
We can use a list comprehension to build a list of triples (name, label, and probs), and then you can easily create the DataFrame from that list:
c = ['name', 'label', 'probs']
pd.DataFrame([(k, *t) for k, v in d.items() for t in v], columns=c)
name label probs
0 0 A1 0.003751
1 0 A2 0.003696
2 0 A3 0.003545
3 0 A4 0.003501
4 0 A5 0.003432
5 0 A6 0.003373
6 0 A7 0.003371
7 0 A8 0.003325
8 0 A9 0.002885
9 0 A10 0.002883
10 1 B1 0.011095
11 1 B2 0.009526
12 1 B3 0.007003
13 1 B4 0.005117
14 1 B5 0.004870
15 1 B6 0.004496
16 1 B7 0.004231
17 1 B8 0.004137
18 1 B9 0.003958
19 1 B10 0.003829
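The question asked for a first column called num rather than name, so here is the same comprehension with that label and with the tuple unpacked by name. Treat it as a sketch: the dictionary is a shortened, made-up sample, and the loop variable names are only illustrative.

```python
import pandas as pd

# Shortened, hypothetical sample with the same shape as the question's data
d = {0: [('A1', 0.0037), ('A2', 0.0036)],
     1: [('B1', 0.0110), ('B2', 0.0095), ('B3', 0.0070)]}

# One (num, label, prob) triple per tuple, with the dict key repeated per row
rows = [(num, label, prob)
        for num, pairs in d.items()
        for label, prob in pairs]

df = pd.DataFrame(rows, columns=['num', 'label', 'probs'])
print(df)
```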
You can try the following (assuming data is the name of the dictionary):
df = (pd.Series(data)
        .explode()
        .apply(pd.Series)
        .reset_index()
     )
df.columns = ['num', 'label', 'probs']
Result:
print(df)
num label probs
0 0 A1 0.003751
1 0 A2 0.003696
2 0 A3 0.003545
3 0 A4 0.003501
4 0 A5 0.003432
5 0 A6 0.003373
6 0 A7 0.003371
7 0 A8 0.003325
8 0 A9 0.002885
9 0 A10 0.002883
10 1 B1 0.011095
11 1 B2 0.009526
12 1 B3 0.007003
13 1 B4 0.005117
14 1 B5 0.004870
15 1 B6 0.004496
16 1 B7 0.004231
17 1 B8 0.004137
18 1 B9 0.003958
19 1 B10 0.003829
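Alternatively, you can use pd.DataFrame() in place of the second pd.Series (the one passed to .apply) for better performance (thanks to @anky for the suggestion), as follows: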
s = pd.Series(data).explode()
df = (pd.DataFrame(s.tolist(), columns=['label', 'probs'], index=s.index)
        .rename_axis(index='num')
        .reset_index()
     )
Result:
print(df)
num label probs
0 0 A1 0.003751
1 0 A2 0.003696
2 0 A3 0.003545
3 0 A4 0.003501
4 0 A5 0.003432
5 0 A6 0.003373
6 0 A7 0.003371
7 0 A8 0.003325
8 0 A9 0.002885
9 0 A10 0.002883
10 1 B1 0.011095
11 1 B2 0.009526
12 1 B3 0.007003
13 1 B4 0.005117
14 1 B5 0.004870
15 1 B6 0.004496
16 1 B7 0.004231
17 1 B8 0.004137
18 1 B9 0.003958
19 1 B10 0.003829
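If you want to convince yourself that the explode-based version and the plain comprehension give the same rows, a quick check along these lines may help. It is only a sketch on a shortened, made-up dictionary; the names by_comprehension, by_explode, and same are illustrative.

```python
import pandas as pd

# Shortened, hypothetical sample with the same shape as the question's data
data = {0: [('A1', 0.0037), ('A2', 0.0036)],
        1: [('B1', 0.0110), ('B2', 0.0095), ('B3', 0.0070)]}

# Comprehension: one (num, label, probs) triple per tuple
by_comprehension = pd.DataFrame(
    [(k, *t) for k, v in data.items() for t in v],
    columns=['num', 'label', 'probs'])

# explode() turns each list into one row per tuple and repeats the key
# in the index; the tuples are then split into columns by pd.DataFrame
s = pd.Series(data).explode()
by_explode = (pd.DataFrame(s.tolist(), columns=['label', 'probs'], index=s.index)
                .rename_axis(index='num')
                .reset_index())

# Compare row by row as plain Python records
same = by_comprehension.to_dict('records') == by_explode.to_dict('records')
print(same)
```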