Convert a nested dictionary with tuples as keys to a dataframe
I have the following dictionary:
user_dict = {'user1': {'id1': {('word1', 'word2'): 0.99, ('word3', 'word4'): 0.16},
'id2': {('word5', 'word6'): 0.73, ('word7', 'word8'): 0.69}},
'user2': {'id3': {('word9', 'word10'): 0.59, ('word11', 'word12'): 0.13},
'id4': {('word13', 'word14'): 0.41, ('word14', 'word15'): 0.74}}}
For my purposes, I want to convert the nested dictionary into a pandas DataFrame of this form:
user  | id  | w1    | w2    | score
---------------------------------------
user1 | id1 | word1 | word2 | 0.99
      |     | word3 | word4 | 0.16
      | id2 | word5 | word6 | 0.73
and so on.
I have tried several approaches; this is my current solution:
df = pd.Series({(i,j): user_dict[i][j]
for i in user_dict.keys()
for j in user_dict[i].keys()}).rename_axis(['user', 'id']).reset_index(name='Col3')
The output is:
user  | id  | Col3
-------------------------------------------------------------------
user1 | id1 | {('word1', 'word2'): 0.99, ('word3', 'word4'): 0.16}
user1 | id2 | {('word5', 'word6'): 0.73, ('word7', 'word8'): 0.69}
and so on.
Can anyone tell me what I am doing wrong with the last column?
You can use a nested comprehension (here written as a generator expression) that walks all three levels of the dictionary:
df = pd.DataFrame(([k0, k1, *k2, d2]
for k0, d0 in user_dict.items()
for k1, d1 in d0.items()
for k2, d2 in d1.items()
), columns=['user', 'id', 'w1', 'w2', 'score'])
Output:
user id w1 w2 score
0 user1 id1 word1 word2 0.99
1 user1 id1 word3 word4 0.16
2 user1 id2 word5 word6 0.73
3 user1 id2 word7 word8 0.69
4 user2 id3 word9 word10 0.59
5 user2 id3 word11 word12 0.13
6 user2 id4 word13 word14 0.41
7 user2 id4 word14 word15 0.74
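As for what went wrong in the original attempt: the comprehension there only iterates over two levels (user and id), so each innermost dictionary is stored whole as a single cell of Col3. A minimal sketch of the same pd.Series approach, extended to iterate over the third level as well, might look like the following. It assumes user_dict is defined as in the question; the name flat is only illustrative.

```python
import pandas as pd

user_dict = {'user1': {'id1': {('word1', 'word2'): 0.99, ('word3', 'word4'): 0.16},
                       'id2': {('word5', 'word6'): 0.73, ('word7', 'word8'): 0.69}},
             'user2': {'id3': {('word9', 'word10'): 0.59, ('word11', 'word12'): 0.13},
                       'id4': {('word13', 'word14'): 0.41, ('word14', 'word15'): 0.74}}}

# Flatten all three levels: each key becomes a 4-tuple (user, id, w1, w2)
# and each value is the score itself rather than an inner dictionary.
flat = {
    (user, id_, w1, w2): score
    for user, ids in user_dict.items()
    for id_, pairs in ids.items()
    for (w1, w2), score in pairs.items()
}

# A Series built from tuple keys gets a MultiIndex; name its levels,
# then turn them into ordinary columns.
df = (pd.Series(flat)
        .rename_axis(['user', 'id', 'w1', 'w2'])
        .reset_index(name='score'))
print(df)
```

Leaving out the final reset_index keeps user, id, w1 and w2 as index levels, which prints repeated labels only once, closer to the grouped layout sketched in the question.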
Or, with fewer loops:
>>> pd.concat({k: pd.DataFrame(v) for k, v in user_dict.items()}).melt(ignore_index=False).dropna()
variable value
user1 word1 word2 id1 0.99
word3 word4 id1 0.16
word5 word6 id2 0.73
word7 word8 id2 0.69
user2 word9 word10 id3 0.59
word11 word12 id3 0.13
word13 word14 id4 0.41
word14 word15 id4 0.74
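The result above keeps the user and the word pair in the index and uses melt's default column names (variable and value). If the layout from the question is wanted, one possible follow-up is to pass var_name and value_name to melt, name the index levels, and reset the index. This is a sketch, assuming pandas 1.1 or later (needed for ignore_index) and user_dict as defined in the question; the names wide and tidy are only illustrative.

```python
import pandas as pd

user_dict = {'user1': {'id1': {('word1', 'word2'): 0.99, ('word3', 'word4'): 0.16},
                       'id2': {('word5', 'word6'): 0.73, ('word7', 'word8'): 0.69}},
             'user2': {'id3': {('word9', 'word10'): 0.59, ('word11', 'word12'): 0.13},
                       'id4': {('word13', 'word14'): 0.41, ('word14', 'word15'): 0.74}}}

# One frame per user: columns are the ids, the index holds the word pairs.
# concat stacks them and adds the user as the outermost index level.
wide = pd.concat({user: pd.DataFrame(ids) for user, ids in user_dict.items()})

tidy = (wide.melt(ignore_index=False, var_name='id', value_name='score')
            .dropna()                           # drop id/word-pair combinations that never occur
            .rename_axis(['user', 'w1', 'w2'])  # name the three index levels
            .reset_index()
            [['user', 'id', 'w1', 'w2', 'score']])
print(tidy)
```

The dropna step is needed because concat aligns all id columns across users, so every word pair gets a NaN for each id it does not belong to.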