How to transform a dataframe so that column values are row values
I have the following dataframe:
import pandas as pd

df = pd.DataFrame({'fruit': ['berries', 'berries', 'berries', 'tropical',
                             'tropical', 'tropical', 'berries', 'nuts'],
                   'code': [100, 100, 100, 200, 200, 300, 400, 500],
                   'subcode': ['100A', '100B', '100C', '200A', '200B', '300A',
                               '400A', '500A']})
   code     fruit subcode
0   100   berries    100A
1   100   berries    100B
2   100   berries    100C
3   200  tropical    200A
4   200  tropical    200B
5   300  tropical    300A
6   400   berries    400A
7   500      nuts    500A
I would like to transform the dataframe into this format:
   code     fruit subcode1 subcode2 subcode3
0   100   berries     100A     100B     100C
3   200  tropical     200A     200B
5   300  tropical     300A
6   400   berries     400A
7   500      nuts     500A
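Unfortunately, I don't know how to proceed. I have looked through posts on similar problems, and they use combinations of stack and unstack. I suspect some concatenation is involved as well. Any advice that points me in the right direction is much appreciated!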
A little set_index and unstack will get you there.
(df.set_index(['code', 'fruit'])
   .set_index(df.subcode.str.extract('([a-zA-Z]+)', expand=False), append=True)
   .subcode
   .unstack()
   .fillna('')                  # these last three
   .reset_index()               # operations are
   .rename_axis(None, axis=1)   # not important
)
   code     fruit     A     B     C
0   100   berries  100A  100B  100C
1   200  tropical  200A  200B
2   300  tropical  300A
3   400   berries  400A
4   500      nuts  500A
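To see why this works, it may help to spell the steps out. Below is a minimal, self-contained sketch on the question's sample data. It assumes every subcode ends in a letter suffix and that no suffix repeats within a (code, fruit) pair, since unstack raises on duplicate index entries. The column name suffix and the variable wide are illustrative and not part of the answer above; the helper column is added with assign here instead of passing a Series to set_index.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['berries', 'berries', 'berries', 'tropical',
              'tropical', 'tropical', 'berries', 'nuts'],
    'code': [100, 100, 100, 200, 200, 300, 400, 500],
    'subcode': ['100A', '100B', '100C', '200A', '200B', '300A',
                '400A', '500A'],
})

wide = (
    # Letter part of each subcode ('100A' -> 'A'); it becomes the column label.
    df.assign(suffix=df['subcode'].str.extract('([a-zA-Z]+)', expand=False))
      # Each subcode is now addressed by (code, fruit, suffix).
      .set_index(['code', 'fruit', 'suffix'])['subcode']
      # Move the suffix level from the rows into the columns.
      .unstack('suffix')
      # Cosmetic steps: blank out missing cells and flatten the index.
      .fillna('')
      .reset_index()
      .rename_axis(None, axis=1)
)

print(wide)
```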
You can use groupby, take the values of each group, and convert them to a Series.
(df.groupby(['code', 'fruit'])['subcode']
   .apply(lambda x: x.values)   # one array of subcodes per group
   .apply(pd.Series)            # expand each array into columns
   .add_prefix('subcode_'))
              subcode_0 subcode_1 subcode_2
code fruit
100  berries       100A      100B      100C
200  tropical      200A      200B       NaN
300  tropical      300A       NaN       NaN
400  berries       400A       NaN       NaN
500  nuts          500A       NaN       NaN
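If you want numbered columns (subcode1, subcode2, ...) as in the question, one possible alternative to apply(pd.Series) is to number the rows within each group using cumcount and then unstack on that number. This is a different technique from the answer above, sketched under the assumption that the existing row order within each (code, fruit) group is the order you want. The names pos and wide are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['berries', 'berries', 'berries', 'tropical',
              'tropical', 'tropical', 'berries', 'nuts'],
    'code': [100, 100, 100, 200, 200, 300, 400, 500],
    'subcode': ['100A', '100B', '100C', '200A', '200B', '300A',
                '400A', '500A'],
})

# 1-based position of each row within its (code, fruit) group.
pos = df.groupby(['code', 'fruit']).cumcount() + 1

wide = (
    df.assign(pos=pos)
      .set_index(['code', 'fruit', 'pos'])['subcode']
      .unstack('pos')               # one column per position
      .add_prefix('subcode')        # 1 -> 'subcode1', 2 -> 'subcode2', ...
      .rename_axis(None, axis=1)    # drop the 'pos' label on the columns
      .reset_index()
)

print(wide)
```

Groups with fewer subcodes get NaN in the trailing columns; add .fillna('') before reset_index if you prefer blanks.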
With defaultdict:
from collections import defaultdict

d = defaultdict(list)
for f, c, s in df.itertuples(index=False):
    d[(f, c)].append(s)

pd.DataFrame.from_dict(
    {k: dict(enumerate(v)) for k, v in d.items()}, orient='index'
).add_prefix('subcode').rename_axis(['fruit', 'code']).reset_index()
      fruit  code subcode0 subcode1 subcode2
0   berries   100     100A     100B     100C
1   berries   400     400A      NaN      NaN
2      nuts   500     500A      NaN      NaN
3  tropical   200     200A     200B      NaN
4  tropical   300     300A      NaN      NaN
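Note that the loop above unpacks each row positionally, so it assumes the columns are ordered fruit, code, subcode. A hedged variant that picks the columns by name and builds plain row dicts could look like this. The names groups, rows and wide are illustrative, and the rows come out in first-seen order instead of sorted.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({
    'fruit': ['berries', 'berries', 'berries', 'tropical',
              'tropical', 'tropical', 'berries', 'nuts'],
    'code': [100, 100, 100, 200, 200, 300, 400, 500],
    'subcode': ['100A', '100B', '100C', '200A', '200B', '300A',
                '400A', '500A'],
})

groups = defaultdict(list)
# Select the columns by name so the loop does not depend on column order.
for fruit, code, subcode in zip(df['fruit'], df['code'], df['subcode']):
    groups[(fruit, code)].append(subcode)

# One dict per group: the two keys plus subcode0, subcode1, ...
rows = [
    {'fruit': fruit, 'code': code,
     **{f'subcode{i}': s for i, s in enumerate(subs)}}
    for (fruit, code), subs in groups.items()
]

# Keys missing from a row become NaN in the resulting frame.
wide = pd.DataFrame(rows)
print(wide)
```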