Transform the dataframe from long to wide using pandas - Single row output
I have a dataframe that looks like this:
import pandas as pd

df = pd.DataFrame({
'subject_id':[1,1,1,1,2,2,2,2],
'date':['2173/04/11','2173/04/12','2173/04/11','2173/04/12','2173/05/14','2173/05/15','2173/05/14','2173/05/15'],
'time_1':['2173/04/11 12:35:00','2173/04/12 12:50:00','2173/04/11 12:59:00','2173/04/12 13:14:00','2173/05/14 13:37:00','2173/05/15 13:39:00','2173/05/14 18:37:00','2173/05/15 19:39:00'],
'val' :[5,5,40,40,7,7,38,38],
'iid' :[12,12,12,12,21,21,21,21]
})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
I tried the stack, unstack, pivot and melt methods, but none of them seem to help:
pd.melt(df, id_vars =['subject_id','val'], value_vars =['date','val']) #1
df.unstack().reset_index() #2
df.pivot(index='subject_id', columns='time_1', values='val') #3
I expect my output dataframe to look like the one below:
[updated screenshot of the expected wide-format output]
The idea is to create a helper Series with GroupBy.cumcount, using the same column(s) as the new index (here subject_id), build a MultiIndex, reshape with DataFrame.unstack, and finally flatten the MultiIndex in the columns:
cols = ['time_1','val']
df = df.set_index(['subject_id', df.groupby('subject_id').cumcount().add(1)])[cols].unstack()
df.columns = [f'{a}{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
subject_id time_11 time_12 time_13 \
0 1 2173-04-11 12:35:00 2173-04-12 12:50:00 2173-04-11 12:59:00
1 2 2173-05-14 13:37:00 2173-05-15 13:39:00 2173-05-14 18:37:00
time_14 val1 val2 val3 val4
0 2173-04-12 13:14:00 5 5 40 40
1 2173-05-15 19:39:00 7 7 38 38
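An equivalent sketch, assuming a pandas version where DataFrame.pivot accepts a list of value columns: materialize the same counter as a helper column in the original long-format df and pivot on it; the column flattening stays the same.
# alternative: keep the counter as a regular column and use DataFrame.pivot
# (applied to the original long-format df, before the set_index/unstack step)
df['g'] = df.groupby('subject_id').cumcount().add(1)
out = df.pivot(index='subject_id', columns='g', values=['time_1', 'val'])
out.columns = [f'{a}{b}' for a, b in out.columns]
out = out.reset_index()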
If the number of rows per id group differs, missing values are necessary - unstack uses the maximal count and then adds the missing values:
df = pd.DataFrame({
'subject_id':[1,1,1,2,2,3],
'date':['2173/04/11','2173/04/12','2173/04/11','2173/04/12','2173/05/14','2173/05/15'],
'time_1':['2173/04/11 12:35:00','2173/04/12 12:50:00','2173/04/11 12:59:00',
'2173/04/12 13:14:00','2173/05/14 13:37:00','2173/05/15 13:39:00'],
'val' :[5,5,40,40,7,7],
'iid' :[12,12,12,12,21,21]
})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
print (df)
subject_id date time_1 val iid day
0 1 2173/04/11 2173-04-11 12:35:00 5 12 11
1 1 2173/04/12 2173-04-12 12:50:00 5 12 12
2 1 2173/04/11 2173-04-11 12:59:00 40 12 11
3 2 2173/04/12 2173-04-12 13:14:00 40 12 12
4 2 2173/05/14 2173-05-14 13:37:00 7 21 14
5 3 2173/05/15 2173-05-15 13:39:00 7 21 15
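The same approach works here; to see how the second index level is built, note that the GroupBy.cumcount helper restarts within each subject_id group (a quick inspection sketch):
# helper counter numbering rows within each subject_id group
helper = df.groupby('subject_id').cumcount().add(1)
print(helper.tolist())
# [1, 2, 3, 1, 2, 1] -> 3 positions for subject 1, 2 for subject 2, 1 for subject 3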
cols = ['time_1','val']
df = df.set_index(['subject_id', df.groupby('subject_id').cumcount().add(1)])[cols].unstack()
df.columns = [f'{a}{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
subject_id time_11 time_12 time_13 \
0 1 2173-04-11 12:35:00 2173-04-12 12:50:00 2173-04-11 12:59:00
1 2 2173-04-12 13:14:00 2173-05-14 13:37:00 NaT
2 3 2173-05-15 13:39:00 NaT NaT
val1 val2 val3
0 5.0 5.0 40.0
1 40.0 7.0 NaN
2 7.0 NaN NaN
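Because of the added missing values the val columns are upcast to float; if integer output is needed, a possible follow-up (assuming pandas 1.0+ with the nullable Int64 dtype) is:
# optional: convert the val columns to a nullable integer dtype
val_cols = df.filter(like='val').columns
df[val_cols] = df[val_cols].astype('Int64')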