如何使用列表将数据解析为 2 列中的列
How to parse data to columns from 2 columns with lists
我得到的数据看起来像那样
第一列 - 事件名称,第二列 - 用户 ID,第三列 - 列名,第四列 - 值。对于不同的事件名称,我要解析的列数也不同。
我需要这样的数据:
event name
user id
column 1(dialog_id)
...
column n (tutorial_id)
...
Dialogs
01f0548f-5bc6-460b-a702-cfe0aa4b90b8
EcORCTz
...
None
...
Tutorial_step
01f0548f-5bc6-460b-a702-cfe0aa4b90b8
None
...
LjzXlwB
...
某些列可能会在不同的 event_names 中重复。
之前我用过这样的东西:
df['param_keys'] = df['param_keys'].apply(lambda x: ast.literal_eval(x))
df['param_values'] = df['param_values'].apply(lambda x: ast.literal_eval(x))
return(df)
它工作得很好,但现在它 returns 只是同一个数据帧。
我附上the example of data试试看。
非常感谢!
只需从 df['param_values'].tolist()
创建一个新数据框并使用 pd.concat
将其与原始数据框连接在一起:
# Convert the lists to *actual* lists, instead of just strings
df[['param_keys', 'param_values']] = df[['param_keys', 'param_values']].apply(lambda col: col.apply(ast.literal_eval))
# Process
df = pd.concat([df.drop(['param_values', 'param_keys'], axis=1), pd.DataFrame(df['param_values'].tolist(), columns=df.loc[0, 'param_keys'])], axis=1)
输出:
>>> df
event_name idfa user_event_time tutorial_id tutorial_name step step_name
0 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:44.000+0000 K5MDyLy C_001_Lighthouse_Start 1 ShowCutscene
1 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:44.000+0000 K5MDyLy C_001_Lighthouse_Start 0 ToggleInput
2 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:48.000+0000 K5MDyLy C_001_Lighthouse_Start 2 ToggleInput
3 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:52.000+0000 eIdJreF T_Map_001_Lighthouse_01 0 ToggleInput
4 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:52.000+0000 eIdJreF T_Map_001_Lighthouse_01 2 Delay
.. ... ... ... ... ... ... ...
995 Tutorial_step d6949452-3bf2-4599-ac48-7a3df334da98 2021-12-11T07:40:48.000+0000 uC954aC T_Map_003_Diner_01 2 TapBuilding
996 Tutorial_step d6949452-3bf2-4599-ac48-7a3df334da98 2021-12-11T07:40:53.000+0000 uC954aC T_Map_003_Diner_01 4 UnlockBuilding
997 Tutorial_step d6949452-3bf2-4599-ac48-7a3df334da98 2021-12-11T07:40:53.000+0000 uC954aC T_Map_003_Diner_01 3 ToggleInput
998 Tutorial_step d6949452-3bf2-4599-ac48-7a3df334da98 2021-12-11T07:41:08.000+0000 QBK4BoO T_Merge_010_Diner_01 3 ToggleInput
999 Tutorial_step d6949452-3bf2-4599-ac48-7a3df334da98 2021-12-11T07:41:08.000+0000 QBK4BoO T_Merge_010_Diner_01 2 MergeItems
这可以使用 pivot_table 来实现。考虑以下代码:
df = pd.read_csv('Sample.csv')
df['idx1'] = df.index
df['param_keys'] = df['param_keys'].apply(lambda x: ast.literal_eval(x))
df['param_values'] = df['param_values'].apply(lambda x: ast.literal_eval(x))
#explodes param_keys and param_values list columns in to separate rows.
df = df.explode(['param_keys','param_values'])
#converting rows to columns
df = df.pivot_table(index=['idx1', 'event_name', 'idfa', 'user_event_time'],
columns=['param_keys'],values='param_values',
aggfunc=lambda x: ''.join(x)).fillna(0)
print(df)
您的 csv 文件的前十行是:
param_keys step step_name tutorial_id tutorial_name
idx1 event_name idfa user_event_time
0 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:44.000+0000 1 ShowCutscene K5MDyLy C_001_Lighthouse_Start
1 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:48.000+0000 2 ToggleInput K5MDyLy C_001_Lighthouse_Start
2 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:52.000+0000 0 ToggleInput eIdJreF T_Map_001_Lighthouse_01
3 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:52.000+0000 2 Delay eIdJreF T_Map_001_Lighthouse_01
4 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:52.000+0000 1 SetBuildingBase eIdJreF T_Map_001_Lighthouse_01
5 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:56.000+0000 2 MergeItems LjzXIwB T_Merge_001_Lighthouse_01
6 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:56.000+0000 1 Delay LjzXIwB T_Merge_001_Lighthouse_01
7 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:56.000+0000 1 TapBuilding vQXKqlS T_Map_002_Lighthouse_02b
8 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:56.000+0000 0 ToggleInput vQXKqlS T_Map_002_Lighthouse_02b
9 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:56.000+0000 3 ToggleInput eIdJreF T_Map_001_Lighthouse_01
我得到的数据看起来像那样
第一列 - 事件名称,第二列 - 用户 ID,第三列 - 列名,第四列 - 值。对于不同的事件名称,我要解析的列数也不同。
我需要这样的数据:
event name | user id | column 1(dialog_id) | ... | column n (tutorial_id) | ... |
---|---|---|---|---|---|
Dialogs | 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 | EcORCTz | ... | None | ... |
Tutorial_step | 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 | None | ... | LjzXlwB | ... |
某些列可能会在不同的 event_names 中重复。
之前我用过这样的东西:
df['param_keys'] = df['param_keys'].apply(lambda x: ast.literal_eval(x))
df['param_values'] = df['param_values'].apply(lambda x: ast.literal_eval(x))
return(df)
它工作得很好,但现在它 returns 只是同一个数据帧。
我附上the example of data试试看。
非常感谢!
只需从 df['param_values'].tolist()
创建一个新数据框并使用 pd.concat
将其与原始数据框连接在一起:
# Convert the lists to *actual* lists, instead of just strings
df[['param_keys', 'param_values']] = df[['param_keys', 'param_values']].apply(lambda col: col.apply(ast.literal_eval))
# Process
df = pd.concat([df.drop(['param_values', 'param_keys'], axis=1), pd.DataFrame(df['param_values'].tolist(), columns=df.loc[0, 'param_keys'])], axis=1)
输出:
>>> df
event_name idfa user_event_time tutorial_id tutorial_name step step_name
0 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:44.000+0000 K5MDyLy C_001_Lighthouse_Start 1 ShowCutscene
1 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:44.000+0000 K5MDyLy C_001_Lighthouse_Start 0 ToggleInput
2 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:48.000+0000 K5MDyLy C_001_Lighthouse_Start 2 ToggleInput
3 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:52.000+0000 eIdJreF T_Map_001_Lighthouse_01 0 ToggleInput
4 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:52.000+0000 eIdJreF T_Map_001_Lighthouse_01 2 Delay
.. ... ... ... ... ... ... ...
995 Tutorial_step d6949452-3bf2-4599-ac48-7a3df334da98 2021-12-11T07:40:48.000+0000 uC954aC T_Map_003_Diner_01 2 TapBuilding
996 Tutorial_step d6949452-3bf2-4599-ac48-7a3df334da98 2021-12-11T07:40:53.000+0000 uC954aC T_Map_003_Diner_01 4 UnlockBuilding
997 Tutorial_step d6949452-3bf2-4599-ac48-7a3df334da98 2021-12-11T07:40:53.000+0000 uC954aC T_Map_003_Diner_01 3 ToggleInput
998 Tutorial_step d6949452-3bf2-4599-ac48-7a3df334da98 2021-12-11T07:41:08.000+0000 QBK4BoO T_Merge_010_Diner_01 3 ToggleInput
999 Tutorial_step d6949452-3bf2-4599-ac48-7a3df334da98 2021-12-11T07:41:08.000+0000 QBK4BoO T_Merge_010_Diner_01 2 MergeItems
这可以使用 pivot_table 来实现。考虑以下代码:
df = pd.read_csv('Sample.csv')
df['idx1'] = df.index
df['param_keys'] = df['param_keys'].apply(lambda x: ast.literal_eval(x))
df['param_values'] = df['param_values'].apply(lambda x: ast.literal_eval(x))
#explodes param_keys and param_values list columns in to separate rows.
df = df.explode(['param_keys','param_values'])
#converting rows to columns
df = df.pivot_table(index=['idx1', 'event_name', 'idfa', 'user_event_time'],
columns=['param_keys'],values='param_values',
aggfunc=lambda x: ''.join(x)).fillna(0)
print(df)
您的 csv 文件的前十行是:
param_keys step step_name tutorial_id tutorial_name
idx1 event_name idfa user_event_time
0 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:44.000+0000 1 ShowCutscene K5MDyLy C_001_Lighthouse_Start
1 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:48.000+0000 2 ToggleInput K5MDyLy C_001_Lighthouse_Start
2 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:52.000+0000 0 ToggleInput eIdJreF T_Map_001_Lighthouse_01
3 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:52.000+0000 2 Delay eIdJreF T_Map_001_Lighthouse_01
4 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:52.000+0000 1 SetBuildingBase eIdJreF T_Map_001_Lighthouse_01
5 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:56.000+0000 2 MergeItems LjzXIwB T_Merge_001_Lighthouse_01
6 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:56.000+0000 1 Delay LjzXIwB T_Merge_001_Lighthouse_01
7 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:56.000+0000 1 TapBuilding vQXKqlS T_Map_002_Lighthouse_02b
8 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:56.000+0000 0 ToggleInput vQXKqlS T_Map_002_Lighthouse_02b
9 Tutorial_step 01f0548f-5bc6-460b-a702-cfe0aa4b90b8 2021-12-10T00:00:56.000+0000 3 ToggleInput eIdJreF T_Map_001_Lighthouse_01