How to insert rows in dataframe based on specific condition?
I have the following dataframe:
Index | Time | User | Description |
---|---|---|---|
1 | 27.10.2021 15:58:00 | UserA@gmail.com | Tab Alpha of type PARTSTUDIO opened by User A |
2 | 27.10.2021 15:59:00 | UserA@gmail.com | Start edit of part studio feature |
3 | 27.10.2021 15:59:00 | UserA@gmail.com | Cancel Operation |
4 | 27.10.2021 15:59:00 | UserB@gmail.com | Tab Alpha of type PARTSTUDIO opened by User B |
5 | 27.10.2021 15:59:00 | UserB@gmail.com | Start edit of part studio feature |
6 | 27.10.2021 16:03:00 | UserB@gmail.com | Cancel Operation |
7 | 27.10.2021 16:03:00 | UserA@gmail.com | Add assembly feature |
9 | 27.10.2021 16:03:00 | UserA@gmail.com | Tab Beta of type PARTSTUDIO opened by User A |
10 | 27.10.2021 16:15:00 | UserA@gmail.com | Start edit of part studio feature |
11 | 27.10.2021 16:15:00 | UserB@gmail.com | Start edit of part studio feature |
12 | 27.10.2021 16:15:00 | UserB@gmail.com | Tab Alpha of type PARTSTUDIO closed by User B |
14 | 27.10.2021 16:54:00 | UserB@gmail.com | Add assembly feature |
15 | 27.10.2021 16:55:00 | UserA@gmail.com | Tab Beta of type PARTSTUDIO closed by User A |
16 | 27.10.2021 16:55:00 | UserB@gmail.com | Start edit of part studio feature |
17 | 27.10.2021 16:55:00 | UserB@gmail.com | Tab Delta of type PARTSTUDIO closed by User B |
Expected output:
Index | Time | User | Description |
---|---|---|---|
1 | 27.10.2021 15:58:00 | UserA@gmail.com | Tab Alpha of type PARTSTUDIO opened by User A |
2 | 27.10.2021 15:59:00 | UserA@gmail.com | Start edit of part studio feature |
3 | 27.10.2021 15:59:00 | UserA@gmail.com | Cancel Operation |
4 | 27.10.2021 15:59:00 | UserB@gmail.com | Tab Alpha of type PARTSTUDIO opened by User B |
5 | 27.10.2021 15:59:00 | UserB@gmail.com | Start edit of part studio feature |
6 | 27.10.2021 16:03:00 | UserB@gmail.com | Cancel Operation |
7 | 27.10.2021 16:03:00 | UserA@gmail.com | Add assembly feature |
8 | 27.10.2021 16:03:00 | UserA@gmail.com | Tab Alpha of type PARTSTUDIO closed by User A |
9 | 27.10.2021 16:03:00 | UserA@gmail.com | Tab Beta of type PARTSTUDIO opened by User A |
10 | 27.10.2021 16:15:00 | UserA@gmail.com | Start edit of part studio feature |
11 | 27.10.2021 16:15:00 | UserB@gmail.com | Start edit of part studio feature |
12 | 27.10.2021 16:15:00 | UserB@gmail.com | Tab Alpha of type PARTSTUDIO closed by User B |
13 | 27.10.2021 16:15:00 | UserB@gmail.com | Tab Delta of type PARTSTUDIO opened by User B |
14 | 27.10.2021 16:54:00 | UserB@gmail.com | Add assembly feature |
15 | 27.10.2021 16:55:00 | UserA@gmail.com | Tab Beta of type PARTSTUDIO closed by User A |
16 | 27.10.2021 16:55:00 | UserB@gmail.com | Start edit of part studio feature |
17 | 27.10.2021 16:55:00 | UserB@gmail.com | Tab Delta of type PARTSTUDIO closed by User B |
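How can I go through the dataframe and check, for each "Tab x opened by User y" value in the Description column, whether a matching "Tab x closed by User y" appears further down the dataframe? If it does, fine. If it does not, and the user's next tab event is "Tab zz opened by User A", then "Tab x closed by User y" is missing and a row should be inserted before the "Tab zz opened by User A" row (index 8 in the example). The same applies in reverse (index 13). Is there a way to do this without df.iterrows? Thanks in advance.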
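Sorry, I forgot to answer this question.
Here is one solution. It is neither very concise nor particularly elegant, but it should be faster than using iterrows to modify rows while looking ahead at later ones.
Data: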
Time User Description
0 27.10.2021 15:58:00 UserA@gmail.com Tab Alpha of type PARTSTUDIO opened by User A
1 27.10.2021 15:59:00 UserA@gmail.com Start edit of part studio feature
2 27.10.2021 15:59:00 UserA@gmail.com Cancel Operation
3 27.10.2021 15:59:00 UserB@gmail.com Tab Alpha of type PARTSTUDIO opened by User B
4 27.10.2021 15:59:00 UserB@gmail.com Start edit of part studio feature
5 27.10.2021 16:03:00 UserB@gmail.com Cancel Operation
6 27.10.2021 16:03:00 UserA@gmail.com Add assembly feature
7 27.10.2021 16:03:00 UserA@gmail.com Tab Beta of type PARTSTUDIO opened by User A
8 27.10.2021 16:03:00 UserA@gmail.com Tab Gamma of type PARTSTUDIO opened by User A
9 27.10.2021 16:14:00 UserA@gmail.com Tab Beta of type PARTSTUDIO opened by User A
10 27.10.2021 16:15:00 UserA@gmail.com Start edit of part studio feature
11 27.10.2021 16:15:00 UserB@gmail.com Start edit of part studio feature
12 27.10.2021 16:15:00 UserB@gmail.com Tab Alpha of type PARTSTUDIO closed by User B
13 27.10.2021 16:54:00 UserB@gmail.com Add assembly feature
14 27.10.2021 16:55:00 UserA@gmail.com Tab Beta of type PARTSTUDIO closed by User A
15 27.10.2021 16:55:00 UserB@gmail.com Start edit of part studio feature
16 27.10.2021 16:55:00 UserB@gmail.com Tab Delta of type PARTSTUDIO closed by User B
17 27.10.2021 16:56:00 UserB@gmail.com Tab Alpha of type PARTSTUDIO closed by User B
18 27.10.2021 16:57:00 UserB@gmail.com Tab Beta of type PARTSTUDIO closed by User B
I added a few consecutive open/close events for more testing.
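To run the code that follows, the test data above has to exist as a DataFrame named `df`. A minimal sketch of one way to build it is shown here; the helper `tab()` and the constants `A`, `B`, `EDIT`, `CANCEL` and `ADD` are only illustrative shorthand and are not part of the original answer.

```python
import pandas as pd

A, B = "UserA@gmail.com", "UserB@gmail.com"
EDIT = "Start edit of part studio feature"
CANCEL = "Cancel Operation"
ADD = "Add assembly feature"

def tab(name, action, user):
    """Build a tab-event description (hypothetical helper)."""
    return f"Tab {name} of type PARTSTUDIO {action} by User {user}"

# (time of day, user, description), in the same order as the printed data.
rows = [
    ("15:58:00", A, tab("Alpha", "opened", "A")),
    ("15:59:00", A, EDIT),
    ("15:59:00", A, CANCEL),
    ("15:59:00", B, tab("Alpha", "opened", "B")),
    ("15:59:00", B, EDIT),
    ("16:03:00", B, CANCEL),
    ("16:03:00", A, ADD),
    ("16:03:00", A, tab("Beta", "opened", "A")),
    ("16:03:00", A, tab("Gamma", "opened", "A")),
    ("16:14:00", A, tab("Beta", "opened", "A")),
    ("16:15:00", A, EDIT),
    ("16:15:00", B, EDIT),
    ("16:15:00", B, tab("Alpha", "closed", "B")),
    ("16:54:00", B, ADD),
    ("16:55:00", A, tab("Beta", "closed", "A")),
    ("16:55:00", B, EDIT),
    ("16:55:00", B, tab("Delta", "closed", "B")),
    ("16:56:00", B, tab("Alpha", "closed", "B")),
    ("16:57:00", B, tab("Beta", "closed", "B")),
]

df = pd.DataFrame(
    [("27.10.2021 " + t, user, desc) for t, user, desc in rows],
    columns=["Time", "User", "Description"],
)
```

The times are kept as plain strings here, exactly as printed; the answer's code only copies them between rows and never parses them.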
Code:
import pandas as pd

# Pattern to extract the tab name, tab type, action and user from tab events.
pattern = r'^Tab (?P<tab_name>.+) of type (?P<tab_type>.+) (?P<tab_action>\bclosed\b|\bopened\b) by (?P<user_id>.+)$'
# Add utility columns (NaN for rows that are not tab events).
df = pd.concat([df, df['Description'].str.extract(pattern)], axis=1)

# Build the missing rows for one user, with fractional index labels that
# place each new row just before or after the row it belongs next to.
def get_new_rows(df):
    all_values = []
    for action in ['opened', 'closed']:
        action_mask = df['tab_action'].eq(action)
        # First and second row of each pair of consecutive identical actions.
        first_tabs = df[df['tab_action'].eq(df['tab_action'].shift(-1)) & action_mask]
        second_tabs = df[df['tab_action'].eq(df['tab_action'].shift(1)) & action_mask]
        if len(first_tabs) == 0:
            continue
        if action == 'opened':
            # Two opens in a row: close the first tab just before the second open.
            values_tab, index_tab, offset, new_action = first_tabs, second_tabs, -0.5, 'closed'
        elif action == 'closed':
            # Two closes in a row: open the second tab just after the first close.
            values_tab, index_tab, offset, new_action = second_tabs, first_tabs, 0.5, 'opened'
        values_tab = values_tab.copy()  # work on a copy, not a slice of df
        values_tab.index = index_tab.index + offset
        values_tab['Time'] = index_tab['Time'].to_numpy()
        values_tab['tab_action'] = new_action
        all_values.append(values_tab)
    # If the user's last tab event is 'opened', add a matching closing row.
    last_action = df.tail(1).copy()
    if last_action['tab_action'].iat[0] == 'opened':
        last_action.index += 0.5
        last_action['tab_action'] = 'closed'
        all_values.append(last_action)
    # pd.concat raises on an empty list, so return an empty frame instead.
    return pd.concat(all_values) if all_values else df.iloc[:0]

# Add the new rows at the correct positions.
new_rows = (
    df.dropna(subset=['tab_action'])
    .groupby(['user_id'], as_index=False)
    .apply(get_new_rows)
    .droplevel(0)
)
complete_df = pd.concat([df, new_rows]).sort_index().reset_index(drop=True)
# Rebuild the description of every tab event from the utility columns.
fix_m = complete_df['tab_name'].notna()
complete_df.loc[fix_m, 'Description'] = ('Tab ' + complete_df.loc[fix_m, 'tab_name'] +
                                         ' of type ' + complete_df.loc[fix_m, 'tab_type'] +
                                         ' ' + complete_df.loc[fix_m, 'tab_action'] + ' by ' +
                                         complete_df.loc[fix_m, 'user_id'])
# Drop utility columns.
complete_df = complete_df.drop(columns=['tab_name', 'tab_type', 'tab_action', 'user_id'])
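The positioning step relies on giving each new row a fractional index label (such as 0.5) so that `sort_index()` slots it between two existing integer labels. A minimal, isolated sketch of that trick follows; the toy descriptions are made up for illustration and are not the real log data.

```python
import pandas as pd

# Toy frame with the default integer index 0, 1, 2.
toy = pd.DataFrame({"Description": ["opened A", "opened B", "closed B"]})

# Hypothetical missing row: "closed A" belongs just before label 1,
# so it gets the fractional label 0.5.
missing = pd.DataFrame({"Description": ["closed A"]}, index=[0.5])

# Concatenate, sort by the mixed integer/fractional labels, then renumber.
result = pd.concat([toy, missing]).sort_index().reset_index(drop=True)
print(result["Description"].tolist())
# ['opened A', 'closed A', 'opened B', 'closed B']
```

Because every offset is ±0.5, a new row can never collide with an existing integer label, and `reset_index(drop=True)` restores a clean 0..n-1 index afterwards.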
Result:
Time User Description
0 27.10.2021 15:58:00 UserA@gmail.com Tab Alpha of type PARTSTUDIO opened by User A
1 27.10.2021 15:59:00 UserA@gmail.com Start edit of part studio feature
2 27.10.2021 15:59:00 UserA@gmail.com Cancel Operation
3 27.10.2021 15:59:00 UserB@gmail.com Tab Alpha of type PARTSTUDIO opened by User B
4 27.10.2021 15:59:00 UserB@gmail.com Start edit of part studio feature
5 27.10.2021 16:03:00 UserB@gmail.com Cancel Operation
6 27.10.2021 16:03:00 UserA@gmail.com Add assembly feature
7 27.10.2021 16:03:00 UserA@gmail.com Tab Alpha of type PARTSTUDIO closed by User A
8 27.10.2021 16:03:00 UserA@gmail.com Tab Beta of type PARTSTUDIO opened by User A
9 27.10.2021 16:03:00 UserA@gmail.com Tab Beta of type PARTSTUDIO closed by User A
10 27.10.2021 16:03:00 UserA@gmail.com Tab Gamma of type PARTSTUDIO opened by User A
11 27.10.2021 16:14:00 UserA@gmail.com Tab Gamma of type PARTSTUDIO closed by User A
12 27.10.2021 16:14:00 UserA@gmail.com Tab Beta of type PARTSTUDIO opened by User A
13 27.10.2021 16:15:00 UserA@gmail.com Start edit of part studio feature
14 27.10.2021 16:15:00 UserB@gmail.com Start edit of part studio feature
15 27.10.2021 16:15:00 UserB@gmail.com Tab Alpha of type PARTSTUDIO closed by User B
16 27.10.2021 16:15:00 UserB@gmail.com Tab Delta of type PARTSTUDIO opened by User B
17 27.10.2021 16:54:00 UserB@gmail.com Add assembly feature
18 27.10.2021 16:55:00 UserA@gmail.com Tab Beta of type PARTSTUDIO closed by User A
19 27.10.2021 16:55:00 UserB@gmail.com Start edit of part studio feature
20 27.10.2021 16:55:00 UserB@gmail.com Tab Delta of type PARTSTUDIO closed by User B
21 27.10.2021 16:55:00 UserB@gmail.com Tab Alpha of type PARTSTUDIO opened by User B
22 27.10.2021 16:56:00 UserB@gmail.com Tab Alpha of type PARTSTUDIO closed by User B
23 27.10.2021 16:56:00 UserB@gmail.com Tab Beta of type PARTSTUDIO opened by User B
24 27.10.2021 16:57:00 UserB@gmail.com Tab Beta of type PARTSTUDIO closed by User B
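One way to sanity-check a result like this is to confirm that, for each user, tab actions strictly alternate between opened and closed. The sketch below is an assumption about how such a check could look and is not part of the original answer; the function name `actions_alternate` is hypothetical, and it checks only the action sequence, not whether tab names match.

```python
import pandas as pd

PATTERN = (r"^Tab (?P<tab_name>.+) of type (?P<tab_type>.+) "
           r"(?P<tab_action>closed|opened) by (?P<user_id>.+)$")

def actions_alternate(frame):
    """Return True if no user has two identical tab actions in a row."""
    # Keep only the rows that are tab events.
    info = frame["Description"].str.extract(PATTERN).dropna(subset=["tab_action"])
    # Previous tab action of the same user (NaN for the user's first event).
    previous = info.groupby("user_id")["tab_action"].shift()
    return not info["tab_action"].eq(previous).any()

fixed = pd.DataFrame({"Description": [
    "Tab Alpha of type PARTSTUDIO opened by User A",
    "Add assembly feature",
    "Tab Alpha of type PARTSTUDIO closed by User A",
    "Tab Beta of type PARTSTUDIO opened by User A",
]})
broken = pd.DataFrame({"Description": [
    "Tab Alpha of type PARTSTUDIO opened by User A",
    "Add assembly feature",
    "Tab Beta of type PARTSTUDIO opened by User A",
]})

print(actions_alternate(fixed))   # True
print(actions_alternate(broken))  # False
```

Running `actions_alternate(complete_df)` on the repaired frame should then return True if every missing open or close was filled in.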