Calculating over partition in pandas dataframe
I have a table like this:
ID Timestamp Status
A 5/30/2022 2:29 Run Ended
A 5/30/2022 0:23 In Progress
A 5/30/2022 0:22 Prepared
B 5/30/2022 11:15 Run Ended
B 5/30/2022 9:18 In Progress
B 5/30/2022 0:55 Prepared
I want to calculate the duration between each status, grouped by ID.
So the resulting output table would be:
ID Duration(min) Status change
A 0.40 In Progress-Prepared
A 125.82 Run Ended - In Progress
B 502.78 In Progress-Prepared
B 117.34 Run Ended - In Progress
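How can I sort it by descending timestamp (grouped by ID) and then, within each ID group, subtract each row from the row before it, working from the last row up to the top?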
You can use groupby('ID')[value].shift(1) to access the previous value within the same ID group.
import pandas as pd
df = pd.DataFrame({
'ID': ['a','a','a','b','b','b'],
'time': [1,2,3,1,4,5],
'status': ['x','y','z','xx','yy','zz']
})
# previous row's values within the same ID group (NaN for the first row of each group)
df['previous_time'] = df.groupby('ID')['time'].shift(1)
df['previous_status'] = df.groupby('ID')['status'].shift(1)
# drop the first row of each group, which has no previous row
df = df.dropna().copy()
df['duration'] = df['time'] - df['previous_time'] # for real timestamps, subtract datetimes here instead
df['status_change'] = df['previous_status'] + '-' + df['status']
# to_markdown needs the optional tabulate package
print(df[['ID','duration','status_change']].to_markdown(index=False))
Output:
| ID | duration | status_change |
|---|---|---|
| a | 1 | x-y |
| a | 1 | y-z |
| b | 3 | xx-yy |
| b | 1 | yy-zz |
P.S. You can use the answers in this thread to subtract time and previous_time when they are real timestamps.
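As a rough sketch of how the same shift approach might look on the timestamps from the question: it assumes the column names from the question's table and a month/day/year format with no seconds. The names `result` and `duration_min` are illustrative, not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Timestamp': ['5/30/2022 2:29', '5/30/2022 0:23', '5/30/2022 0:22',
                  '5/30/2022 11:15', '5/30/2022 9:18', '5/30/2022 0:55'],
    'Status': ['Run Ended', 'In Progress', 'Prepared',
               'Run Ended', 'In Progress', 'Prepared'],
})

# Assumption: timestamps are month/day/year with hours and minutes only.
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%m/%d/%Y %H:%M')

# Sort oldest-first within each ID so that shift(1) really is the previous event.
df = df.sort_values(['ID', 'Timestamp']).reset_index(drop=True)

df['previous_time'] = df.groupby('ID')['Timestamp'].shift(1)
df['previous_status'] = df.groupby('ID')['Status'].shift(1)

# The first row of each group has no predecessor, so drop it.
result = df.dropna(subset=['previous_time']).copy()

# Subtracting two datetimes gives a Timedelta; convert it to minutes.
result['duration_min'] = (
    (result['Timestamp'] - result['previous_time']).dt.total_seconds() / 60
)
result['status_change'] = result['previous_status'] + '-' + result['Status']

print(result[['ID', 'duration_min', 'status_change']].to_string(index=False))
```

Here the rows are sorted ascending, so each status change reads in chronological order (earlier status first).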
You can use groupby.diff and groupby.shift:
out = (df
.assign(**{'Duration(min)': pd.to_datetime(df['Timestamp'], dayfirst=False)
.groupby(df['ID'])
.diff(-1).dt.total_seconds() # diff in seconds to next time in group
.div(60), # convert to minutes
'Status change': df.groupby('ID')['Status'].shift(-1)+'-'+df['Status']
})
.dropna(subset=['Duration(min)']) # drop the last row of each group, which has no next row
[['ID', 'Duration(min)', 'Status change']]
)
Output:
ID Duration(min) Status change
0 A 126.0 In Progress-Run Ended
1 A 1.0 Prepared-In Progress
3 B 117.0 In Progress-Run Ended
4 B 503.0 Prepared-In Progress
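The snippet above relies on the rows already being in descending timestamp order within each ID, as they are in the question. If that is not guaranteed, one possible way to enforce it first is sketched below. The shuffled input and the intermediate name `ordered` are made up for illustration, and the timestamp format is assumed to be month/day/year.

```python
import pandas as pd

# Same rows as in the question, deliberately shuffled.
df = pd.DataFrame({
    'ID': ['B', 'A', 'A', 'B', 'A', 'B'],
    'Timestamp': ['5/30/2022 9:18', '5/30/2022 0:22', '5/30/2022 2:29',
                  '5/30/2022 0:55', '5/30/2022 0:23', '5/30/2022 11:15'],
    'Status': ['In Progress', 'Prepared', 'Run Ended',
               'Prepared', 'In Progress', 'Run Ended'],
})

# Parse the timestamps, then sort by ID (ascending) and time (descending).
ordered = (df
    .assign(Timestamp=pd.to_datetime(df['Timestamp'], format='%m/%d/%Y %H:%M'))
    .sort_values(['ID', 'Timestamp'], ascending=[True, False])
    .reset_index(drop=True)
)

out = (ordered
    .assign(**{
        # current row minus the next (earlier) row in the same group, in minutes
        'Duration(min)': ordered.groupby('ID')['Timestamp']
                                .diff(-1).dt.total_seconds().div(60),
        'Status change': ordered.groupby('ID')['Status'].shift(-1)
                         + '-' + ordered['Status'],
    })
    .dropna(subset=['Duration(min)'])
    [['ID', 'Duration(min)', 'Status change']]
)
print(out)
```

Resetting the index after sorting keeps the row labels aligned with the displayed order, so the result carries the same index labels (0, 1, 3, 4) as the output shown above.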