Measure difference between timestamps using conditions - python

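I am trying to measure the difference between timestamps using certain conditions. Using the data below, for each unique ID, I want to subtract the End Time where Item == A from the Start Time where Item == D.

So the two timestamps are actually located on separate rows.

At the moment my process returns an error. I would also like to drop .shift() in favour of something more robust, because each unique ID will have a different combination of items. For example: A,B,C,D - A,B,D - A,D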

import pandas as pd

df = pd.DataFrame({'ID': [10,10,10,20,20,30],
                   'Start Time': ['2019-08-02 09:00:00','2019-08-03 10:50:00','2019-08-05 16:00:00','2019-08-04 08:00:00','2019-08-04 15:30:00','2019-08-06 11:00:00'],
                   'End Time': ['2019-08-04 15:00:00','2019-08-04 16:00:00','2019-08-05 16:00:00','2019-08-04 14:00:00','2019-08-05 20:30:00','2019-08-07 10:00:00'],
                   'Item': ['A','B','D','A','D','A'],
                   })

df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])

df['diff'] = (df.groupby('ID')
                .apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
                .reset_index(drop=True))

Intended output:

   ID          Start Time            End Time Item            diff
0  10 2019-08-02 09:00:00 2019-08-04 15:00:00    A             NaT
1  10 2019-08-03 10:50:00 2019-08-04 16:00:00    B             NaT
2  10 2019-08-05 16:00:00 2019-08-05 16:00:00    D 1 days 01:00:00
3  20 2019-08-04 08:00:00 2019-08-04 14:00:00    A             NaT
4  20 2019-08-04 15:30:00 2019-08-05 20:30:00    D 0 days 01:30:00
5  30 2019-08-06 11:00:00 2019-08-07 10:00:00    A             NaT

df2 = df.set_index('ID')
df2.query('Item == "D"')['Start Time']-df2.query('Item == "A"')['End Time']

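Output (computed on an earlier revision of the question's data, so the values differ from the expected table above):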

ID
10   2 days 05:30:00
20   0 days 20:30:00
30               NaT
dtype: timedelta64[ns]
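The subtraction above yields one value per ID, while the expected table in the question places the value on the D row of each ID. A minimal sketch of that last step follows, using the data from the question. It assumes each ID has at most one A row and at most one D row; the names `per_id` and `is_d` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [10, 10, 10, 20, 20, 30],
    'Start Time': ['2019-08-02 09:00:00', '2019-08-03 10:50:00', '2019-08-05 16:00:00',
                   '2019-08-04 08:00:00', '2019-08-04 15:30:00', '2019-08-06 11:00:00'],
    'End Time': ['2019-08-04 15:00:00', '2019-08-04 16:00:00', '2019-08-05 16:00:00',
                 '2019-08-04 14:00:00', '2019-08-05 20:30:00', '2019-08-07 10:00:00'],
    'Item': ['A', 'B', 'D', 'A', 'D', 'A'],
})
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])

# One value per ID: Start Time of the D row minus End Time of the A row.
# Assumption: at most one A row and one D row per ID, so both indexes are unique.
d_start = df.loc[df['Item'] == 'D'].set_index('ID')['Start Time']
a_end = df.loc[df['Item'] == 'A'].set_index('ID')['End Time']
per_id = d_start - a_end  # aligned on ID; NaT where an ID lacks an A or a D row

# Look up each row's ID in the per-ID result, then keep it only on D rows.
is_d = df['Item'] == 'D'
df['diff'] = df['ID'].map(per_id).where(is_d)

print(df)
```

Because the lookup is keyed on ID and Item rather than on row position, it does not depend on how many rows sit between A and D.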

Older answer

The issue is in your fillna: a timedelta column cannot hold strings.

df['diff'] = (df.groupby('ID')
                .apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
                #.fillna('-')  # the issue is here
                .reset_index(drop=True))

Output:

   ID          Start Time            End Time Item            diff
0  10 2019-08-02 09:00:00 2019-08-02 09:30:00    A             NaT
1  10 2019-08-03 10:50:00 2019-08-03 11:00:00    B 0 days 00:30:00
2  10 2019-08-04 15:00:00 2019-08-05 16:00:00    C 0 days 00:10:00
3  20 2019-08-04 08:00:00 2019-08-04 14:00:00    B             NaT
4  20 2019-08-05 10:30:00 2019-08-05 20:30:00    C 0 days 06:00:00
5  30 2019-08-06 11:00:00 2019-08-07 10:00:00    A             NaT
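If a dash is still wanted for display, one option is to keep `diff` as a real timedelta column and build a separate string column from it. The sketch below uses a small made-up frame; `diff_display` is a hypothetical column name. It also calls `shift` on the grouped columns directly, which returns a result aligned to the original index and so avoids the `apply` and `reset_index` step.

```python
import pandas as pd

# Small illustrative frame (not the question's full data).
df = pd.DataFrame({
    'ID': [10, 10, 20],
    'Start Time': pd.to_datetime(['2019-08-02 09:00:00',
                                  '2019-08-03 10:50:00',
                                  '2019-08-04 08:00:00']),
    'End Time': pd.to_datetime(['2019-08-02 09:30:00',
                                '2019-08-03 11:00:00',
                                '2019-08-04 14:00:00']),
})

# Duration of the previous row within the same ID; first row of each ID is NaT.
g = df.groupby('ID')
df['diff'] = g['End Time'].shift(1) - g['Start Time'].shift(1)

# Strings go into a separate column so 'diff' stays usable for arithmetic.
df['diff_display'] = df['diff'].map(lambda td: '-' if pd.isna(td) else str(td))

print(df[['ID', 'diff', 'diff_display']])
```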

IIUC, use:

# keyword arguments: recent pandas versions no longer accept these positionally
df1 = df.pivot(index='ID', columns='Item')
print(df1)
              Start Time                                          \
Item                   A                   B                   D   
ID                                                                 
10   2019-08-02 09:00:00 2019-08-03 10:50:00 2019-08-04 15:00:00   
20   2019-08-04 08:00:00                 NaT 2019-08-05 10:30:00   
30   2019-08-06 11:00:00                 NaT                 NaT   

                End Time                                          
Item                   A                   B                   D  
ID                                                                
10   2019-08-02 09:30:00 2019-08-03 11:00:00 2019-08-05 16:00:00  
20   2019-08-04 14:00:00                 NaT 2019-08-05 20:30:00  
30   2019-08-07 10:00:00                 NaT                 NaT  

a = df1[('Start Time','D')].sub(df1[('End Time','A')])
print(a)
ID
10   2 days 05:30:00
20   0 days 20:30:00
30               NaT
dtype: timedelta64[ns]
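One caveat with `pivot`: it raises a ValueError when an ID contains the same Item more than once. The sketch below is a possible alternative for that case, not part of the answer above. It masks the rows of interest and reduces per ID. The sample data is made up, and taking the earliest A end and the latest D start is an assumption that should be adjusted to the real requirement.

```python
import pandas as pd

# Hypothetical data: ID 10 has two D rows, ID 20 has no D row.
df = pd.DataFrame({
    'ID': [10, 10, 10, 10, 20],
    'Start Time': pd.to_datetime(['2019-08-02 09:00:00', '2019-08-03 10:50:00',
                                  '2019-08-04 15:00:00', '2019-08-06 12:00:00',
                                  '2019-08-04 08:00:00']),
    'End Time': pd.to_datetime(['2019-08-02 09:30:00', '2019-08-03 11:00:00',
                                '2019-08-05 16:00:00', '2019-08-06 13:00:00',
                                '2019-08-04 14:00:00']),
    'Item': ['A', 'B', 'D', 'D', 'A'],
})

# Blank out the rows that are not A (or not D), then reduce per ID.
# Assumption: earliest A end and latest D start are the values of interest.
a_end = df['End Time'].where(df['Item'] == 'A').groupby(df['ID']).min()
d_start = df['Start Time'].where(df['Item'] == 'D').groupby(df['ID']).max()

a = d_start.sub(a_end)  # NaT for IDs missing either an A or a D row
print(a)
```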