Starting from the beginning of a sequence, I need to delete any instance that is less than 30 days from the next instance

I'll start with my dataset:

patient_id  event_description  value
A           DiagnosisA         2016-01-15
A           DiagnosisA         2016-02-10
A           DiagnosisA         2016-04-20
A           DiagnosisA         2016-06-02
B           DiagnosisA         2016-08-15
B           DiagnosisA         2016-08-20
B           DiagnosisA         2016-09-20
B           DiagnosisA         2016-10-30
C           DiagnosisA         2016-10-15
C           DiagnosisA         2016-11-20
C           DiagnosisA         2016-11-25
C           DiagnosisA         2016-12-30

Basically, I need:

The final sample dataset should look like this:

patient_id  event_description  value
A           DiagnosisA         2016-01-15
A           DiagnosisA         2016-04-20
A           DiagnosisA         2016-06-02
B           DiagnosisA         2016-08-15
B           DiagnosisA         2016-09-20
B           DiagnosisA         2016-10-30
C           DiagnosisA         2016-10-15
C           DiagnosisA         2016-11-20
C           DiagnosisA         2016-12-30

Use groupby with diff.

Note: convert your dates to datetime first: df.value=pd.to_datetime(df.value)

df[~df.groupby('patient_id').value.diff().dt.days.lt(30)]
Out[754]: 
   patient_id event_description      value
0           A        DiagnosisA 2016-01-15
2           A        DiagnosisA 2016-04-20
3           A        DiagnosisA 2016-06-02
4           B        DiagnosisA 2016-08-15
6           B        DiagnosisA 2016-09-20
7           B        DiagnosisA 2016-10-30
8           C        DiagnosisA 2016-10-15
9           C        DiagnosisA 2016-11-20
11          C        DiagnosisA 2016-12-30

Data input

df
Out[755]: 
   patient_id event_description      value
0           A        DiagnosisA 2016-01-15
1           A        DiagnosisA 2016-02-10
2           A        DiagnosisA 2016-04-20
3           A        DiagnosisA 2016-06-02
4           B        DiagnosisA 2016-08-15
5           B        DiagnosisA 2016-08-20
6           B        DiagnosisA 2016-09-20
7           B        DiagnosisA 2016-10-30
8           C        DiagnosisA 2016-10-15
9           C        DiagnosisA 2016-11-20
10          C        DiagnosisA 2016-11-25
11          C        DiagnosisA 2016-12-30
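
For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the input above and splits the one-liner into steps. The intermediate names gap_days, too_close and result are mine, chosen for illustration, and the sketch assumes the rows are already sorted by date within each patient.

```python
import pandas as pd

# Rebuild the sample input; the column name "value" follows the output shown above.
df = pd.DataFrame({
    "patient_id": list("AAAABBBBCCCC"),
    "event_description": ["DiagnosisA"] * 12,
    "value": [
        "2016-01-15", "2016-02-10", "2016-04-20", "2016-06-02",
        "2016-08-15", "2016-08-20", "2016-09-20", "2016-10-30",
        "2016-10-15", "2016-11-20", "2016-11-25", "2016-12-30",
    ],
})
df["value"] = pd.to_datetime(df["value"])

# Days elapsed since the previous row of the same patient.
# The first row of each patient has no predecessor, so its gap is NaN.
gap_days = df.groupby("patient_id")["value"].diff().dt.days

# NaN < 30 evaluates to False, so each patient's first row is never flagged.
too_close = gap_days.lt(30)

# Keep everything that is not within 30 days of the row before it.
result = df[~too_close]
print(result)
```

Note that diff compares each row with the row immediately before it, not with the last row that was kept. For the sample data the two readings give the same result, but they can differ when several events fall close together.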