Starting from the beginning of a sequence, I need to delete any instance that is less than 30 days from the next instance
I'll start with my dataset:
patient_id event_description value
A DiagnosisA 2016-01-15
A DiagnosisA 2016-02-10
A DiagnosisA 2016-04-20
A DiagnosisA 2016-06-02
B DiagnosisA 2016-08-15
B DiagnosisA 2016-08-20
B DiagnosisA 2016-09-20
B DiagnosisA 2016-10-30
C DiagnosisA 2016-10-15
C DiagnosisA 2016-11-20
C DiagnosisA 2016-11-25
C DiagnosisA 2016-12-30
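Basically, I need to:
- Calculate the difference between the first and second instance of event_description and check whether it is less than or greater than 30 days. If it is less than 30, I drop that instance (the later of the two, as the expected output shows).
- Do this for every instance of each event_description and patient_id.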
The final sample dataset should look like this:
patient_id event_description value
A DiagnosisA 2016-01-15
A DiagnosisA 2016-04-20
A DiagnosisA 2016-06-02
B DiagnosisA 2016-08-15
B DiagnosisA 2016-09-20
B DiagnosisA 2016-10-30
C DiagnosisA 2016-10-15
C DiagnosisA 2016-11-20
C DiagnosisA 2016-12-30
Use groupby and diff.
Note: convert your dates to datetime first:
df.value = pd.to_datetime(df.value)
df[~df.groupby('patient_id').value.diff().dt.days.lt(30)]
Out[754]:
patient_id event_description value
0 A DiagnosisA 2016-01-15
2 A DiagnosisA 2016-04-20
3 A DiagnosisA 2016-06-02
4 B DiagnosisA 2016-08-15
6 B DiagnosisA 2016-09-20
7 B DiagnosisA 2016-10-30
8 C DiagnosisA 2016-10-15
9 C DiagnosisA 2016-11-20
11 C DiagnosisA 2016-12-30
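For reference, here is a self-contained sketch of the same approach that rebuilds the sample data. It assumes the date column is named `value`, as in the output above. Sorting first and grouping by both `patient_id` and `event_description` are additions the one-liner does not make; they are included because the question asks for the check per patient and per event.

```python
import pandas as pd

# Sample data from the question; "value" is the assumed name of the date column.
df = pd.DataFrame({
    "patient_id": list("AAAABBBBCCCC"),
    "event_description": ["DiagnosisA"] * 12,
    "value": [
        "2016-01-15", "2016-02-10", "2016-04-20", "2016-06-02",
        "2016-08-15", "2016-08-20", "2016-09-20", "2016-10-30",
        "2016-10-15", "2016-11-20", "2016-11-25", "2016-12-30",
    ],
})
df["value"] = pd.to_datetime(df["value"])

# Sort so that diff() compares each date with the one just before it.
df = df.sort_values(["patient_id", "event_description", "value"])

# Gap in days to the previous row within each (patient, event) group.
# The first row of every group has no previous row, so its gap is NaN.
gap_days = (
    df.groupby(["patient_id", "event_description"])["value"].diff().dt.days
)

# NaN < 30 evaluates to False, so the first row of each group is kept.
result = df[~gap_days.lt(30)]
print(result)
```

On the sample data this keeps the rows with index 0, 2, 3, 4, 6, 7, 8, 9 and 11, matching the output above.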
Data input
df
Out[755]:
patient_id event_description value
0 A DiagnosisA 2016-01-15
1 A DiagnosisA 2016-02-10
2 A DiagnosisA 2016-04-20
3 A DiagnosisA 2016-06-02
4 B DiagnosisA 2016-08-15
5 B DiagnosisA 2016-08-20
6 B DiagnosisA 2016-09-20
7 B DiagnosisA 2016-10-30
8 C DiagnosisA 2016-10-15
9 C DiagnosisA 2016-11-20
10 C DiagnosisA 2016-11-25
11 C DiagnosisA 2016-12-30
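One caveat worth checking: `diff` measures the gap to the immediately preceding row, even when that row is itself dropped. For the sample data this gives the expected result, but in a run of closely spaced events it can differ from comparing each event against the last kept one. If the latter is what is wanted (an assumption; the question does not say), a plain loop per group is one way to sketch it. `keep_spaced` and the `demo` data are hypothetical and only for illustration.

```python
import pandas as pd

def keep_spaced(dates, min_days=30):
    """Return a list of booleans: True where a date is at least
    min_days after the most recently kept date (hypothetical helper)."""
    keep = []
    last_kept = None
    for d in dates:
        if last_kept is None or (d - last_kept).days >= min_days:
            keep.append(True)
            last_kept = d
        else:
            keep.append(False)
    return keep

# Illustrative data, not from the question: three events close together.
demo = pd.DataFrame({
    "patient_id": ["A"] * 4,
    "value": pd.to_datetime(
        ["2016-01-01", "2016-01-20", "2016-02-05", "2016-03-10"]
    ),
}).sort_values(["patient_id", "value"])

# Sequential rule: compare against the last kept date in each group.
mask = pd.Series(False, index=demo.index)
for _, grp in demo.groupby("patient_id"):
    mask.loc[grp.index] = keep_spaced(grp["value"])

# diff-based rule from the answer, for comparison.
diff_mask = ~demo.groupby("patient_id")["value"].diff().dt.days.lt(30)

print(demo[mask])       # keeps 2016-02-05 (35 days after 2016-01-01)
print(demo[diff_mask])  # drops 2016-02-05 (16 days after 2016-01-20)
```

The loop is slower than the vectorized one-liner, so it is only worth using if the two rules actually disagree on your data.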