Compare dates within a dataframe and assign a value to another variable

I have two dataframes (df and df1), as shown below:

import numpy as np
import pandas as pd
from datetime import timedelta

df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
                   'start_date': ['5/7/2013 09:27:00 AM', '09/08/2013 11:21:00 AM',
                                  '06/06/2014 08:00:00 AM', '06/06/2014 05:00:00 AM',
                                  '12/11/2011 10:00:00 AM', '13/10/2012 12:00:00 AM',
                                  '13/12/2012 11:45:00 AM']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = df.start_date + timedelta(days=5)
df['enc_id'] = ['ABC1','ABC2','ABC3','ABC4','DEF1','DEF2','DEF3']

df1 = pd.DataFrame({'person_id': [101,101,101,101,101,101,101,202,202,202,202,202,202,202,202],
                    'date_1': ['07/07/2013 11:20:00 AM', '05/07/2013 02:30:00 PM', '06/07/2013 02:40:00 PM',
                               '08/06/2014 12:00:00 AM', '11/06/2014 12:00:00 AM', '02/03/2013 12:30:00 PM',
                               '13/06/2014 12:00:00 AM', '12/11/2011 12:00:00 AM', '13/10/2012 07:00:00 AM',
                               '13/12/2015 12:00:00 AM', '13/12/2012 12:00:00 AM', '13/12/2012 06:30:00 PM',
                               '13/07/2011 10:00:00 AM', '18/12/2012 10:00:00 AM', '19/12/2013 11:00:00 AM']})
df1['date_1'] = pd.to_datetime(df1['date_1'])
df1['within_id'] = ['ABC','ABC','ABC','ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF','DEF','DEF','DEF',np.nan]

What I would like to do is:

a) Pick each person from df1 who does not have NA in the 'within_id' column, and check whether their date_1 lies between (df.start_date - 1) and (df.end_date + 1) for the same person in df and for the same within_id or enc_id.

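Example: for subject = 101 and within_id = ABC, we have date_1 = 7/7/2013, and you check whether it lies between 4/7/2013 (df.start_date - 1) and 11/7/2013 (df.end_date + 1).

Since the comparison against the first row already gives us the result, we do not have to compare this date_1 with the remaining records in df for subject 101. If it does not, we need to keep scanning until we find the interval within which date_1 falls.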

b) If a date interval is found, assign the corresponding enc_id from df to within_id in df1.

c) If not, assign "out of range".
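Steps a) to c) can be sketched for a single person roughly as follows. The helper name `match_enc_id` and the sample rows are hypothetical, chosen only to illustrate the rule; the one-day padding follows the description above.

```python
import pandas as pd

def match_enc_id(date_1, visits, pad=pd.Timedelta(days=1)):
    """Return the enc_id of the first visit whose padded interval contains date_1."""
    for row in visits.itertuples():
        if row.start_date - pad <= date_1 <= row.end_date + pad:
            return row.enc_id      # b) interval found: stop scanning
    return 'out of range'          # c) no interval contains date_1

# Hypothetical visits for one person (same layout as df)
visits = pd.DataFrame({
    'start_date': pd.to_datetime(['2013-07-05 09:27', '2013-08-09 11:21']),
    'end_date':   pd.to_datetime(['2013-07-10 09:27', '2013-08-14 11:21']),
    'enc_id':     ['ABC1', 'ABC2'],
})

first = match_enc_id(pd.Timestamp('2013-07-07 11:20'), visits)
second = match_enc_id(pd.Timestamp('2013-02-03 12:30'), visits)
print(first, second)
```

A row-by-row scan like this is only meant to pin down the rule; it is unlikely to scale to millions of records.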

I tried the following:

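# Sort each frame by date within each person, then place them side by side
t1 = df.groupby('person_id').apply(pd.DataFrame.sort_values, 'start_date')
t2 = df1.groupby('person_id').apply(pd.DataFrame.sort_values, 'date_1')
t3 = pd.concat([t1, t2], axis=1)
# Element-wise conditions are combined with & and each comparison is parenthesized (&& is not valid Python)
t3['within_id'] = np.where((t3['date_1'] >= t3['start_date']) & (t3['date_1'] <= t3['end_date']),
                           t3['enc_id'], 'out of range')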

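I would like my output to look as shown below (see also row 14 at the bottom of the screenshot). Since I intend to apply the solution to big data (4-5 million records, with possibly 5000-6000 unique person_ids), any efficient and elegant solution would be helpful.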

   14      202     2012-12-13 11:00:00   NA

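I used the df and df1 provided above.

  • The basic approach is to iterate over df1 and extract the matching value of enc_id.
  • I added a 'rule' column to show how each value was populated.

Unfortunately, I could not reproduce the expected result. Perhaps the general approach will still be useful.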

df1['rule'] = 0
for t in df1.itertuples():
        
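    person = (t.person_id == df.person_id)
    # b: [date_1, date_2] lies entirely inside [start_date, end_date]
    b = (t.date_1 >= df.start_date) & (t.date_2 <= df.end_date)
    # c: begins on/after start_date and ends on/after end_date
    c = (t.date_1 >= df.start_date) & (t.date_2 >= df.end_date)
    # d: begins on/before start_date and ends on/before end_date
    d = (t.date_1 <= df.start_date) & (t.date_2 <= df.end_date)
    # e: both dates fall on/before start_date
    e = (t.date_1 <= df.start_date) & (t.date_2 <= df.start_date) # start_date at BOTH ends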
    
    if (m := person & b).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 1
        
    elif (m := person & c).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 10
        
    elif (m := person & d).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 100
        
    elif (m := person & e).any():
        df1.at[t.Index, 'within_id'] = 'out of range'
        df1.at[t.Index, 'rule'] += 1_000
    else:
        df1.at[t.Index, 'within_id'] = 'impossible!'
        df1.at[t.Index, 'rule'] += 10_000

df1['within_id'] = df1['within_id'].astype('Int64')

The result is:

print(df1)

    person_id              date_1              date_2    within_id  rule
0          11 1961-12-30 00:00:00 1962-01-01 00:00:00  11345678901     1
1          11 1962-01-30 00:00:00 1962-02-01 00:00:00  11345678902     1
2          12 1962-02-28 00:00:00 1962-03-02 00:00:00  34567892101   100
3          12 1989-07-29 00:00:00 1989-07-31 00:00:00  34567892101     1
4          12 1989-09-03 00:00:00 1989-09-05 00:00:00  34567892101    10
5          12 1989-10-02 00:00:00 1989-10-04 00:00:00  34567892103     1
6          12 1989-10-01 00:00:00 1989-10-03 00:00:00  34567892103     1
7          13 1999-03-29 00:00:00 1999-03-31 00:00:00  56432718901     1
8          13 1999-04-20 00:00:00 1999-04-22 00:00:00  56432718901    10
9          13 1999-06-02 00:00:00 1999-06-04 00:00:00  56432718904     1
10         13 1999-06-03 00:00:00 1999-06-05 00:00:00  56432718904     1
11         13 1999-07-29 00:00:00 1999-07-31 00:00:00  56432718905     1
12         14 2002-02-03 10:00:00 2002-02-05 10:00:00  24680135791     1
13         14 2002-02-03 10:00:00 2002-02-05 10:00:00  24680135791     1

Let's try:

d = df1.merge(df.assign(within_id=df['enc_id'].str[:3]),
              on=['person_id', 'within_id'], how='left', indicator=True)

m = d['date_1'].between(d['start_date'] - pd.Timedelta(days=1),
                        d['end_date']   + pd.Timedelta(days=1))

d = df1.merge(d[m | d['_merge'].ne('both')], on=['person_id', 'date_1'], how='left')
d['within_id'] = d['enc_id'].fillna('out of range').mask(d['_merge'].eq('left_only'))
d = d[df1.columns]

Details:

Left merge the dataframe df1 with df on person_id and within_id:

print(d)
    person_id              date_1 within_id          start_date            end_date enc_id     _merge
0         101 2013-07-07 11:20:00       ABC 2013-05-07 09:27:00 2013-05-12 09:27:00   ABC1       both
1         101 2013-07-07 11:20:00       ABC 2013-09-08 11:21:00 2013-09-13 11:21:00   ABC2       both
2         101 2013-07-07 11:20:00       ABC 2014-06-06 08:00:00 2014-06-11 08:00:00   ABC3       both
3         101 2013-07-07 11:20:00       ABC 2014-06-06 05:00:00 2014-06-11 05:00:00   ABC4       both
....
47        202 2012-12-18 10:00:00       DEF 2012-10-13 00:00:00 2012-10-18 00:00:00   DEF2       both
48        202 2012-12-18 10:00:00       DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
49        202 2013-12-19 11:00:00       NaN                 NaT                 NaT    NaN  left_only
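To see what `indicator=True` contributes to this step, here is a minimal sketch on hypothetical two-row data (the frames `left` and `right` are not part of the original post). Rows of the left frame that find no partner are tagged `left_only` in the `_merge` column, which is how the row with a missing within_id is recognized later.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of df1 (left) and df (right)
left = pd.DataFrame({'person_id': [101, 202], 'within_id': ['ABC', np.nan]})
right = pd.DataFrame({'person_id': [101, 101, 202],
                      'within_id': ['ABC', 'ABC', 'DEF'],
                      'enc_id': ['ABC1', 'ABC2', 'DEF1']})

merged = left.merge(right, on=['person_id', 'within_id'],
                    how='left', indicator=True)

# One output row per matching pair; unmatched left rows are kept with NaN
flags = merged['_merge'].astype(str).tolist()
print(merged)
```

Note that the left merge multiplies rows: each df1 row is paired with every df row sharing its keys, so the intermediate frame can become large on real data.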

Create a boolean mask m representing the condition that date_1 lies between df.start_date - 1 days and df.end_date + 1 days:

print(m)
0     False
1     False
2     False
3     False
...
47    False
48     True
49    False
dtype: bool
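The behaviour of `Series.between` with padded bounds can be checked on a few hypothetical rows (the values below are illustrative, not taken from the merged frame). Both bounds are inclusive by default, and rows whose bounds are NaT compare as False, which is why unmatched rows do not pass the mask.

```python
import pandas as pd

one_day = pd.Timedelta(days=1)

# Hypothetical start/end pairs; the last row has no interval at all (NaT)
start = pd.Series(pd.to_datetime(['2012-12-13 11:45', '2012-10-13 00:00', None]))
end = start + pd.Timedelta(days=5)
date_1 = pd.Series(pd.to_datetime(['2012-12-18 10:00'] * 3))

# Element-wise check: start - 1 day <= date_1 <= end + 1 day
mask = date_1.between(start - one_day, end + one_day)
print(mask.tolist())
```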

Left merge the dataframe df1 again, this time with the dataframe filtered using mask m, on the columns person_id and date_1:

print(d)

    person_id              date_1 within_id_x within_id_y          start_date            end_date enc_id     _merge
0         101 2013-07-07 11:20:00         ABC         NaN                 NaT                 NaT    NaN        NaN
1         101 2013-05-07 14:30:00         ABC         ABC 2013-05-07 09:27:00 2013-05-12 09:27:00   ABC1       both
2         101 2013-06-07 14:40:00         ABC         NaN                 NaT                 NaT    NaN        NaN
3         101 2014-08-06 00:00:00         ABC         NaN                 NaT                 NaT    NaN        NaN
4         101 2014-11-06 00:00:00         ABC         NaN                 NaT                 NaT    NaN        NaN
5         101 2013-02-03 12:30:00         ABC         NaN                 NaT                 NaT    NaN        NaN
6         101 2014-06-13 00:00:00         ABC         NaN                 NaT                 NaT    NaN        NaN
7         202 2011-12-11 00:00:00         DEF         DEF 2011-12-11 10:00:00 2011-12-16 10:00:00   DEF1       both
8         202 2012-10-13 07:00:00         DEF         DEF 2012-10-13 00:00:00 2012-10-18 00:00:00   DEF2       both
9         202 2015-12-13 00:00:00         DEF         NaN                 NaT                 NaT    NaN        NaN
10        202 2012-12-13 00:00:00         DEF         DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
11        202 2012-12-13 18:30:00         DEF         DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
12        202 2011-07-13 10:00:00         DEF         NaN                 NaT                 NaT    NaN        NaN
13        202 2012-12-18 10:00:00         DEF         DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
14        202 2013-12-19 11:00:00         NaN         NaN                 NaT                 NaT    NaN  left_only

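Populate the within_id column from enc_id, use Series.fillna to fill the NaN values with "out of range" (excluding the rows that have no match in df), and finally filter the columns to get the result: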

print(d)
    person_id              date_1     within_id
0         101 2013-07-07 11:20:00  out of range
1         101 2013-05-07 14:30:00          ABC1
2         101 2013-06-07 14:40:00  out of range
3         101 2014-08-06 00:00:00  out of range
4         101 2014-11-06 00:00:00  out of range
5         101 2013-02-03 12:30:00  out of range
6         101 2014-06-13 00:00:00  out of range
7         202 2011-12-11 00:00:00          DEF1
8         202 2012-10-13 07:00:00          DEF2
9         202 2015-12-13 00:00:00  out of range
10        202 2012-12-13 00:00:00          DEF3
11        202 2012-12-13 18:30:00          DEF3
12        202 2011-07-13 10:00:00  out of range
13        202 2012-12-18 10:00:00          DEF3
14        202 2013-12-19 11:00:00           NaN
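For the data sizes mentioned in the question, the full left merge may grow large. As a possible alternative (not the method used above), `pd.merge_asof` can pick, per person, the latest interval that starts on or before date_1, after which a single comparison checks the upper bound. This is a sketch under assumptions: the frames `visits` and `events` and the columns `lo` and `hi` are hypothetical, and the approach only finds the right interval when a person's intervals do not nest, for example when all of them have the same length as here.

```python
import pandas as pd

pad = pd.Timedelta(days=1)

# Hypothetical, already-parsed sample data with the same layout as df and df1
visits = pd.DataFrame({
    'person_id': [101, 101, 202],
    'start_date': pd.to_datetime(['2013-05-07 09:27', '2013-09-08 11:21', '2012-12-13 11:45']),
    'enc_id': ['ABC1', 'ABC2', 'DEF3'],
})
visits['end_date'] = visits['start_date'] + pd.Timedelta(days=5)
events = pd.DataFrame({
    'person_id': [101, 101, 202],
    'date_1': pd.to_datetime(['2013-05-07 14:30', '2013-07-07 11:20', '2012-12-18 10:00']),
})

# Padded bounds; cast so both merge keys share exactly the same dtype
visits['lo'] = (visits['start_date'] - pad).astype(events['date_1'].dtype)
visits['hi'] = visits['end_date'] + pad

# merge_asof needs both frames sorted by their "on" key
matched = pd.merge_asof(
    events.reset_index().sort_values('date_1'),
    visits.sort_values('lo')[['person_id', 'lo', 'hi', 'enc_id']],
    left_on='date_1', right_on='lo', by='person_id', direction='backward',
)

# Keep enc_id only when date_1 also respects the padded upper bound
inside = matched['date_1'] <= matched['hi']
matched['within_id'] = matched['enc_id'].where(inside, 'out of range')

# Restore the original row order of events
result = matched.sort_values('index').set_index('index')[['person_id', 'date_1', 'within_id']]
print(result)
```

Rows with a missing within_id would still need to be set aside beforehand, as the `_merge` indicator does in the answer above.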