Get records that are a time interval away from a given date and specific conditions on a pandas DataFrame

Consider the following pandas DataFrame:

|   ID      |    date                 |  direction    | country_ID |
|-----------|-------------------------|---------------|------------|
|   0       |    2022-04-01 10:00:01  |    IN         |    UK      |
| unknown   |    2022-04-01 10:00:03  |    IN         |    UK      |
|   0       |    2022-04-01 12:00:01  |    OUT        |    UK      |
|   0       |    2022-04-01 12:30:11  |    IN         |    GER     |
|   1       |    2022-04-01 10:00:00  |    IN         |    GER     |
|   1       |    2022-04-01 08:04:03  |    OUT        |    GER     |
| unknown   |    2022-04-01 10:20:02  |    OUT        |    USA     |
| unknown   |    2022-04-01 09:59:58  |    IN         |    GER     |
| unknown   |    2022-04-01 05:04:03  |    OUT        |    ITL     |
| unknown   |    2022-04-01 05:04:01  |    OUT        |    ITL     |
|   2       |    2022-04-01 05:03:59  |    OUT        |    ITL     |
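
For reproducibility, here is a minimal sketch that builds the table above as a DataFrame. The variable name `df` and the choice to store every ID as a string are assumptions; the table itself does not state the column dtypes.

```python
import pandas as pd

# Rows copied from the table above. IDs are kept as strings (an assumption).
rows = [
    ("0",       "2022-04-01 10:00:01", "IN",  "UK"),
    ("unknown", "2022-04-01 10:00:03", "IN",  "UK"),
    ("0",       "2022-04-01 12:00:01", "OUT", "UK"),
    ("0",       "2022-04-01 12:30:11", "IN",  "GER"),
    ("1",       "2022-04-01 10:00:00", "IN",  "GER"),
    ("1",       "2022-04-01 08:04:03", "OUT", "GER"),
    ("unknown", "2022-04-01 10:20:02", "OUT", "USA"),
    ("unknown", "2022-04-01 09:59:58", "IN",  "GER"),
    ("unknown", "2022-04-01 05:04:03", "OUT", "ITL"),
    ("unknown", "2022-04-01 05:04:01", "OUT", "ITL"),
    ("2",       "2022-04-01 05:03:59", "OUT", "ITL"),
]
df = pd.DataFrame(rows, columns=["ID", "date", "direction", "country_ID"])

# Parse the timestamps so that time differences can be computed.
df["date"] = pd.to_datetime(df["date"])
```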

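I need to create a DataFrame containing the rows whose ID is unknown and that have a matching record: a row with the same direction and country_ID, within 2 seconds in time (this interval should be adjustable), and whose ID is not unknown.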

All rows with an unknown ID:

|   ID      |    date                 |  direction    | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown   |    2022-04-01 10:00:03  |    IN         |    UK      |
| unknown   |    2022-04-01 10:20:02  |    OUT        |    USA     |
| unknown   |    2022-04-01 09:59:58  |    IN         |    GER     |
| unknown   |    2022-04-01 05:04:03  |    OUT        |    ITL     |
| unknown   |    2022-04-01 05:04:01  |    OUT        |    ITL     |

Matching examples for each of the rows above (a table with a single row means no match was found):

|   ID      |    date                 |  direction    | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown   |    2022-04-01 10:00:03  |    IN         |    UK      |
|   0       |    2022-04-01 10:00:01  |    IN         |    UK      |

|   ID      |    date                 |  direction    | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown   |    2022-04-01 10:20:02  |    OUT        |    USA     |

|   ID      |    date                 |  direction    | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown   |    2022-04-01 09:59:58  |    IN         |    GER     |
|   1       |    2022-04-01 10:00:00  |    IN         |    GER     |

|   ID      |    date                 |  direction    | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown   |    2022-04-01 05:04:03  |    OUT        |    ITL     |

|   ID      |    date                 |  direction    | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown   |    2022-04-01 05:04:01  |    OUT        |    ITL     |
|   2       |    2022-04-01 05:03:59  |    OUT        |    ITL     |

We drop the rows without any match, which gives the resulting DataFrame:

|   ID      |    date                 |  direction    | country_ID |  date_match          |   ID_match    |
|-----------|-------------------------|---------------|------------|----------------------|---------------|
| unknown   |    2022-04-01 10:00:03  |    IN         |    UK      |  2022-04-01 10:00:01 |    0          |
| unknown   |    2022-04-01 09:59:58  |    IN         |    GER     |  2022-04-01 10:00:00 |    1          |
| unknown   |    2022-04-01 05:04:01  |    OUT        |    ITL     |  2022-04-01 05:03:59 |    2          |

Thanks in advance for your help.

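You can use a mask to split the DataFrame in two (unknown and known IDs), then use pandas.merge_asof to find, per group of direction and country_ID, the matches within 2 seconds: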

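import pandas as pd

# merge_asof needs a real datetime key
df['date'] = pd.to_datetime(df['date'])

# True for the rows whose ID is unknown
mask = df['ID'].eq('unknown')

# for each unknown row, look up the nearest known row with the same
# direction and country_ID, at most 2 seconds away
idx = (pd
 .merge_asof(df[mask].sort_values(by='date').reset_index(),
             df[~mask].sort_values(by='date'),
             by=['direction', 'country_ID'],
             on='date',
             direction='nearest', tolerance=pd.Timedelta('2s'),
             )
 # ID_y is the ID of the matched known row, NaN when nothing matched
 .loc[lambda d: d['ID_y'].notna(), 'index']
)

# keep only the matched unknown rows, in their original order
df.loc[sorted(idx)]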

Output:

        ID                date direction country_ID
1  unknown 2022-04-01 10:00:03        IN         UK
7  unknown 2022-04-01 09:59:58        IN        GER
9  unknown 2022-04-01 05:04:01       OUT        ITL

Merged data:

df2 = (pd
 .merge_asof(df[mask].sort_values(by='date').reset_index(),
             # rename the right key so both dates survive the merge
             df[~mask].sort_values(by='date').rename(columns={'date': 'date_match'}),
             by=['direction', 'country_ID'],
             left_on='date', right_on='date_match',
             direction='nearest', tolerance=pd.Timedelta('2s'),
             suffixes=('', '_match')
             )
 # drop the unknown rows that found no match
 .loc[lambda d: d['ID_match'].notna()]
 .set_index('index').sort_index()
)

Output:

            ID                date direction country_ID ID_match          date_match
index                                                                               
1      unknown 2022-04-01 10:00:03        IN         UK        0 2022-04-01 10:00:01
7      unknown 2022-04-01 09:59:58        IN        GER        1 2022-04-01 10:00:00
9      unknown 2022-04-01 05:04:01       OUT        ITL        2 2022-04-01 05:03:59
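
Since the question asks for the 2-second interval to be adjustable, the same approach can be wrapped in a function. This is only a sketch: the function name `match_unknown`, its parameters, and the reduced sample data (UK and ITL rows only, IDs stored as strings) are assumptions made for illustration.

```python
import pandas as pd


def match_unknown(df, tolerance="2s", unknown="unknown"):
    """Return the unknown-ID rows that have a known-ID row with the same
    direction and country_ID within `tolerance`, plus the matched ID/date."""
    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])
    mask = df["ID"].eq(unknown)

    # merge_asof requires both sides to be sorted by the time key
    left = df[mask].sort_values("date").reset_index()
    right = (df[~mask]
             .sort_values("date")
             .rename(columns={"date": "date_match"}))

    merged = pd.merge_asof(
        left, right,
        by=["direction", "country_ID"],
        left_on="date", right_on="date_match",
        direction="nearest",
        tolerance=pd.Timedelta(tolerance),
        suffixes=("", "_match"),
    )
    # rows without a match have NaN in ID_match; drop them
    return (merged[merged["ID_match"].notna()]
            .set_index("index")
            .sort_index())


# Reduced sample taken from the question's table (hypothetical subset)
df = pd.DataFrame(
    [("0", "2022-04-01 10:00:01", "IN", "UK"),
     ("unknown", "2022-04-01 10:00:03", "IN", "UK"),
     ("unknown", "2022-04-01 05:04:03", "OUT", "ITL"),
     ("unknown", "2022-04-01 05:04:01", "OUT", "ITL"),
     ("2", "2022-04-01 05:03:59", "OUT", "ITL")],
    columns=["ID", "date", "direction", "country_ID"],
)

within_2s = match_unknown(df)         # default tolerance of 2 seconds
within_5s = match_unknown(df, "5s")   # wider window picks up the 05:04:03 row
```

With the 2-second default, the `05:04:03` row is dropped because its nearest known row is 4 seconds away; widening the tolerance to 5 seconds keeps it.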