Get records within a time interval of a given date that meet specific conditions in a pandas DataFrame
Consider the following pandas DataFrame:
| ID | date | direction | country_ID |
|-----------|-------------------------|---------------|------------|
| 0 | 2022-04-01 10:00:01 | IN | UK |
| unknown | 2022-04-01 10:00:03 | IN | UK |
| 0 | 2022-04-01 12:00:01 | OUT | UK |
| 0 | 2022-04-01 12:30:11 | IN | GER |
| 1 | 2022-04-01 10:00:00 | IN | GER |
| 1 | 2022-04-01 08:04:03 | OUT | GER |
| unknown | 2022-04-01 10:20:02 | OUT | USA |
| unknown | 2022-04-01 09:59:58 | IN | GER |
| unknown | 2022-04-01 05:04:03 | OUT | ITL |
| unknown | 2022-04-01 05:04:01 | OUT | ITL |
| 2 | 2022-04-01 05:03:59 | OUT | ITL |
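I need to create a DataFrame containing the rows whose ID is unknown and for which there is a matching record: one with the same direction and country_ID, at most 2 seconds away in time (this interval may change), and whose ID is known, that is, different from unknown.
All rows with an unknown ID: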
| ID | date | direction | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown | 2022-04-01 10:00:03 | IN | UK |
| unknown | 2022-04-01 10:20:02 | OUT | USA |
| unknown | 2022-04-01 09:59:58 | IN | GER |
| unknown | 2022-04-01 05:04:03 | OUT | ITL |
| unknown | 2022-04-01 05:04:01 | OUT | ITL |
Matches for each of the rows listed above:
| ID | date | direction | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown | 2022-04-01 10:00:03 | IN | UK |
| 0 | 2022-04-01 10:00:01 | IN | UK |
| ID | date | direction | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown | 2022-04-01 10:20:02 | OUT | USA |
| ID | date | direction | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown | 2022-04-01 09:59:58 | IN | GER |
| 1 | 2022-04-01 10:00:00 | IN | GER |
| ID | date | direction | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown | 2022-04-01 05:04:03 | OUT | ITL |
| ID | date | direction | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown | 2022-04-01 05:04:01 | OUT | ITL |
| 2 | 2022-04-01 05:03:59 | OUT | ITL |
We drop the rows that have no match, which gives the resulting DataFrame:
| ID | date | direction | country_ID | date_match | ID_match |
|-----------|-------------------------|---------------|------------|----------------------|---------------|
| unknown | 2022-04-01 10:00:03 | IN | UK | 2022-04-01 10:00:01 | 0 |
| unknown | 2022-04-01 09:59:58 | IN | GER | 2022-04-01 10:00:00 | 1 |
| unknown | 2022-04-01 05:04:01 | OUT | ITL | 2022-04-01 05:03:59 | 2 |
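Thanks in advance for your help.
You can split the DataFrame in two with a mask, then use pandas.merge_asof
to find, per group, the nearest match within 2 seconds: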
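    import pandas as pd

    # merge_asof needs real datetime keys, not strings
    df['date'] = pd.to_datetime(df['date'])
    # True for the rows whose ID is unknown
    mask = df['ID'].eq('unknown')

    idx = (pd
       .merge_asof(df[mask].sort_values(by='date').reset_index(),
                   df[~mask].sort_values(by='date'),
                   by=['direction', 'country_ID'],
                   on='date',
                   direction='nearest', tolerance=pd.Timedelta('2s'),
                   )
       # keep the original index of the unknown rows that found a match
       .loc[lambda d: d['ID_y'].notna(), 'index']
    )

    df.loc[sorted(idx)]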
Output:
            ID                date direction country_ID
    1  unknown 2022-04-01 10:00:03        IN         UK
    7  unknown 2022-04-01 09:59:58        IN        GER
    9  unknown 2022-04-01 05:04:01       OUT        ITL
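The snippet above assumes that `df` already exists. As a minimal reproducible sketch, the table from the question can be built by hand and run through the same steps. Storing every ID as a string is an assumption made here for illustration; the question does not say how the column is typed.

```python
import pandas as pd

# Sample data copied from the question.
# Assumption: all IDs are stored as strings, including the numeric ones.
df = pd.DataFrame({
    "ID": ["0", "unknown", "0", "0", "1", "1",
           "unknown", "unknown", "unknown", "unknown", "2"],
    "date": ["2022-04-01 10:00:01", "2022-04-01 10:00:03",
             "2022-04-01 12:00:01", "2022-04-01 12:30:11",
             "2022-04-01 10:00:00", "2022-04-01 08:04:03",
             "2022-04-01 10:20:02", "2022-04-01 09:59:58",
             "2022-04-01 05:04:03", "2022-04-01 05:04:01",
             "2022-04-01 05:03:59"],
    "direction": ["IN", "IN", "OUT", "IN", "IN", "OUT",
                  "OUT", "IN", "OUT", "OUT", "OUT"],
    "country_ID": ["UK", "UK", "UK", "GER", "GER", "GER",
                   "USA", "GER", "ITL", "ITL", "ITL"],
})
df["date"] = pd.to_datetime(df["date"])

mask = df["ID"].eq("unknown")

# Both inputs must be sorted by the "on" key for merge_asof.
merged = pd.merge_asof(
    df[mask].sort_values("date").reset_index(),
    df[~mask].sort_values("date"),
    by=["direction", "country_ID"],
    on="date",
    direction="nearest",
    tolerance=pd.Timedelta("2s"),
)

# Rows without a partner inside the tolerance get NaN in ID_y.
matched_index = sorted(merged.loc[merged["ID_y"].notna(), "index"])
result = df.loc[matched_index]
print(result)
```

The unknown rows at 10:20:02 (USA) and 05:04:03 (ITL) drop out: the first has no known-ID row in its group, and the second is 4 seconds away from its only candidate.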
To also get the merged data (matched date and ID):
    df2 = (pd
       .merge_asof(df[mask].sort_values(by='date').reset_index(),
                   # rename the right-hand date so it survives the merge as its own column
                   df[~mask].sort_values(by='date').rename(columns={'date': 'date_match'}),
                   by=['direction', 'country_ID'],
                   left_on='date', right_on='date_match',
                   direction='nearest', tolerance=pd.Timedelta('2s'),
                   suffixes=('', '_match')
                   )
       # drop the unknown rows that found no match
       .loc[lambda d: d['ID_match'].notna()]
       # restore the original row labels and order
       .set_index('index').sort_index()
    )
Output:
                ID                date direction country_ID ID_match          date_match
    index
    1      unknown 2022-04-01 10:00:03        IN         UK        0 2022-04-01 10:00:01
    7      unknown 2022-04-01 09:59:58        IN        GER        1 2022-04-01 10:00:00
    9      unknown 2022-04-01 05:04:01       OUT        ITL        2 2022-04-01 05:03:59
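Since the question mentions that the 2-second window may change, the same logic can be wrapped in a function that takes the tolerance as a parameter. This is only a sketch: the name `match_unknown`, its parameters, and the three-row sample (the ITL rows from the question, with IDs assumed to be strings) are illustrative and not part of the original answer.

```python
import pandas as pd


def match_unknown(df, tolerance="2s", unknown="unknown"):
    """Return the unknown-ID rows that have a known-ID row with the same
    direction and country_ID within `tolerance`, plus the matched ID and date.
    (Hypothetical helper, written for illustration.)"""
    # Work on a copy so the caller's DataFrame is left untouched.
    df = df.assign(date=pd.to_datetime(df["date"]))
    mask = df["ID"].eq(unknown)
    right = (df[~mask]
             .sort_values("date")
             .rename(columns={"date": "date_match"}))
    merged = pd.merge_asof(
        df[mask].sort_values("date").reset_index(),
        right,
        by=["direction", "country_ID"],
        left_on="date", right_on="date_match",
        direction="nearest",
        tolerance=pd.Timedelta(tolerance),
        suffixes=("", "_match"),
    )
    # Unmatched rows carry NaN in ID_match; drop them and restore row labels.
    return (merged[merged["ID_match"].notna()]
            .set_index("index")
            .sort_index())


# The three ITL rows from the question (IDs as strings is an assumption).
sample = pd.DataFrame({
    "ID": ["unknown", "unknown", "2"],
    "date": ["2022-04-01 05:04:03", "2022-04-01 05:04:01",
             "2022-04-01 05:03:59"],
    "direction": ["OUT", "OUT", "OUT"],
    "country_ID": ["ITL", "ITL", "ITL"],
})

within_2s = match_unknown(sample, "2s")  # only the 05:04:01 row matches
within_4s = match_unknown(sample, "4s")  # the 05:04:03 row now matches too
print(within_2s)
print(within_4s)
```

Note that `merge_asof` returns at most one partner per unknown row (the nearest one), so if several known IDs fall inside the window, only the closest is reported.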