Get overlapping datetimes from pandas dataframe, considering other field
I have a pandas dataframe as follows:
df_sample = pd.DataFrame({
'machine': [1, 1, 1, 2],
'ts_start': ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
'ts_end': ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"]
})
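I want to check which of these [ts_start, ts_end] intervals overlap, for the same machine. I have seen some questions about finding overlaps, but couldn't find one that groups by another column; in my case, overlaps should be considered separately for each machine.
I tried using piso, which looks very interesting.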
import piso
df_sample['ts_start'] = pd.to_datetime(df_sample['ts_start'])
df_sample['ts_end'] = pd.to_datetime(df_sample['ts_end'])
ii = pd.IntervalIndex.from_arrays(df_sample["ts_start"], df_sample["ts_end"])
df_sample["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values
I get something like this:
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 1
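However, this considers all machines at once. Is there a way (with or without piso) to get the overlapping intervals for each machine separately, in a single dataframe?
Here is one way to do what your question asks: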
import pandas as pd
df_sample = pd.DataFrame({
'machine': [1, 1, 1, 2],
'ts_start': ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
'ts_end': ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"]
})
df_sample = df_sample.sort_values(['machine', 'ts_start', 'ts_end'])
print(df_sample)
def foo(x):
    # x holds the rows of one machine, already sorted by ts_start
    x = x.copy()
    if len(x.index) > 1:
        # iPrev is the row whose interval reaches furthest so far
        iPrev, reachOfPrev = x.index[0], x.loc[x.index[0], 'ts_end']
        x.loc[iPrev, 'isOverlap'] = 0
        for i in x.index[1:]:
            if x.loc[i, 'ts_start'] < reachOfPrev:
                x.loc[iPrev, 'isOverlap'] = 1
                x.loc[i, 'isOverlap'] = 1
            else:
                x.loc[i, 'isOverlap'] = 0
            if x.loc[i, 'ts_end'] > reachOfPrev:
                iPrev, reachOfPrev = i, x.loc[i, 'ts_end']
    else:
        x['isOverlap'] = 0
    x['isOverlap'] = x['isOverlap'].astype(int)
    return x
# group_keys=False keeps the original flat index in the result
df_sample = df_sample.groupby('machine', group_keys=False).apply(foo)
print(df_sample)
Input:
machine ts_start ts_end
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00
Output:
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 0
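The same sweep can probably be written without an explicit Python loop. The following is only a sketch, not part of the answer above: it assumes the rows are sorted by start time within each machine, and that two intervals which merely touch at an endpoint should not count as overlapping (the same strict `<` as in `foo`). The names `prev_max_end` and `next_start` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "machine": [1, 1, 1, 2],
    "ts_start": ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
    "ts_end": ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"],
})
df["ts_start"] = pd.to_datetime(df["ts_start"])
df["ts_end"] = pd.to_datetime(df["ts_end"])
df = df.sort_values(["machine", "ts_start", "ts_end"])

grouped = df.groupby("machine")

# Furthest end time among the earlier rows of the same machine (NaT for the first row).
prev_max_end = grouped["ts_end"].transform(lambda s: s.cummax().shift())
# Start time of the next row of the same machine (NaT for the last row).
next_start = grouped["ts_start"].shift(-1)

# A row overlaps if it starts before an earlier interval ends,
# or if the next interval starts before it ends. Comparisons with NaT are False.
df["isOverlap"] = ((df["ts_start"] < prev_max_end) | (next_start < df["ts_end"])).astype(int)
print(df)
```

On the sample data this should flag the first two rows of machine 1 and nothing else, matching the output above.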
Assuming overlaps only need to be checked at one-minute resolution, you could try:
#create date ranges by minute frequency
df_sample["times"] = df_sample.apply(lambda row: pd.date_range(row["ts_start"], row["ts_end"], freq="1min"), axis=1)
#explode to get one row per minute
df_sample = df_sample.explode("times")
#check if times overlap by looking for duplicates
df_sample["isOverlap"] = df_sample[["machine","times"]].duplicated(keep=False)
#groupby to get back original data structure
output = df_sample.drop("times", axis=1).groupby(["machine","ts_start","ts_end"]).any().astype(int).reset_index()
>>> output
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 0
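One thing to be aware of with the minute-sampling approach: `pd.date_range` includes both endpoints by default, so two back-to-back intervals that share an endpoint end up with one minute in common and get flagged. Whether that is wanted depends on the data. The sketch below illustrates the difference on a made-up two-row frame; the helper `minute_overlap` is hypothetical, and the `inclusive` argument assumes pandas 1.4 or later.

```python
import pandas as pd

# Two back-to-back intervals on the same machine, sharing the 21:00 endpoint.
df = pd.DataFrame({
    "machine": [1, 1],
    "ts_start": pd.to_datetime(["2022-01-01 20:00:00", "2022-01-01 21:00:00"]),
    "ts_end": pd.to_datetime(["2022-01-01 21:00:00", "2022-01-01 22:00:00"]),
})

def minute_overlap(frame, inclusive):
    """Flag rows that share at least one sampled minute with another row of the same machine."""
    tmp = frame.copy()
    # One list of minute timestamps per row; `inclusive` controls which endpoints are sampled.
    tmp["times"] = [
        list(pd.date_range(start, end, freq="1min", inclusive=inclusive))
        for start, end in zip(tmp["ts_start"], tmp["ts_end"])
    ]
    tmp = tmp.explode("times")
    dup = tmp[["machine", "times"]].duplicated(keep=False)
    # Collapse back to one flag per original row (explode repeats the original index).
    return dup.groupby(level=0).any().astype(int)

flags_both = minute_overlap(df, "both")   # both endpoints sampled
flags_left = minute_overlap(df, "left")   # end timestamp excluded
print(flags_both.tolist(), flags_left.tolist())
```

With `inclusive="both"` the shared 21:00 timestamp makes both rows count as overlapping; with `inclusive="left"` they do not.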
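piso can indeed be used. It will run fast on large datasets and does not depend on any assumption about the time sampling rate. Modify your piso example so that the last two lines are wrapped in a function: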
def make_overlaps(df):
    ii = pd.IntervalIndex.from_arrays(df["ts_start"], df["ts_end"])
    df["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values
    return df
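Then group df_sample by the machine column and apply the function (group_keys=False keeps the original flat index):
df_sample.groupby("machine", group_keys=False).apply(make_overlaps)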
This will give you:
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 0
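If installing piso is not an option, a similar per-machine check can be sketched with pandas alone using `IntervalIndex.overlaps`. This is an illustrative alternative rather than the piso method itself: it compares every interval against all intervals of the same machine, so it is quadratic per group, and it assumes intervals of non-zero length with the default `closed="right"` (so intervals that only touch at an endpoint are not flagged).

```python
import pandas as pd

df = pd.DataFrame({
    "machine": [1, 1, 1, 2],
    "ts_start": ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
    "ts_end": ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"],
})
df["ts_start"] = pd.to_datetime(df["ts_start"])
df["ts_end"] = pd.to_datetime(df["ts_end"])

flags = pd.Series(0, index=df.index)
for _, group in df.groupby("machine"):
    ii = pd.IntervalIndex.from_arrays(group["ts_start"], group["ts_end"])
    # Every interval overlaps itself, so more than one hit means
    # it also overlaps another interval of the same machine.
    flags.loc[group.index] = [int(ii.overlaps(iv).sum() > 1) for iv in ii]

df["isOverlap"] = flags
print(df)
```

Looping over the groups explicitly, instead of using `groupby(...).apply`, sidesteps the question of how `apply` reassembles the per-group results.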