Get overlapping datetimes from pandas dataframe, considering other field

I have a pandas DataFrame as follows:

import pandas as pd

df_sample = pd.DataFrame({
        'machine': [1, 1, 1, 2],
        'ts_start': ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
        'ts_end': ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"]
})

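I want to check which of these [ts_start, ts_end] intervals overlap, for the same machine. I have seen some questions about finding overlaps, but could not find one that groups by another column. In my case, the overlaps should be considered separately for each machine.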

I tried using piso, which looks promising.

import piso

df_sample['ts_start'] = pd.to_datetime(df_sample['ts_start'])
df_sample['ts_end'] = pd.to_datetime(df_sample['ts_end'])

ii = pd.IntervalIndex.from_arrays(df_sample["ts_start"], df_sample["ts_end"])
df_sample["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values

I get something like this:

   machine            ts_start              ts_end  isOverlap
0        1 2022-01-01 20:00:00 2022-01-01 21:00:00          1
1        1 2022-01-01 20:30:00 2022-01-01 21:30:00          1
2        1 2022-01-02 20:30:00 2022-01-02 20:35:00          0
3        2 2022-01-01 19:00:00 2022-01-01 23:00:00          1

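However, this considers all machines at once. Is there a way (with or without piso) to get the overlaps for each machine separately, in a single DataFrame?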

Here is one way to do what your question asks:

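import pandas as pd

df_sample = pd.DataFrame({
        'machine': [1, 1, 1, 2],
        'ts_start': ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
        'ts_end': ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"]
})
# Sort so that, within each machine, intervals are visited in order of start time.
# The timestamps are ISO-formatted strings, so string comparison matches chronological order.
df_sample = df_sample.sort_values(['machine', 'ts_start', 'ts_end'])
print(df_sample)

def foo(x):
    # x holds the rows of a single machine, already sorted by ts_start
    x = x.copy()
    x['isOverlap'] = 0
    if len(x.index) > 1:
        # iPrev is the row whose interval reaches furthest so far; reachOfPrev is its ts_end
        iPrev, reachOfPrev = x.index[0], x.loc[x.index[0], 'ts_end']
        for i in x.index[1:]:
            if x.loc[i, 'ts_start'] < reachOfPrev:
                # row i starts before the furthest-reaching earlier interval ends
                x.loc[iPrev, 'isOverlap'] = 1
                x.loc[i, 'isOverlap'] = 1
            if x.loc[i, 'ts_end'] > reachOfPrev:
                iPrev, reachOfPrev = i, x.loc[i, 'ts_end']
    return x

# Process each machine separately, then stitch the groups back together.
# (Explicit concat is used because how groupby.apply treats the grouping column varies across pandas versions.)
df_sample = pd.concat([foo(group) for _, group in df_sample.groupby('machine')])
print(df_sample)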

Input:

   machine             ts_start               ts_end
0        1  2022-01-01 20:00:00  2022-01-01 21:00:00
1        1  2022-01-01 20:30:00  2022-01-01 21:30:00
2        1  2022-01-02 20:30:00  2022-01-02 20:35:00
3        2  2022-01-01 19:00:00  2022-01-01 23:00:00

Output:

   machine             ts_start               ts_end  isOverlap
0        1  2022-01-01 20:00:00  2022-01-01 21:00:00          1
1        1  2022-01-01 20:30:00  2022-01-01 21:30:00          1
2        1  2022-01-02 20:30:00  2022-01-02 20:35:00          0
3        2  2022-01-01 19:00:00  2022-01-01 23:00:00          0
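
The per-machine sweep above can also be written without an explicit Python loop. The following is a minimal sketch, not part of the original answer, and it assumes every interval has ts_end later than ts_start. Once rows are sorted by start time within a machine, a row overlaps an earlier row exactly when it starts before the running maximum of the earlier end times, and it overlaps a later row exactly when it ends after the next row's start. The names df_vec, prev_max_end and next_start are illustrative.

```python
import pandas as pd

df_vec = pd.DataFrame({
    "machine": [1, 1, 1, 2],
    "ts_start": ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
    "ts_end": ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"],
})
df_vec["ts_start"] = pd.to_datetime(df_vec["ts_start"])
df_vec["ts_end"] = pd.to_datetime(df_vec["ts_end"])
df_vec = df_vec.sort_values(["machine", "ts_start", "ts_end"])

grouped = df_vec.groupby("machine")

# Latest end time among the earlier rows of the same machine (NaT for the first row)
prev_max_end = grouped["ts_end"].transform(lambda s: s.cummax().shift())
# Start time of the next row of the same machine (NaT for the last row)
next_start = grouped["ts_start"].shift(-1)

# Comparisons against NaT evaluate to False, so first/last rows are handled automatically
overlaps_earlier = df_vec["ts_start"] < prev_max_end
overlaps_later = df_vec["ts_end"] > next_start
df_vec["isOverlap"] = (overlaps_earlier | overlaps_later).astype(int)

print(df_vec)
```

On the sample data this flags the first two rows of machine 1 and nothing else, matching the output above.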

Assuming overlaps only need to be checked at minute resolution, you could try:

#create date ranges by minute frequency
df_sample["times"] = df_sample.apply(lambda row: pd.date_range(row["ts_start"], row["ts_end"], freq="1min"), axis=1)

#explode to get one row per minute
df_sample = df_sample.explode("times")

#check if times overlap by looking for duplicates
df_sample["isOverlap"] = df_sample[["machine","times"]].duplicated(keep=False)

#groupby to get back original data structure
output = df_sample.drop("times", axis=1).groupby(["machine","ts_start","ts_end"]).any().astype(int).reset_index()

>>> output
   machine             ts_start               ts_end  isOverlap
0        1  2022-01-01 20:00:00  2022-01-01 21:00:00          1
1        1  2022-01-01 20:30:00  2022-01-01 21:30:00          1
2        1  2022-01-02 20:30:00  2022-01-02 20:35:00          0
3        2  2022-01-01 19:00:00  2022-01-01 23:00:00          0
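
One thing to be aware of with the minute-expansion approach: pd.date_range includes both endpoints by default, so two intervals that merely touch (one ends at 21:00, the next starts at 21:00) share a timestamp and are flagged as overlapping. The sketch below illustrates this and shows one possible way to treat intervals as half-open instead. It assumes pandas 1.4 or later, where date_range accepts the inclusive argument. The helper minute_overlap and the touching frame are made up for illustration.

```python
import pandas as pd

def minute_overlap(df, inclusive="both"):
    """Flag per-machine overlaps by expanding each interval to one row per minute."""
    out = df.copy()
    out["times"] = out.apply(
        lambda row: pd.date_range(row["ts_start"], row["ts_end"], freq="1min", inclusive=inclusive),
        axis=1,
    )
    out = out.explode("times")
    # A (machine, minute) pair that appears more than once means two intervals share that minute
    out["isOverlap"] = out[["machine", "times"]].duplicated(keep=False)
    return (
        out.drop(columns="times")
        .groupby(["machine", "ts_start", "ts_end"])
        .any()
        .astype(int)
        .reset_index()
    )

# Hypothetical data: the second interval starts exactly when the first one ends
touching = pd.DataFrame({
    "machine": [1, 1],
    "ts_start": ["2022-01-01 20:00:00", "2022-01-01 21:00:00"],
    "ts_end": ["2022-01-01 21:00:00", "2022-01-01 21:30:00"],
})

closed_flags = minute_overlap(touching)["isOverlap"].tolist()
half_open_flags = minute_overlap(touching, inclusive="left")["isOverlap"].tolist()
print(closed_flags, half_open_flags)
```

With the default both-ends-inclusive ranges the two touching intervals are reported as overlapping, while inclusive="left" drops each interval's end minute and reports no overlap.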

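piso can indeed be used here. It runs fast on large datasets and does not rely on any assumption about the time sampling rate. Modify your piso example to wrap the last two lines in a function: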

def make_overlaps(df):
    ii = pd.IntervalIndex.from_arrays(df["ts_start"], df["ts_end"])
    df["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values
    return df

Then group df_sample by the machine column and apply the function:

df_sample.groupby("machine", group_keys=False).apply(make_overlaps)

This gives you:

   machine            ts_start              ts_end  isOverlap
0        1 2022-01-01 20:00:00 2022-01-01 21:00:00          1
1        1 2022-01-01 20:30:00 2022-01-01 21:30:00          1
2        1 2022-01-02 20:30:00 2022-01-02 20:35:00          0
3        2 2022-01-01 19:00:00 2022-01-01 23:00:00          0
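
If piso is not available, the same per-machine result can be cross-checked with pandas alone. This is a rough sketch rather than a drop-in replacement: it uses IntervalIndex.overlaps, compares every interval against every other interval in its group (quadratic per machine, so likely slower on large groups), and relies on the default right-closed intervals, which may treat touching endpoints differently from piso. The names df_check and flag_overlaps are illustrative.

```python
import pandas as pd

df_check = pd.DataFrame({
    "machine": [1, 1, 1, 2],
    "ts_start": ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
    "ts_end": ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"],
})
df_check["ts_start"] = pd.to_datetime(df_check["ts_start"])
df_check["ts_end"] = pd.to_datetime(df_check["ts_end"])

def flag_overlaps(group):
    """Return 1 for each row whose interval overlaps another row of the same group."""
    ii = pd.IntervalIndex.from_arrays(group["ts_start"], group["ts_end"])
    # Every interval overlaps itself, so more than one hit means a real overlap
    flags = [int(ii.overlaps(interval).sum() > 1) for interval in ii]
    return pd.Series(flags, index=group.index)

# Compute the flags machine by machine; assignment aligns on the original row index
df_check["isOverlap"] = pd.concat(
    [flag_overlaps(group) for _, group in df_check.groupby("machine")]
)
print(df_check)
```

On the sample data this reproduces the table above: only the first two rows of machine 1 are flagged.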