Compare a Series of intervals with itself

Given a Series of Interval objects:

import pandas as pd

s = pd.Series([
    pd.Interval(left=pd.Timestamp('2020-01-01'), right=pd.Timestamp('2020-01-05'), closed='both'),
    pd.Interval(left=pd.Timestamp('2020-01-01'), right=pd.Timestamp('2020-01-02'), closed='both'),
    pd.Interval(left=pd.Timestamp('2020-01-04'), right=pd.Timestamp('2020-01-05'), closed='both'),
])

I want to check whether each pair of intervals overlaps, as in an outer product. Interval provides the method overlaps() for this.
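For reference, a minimal sketch of what overlaps() returns for a single pair, using the same intervals as in the Series above (the names a, b and c are only illustrative):

```python
import pandas as pd

a = pd.Interval(pd.Timestamp('2020-01-01'), pd.Timestamp('2020-01-02'), closed='both')
b = pd.Interval(pd.Timestamp('2020-01-04'), pd.Timestamp('2020-01-05'), closed='both')
c = pd.Interval(pd.Timestamp('2020-01-01'), pd.Timestamp('2020-01-05'), closed='both')

a_b = a.overlaps(b)  # False: a ends before b starts
a_c = a.overlaps(c)  # True: a lies inside c
```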

For a Series of length l, the result should be an l x l matrix/DataFrame that says whether each pair overlaps. For example:

+--------------------------+--------------------------+--------------------------+--------------------------+
|                          | [2020-01-01, 2020-01-05] | [2020-01-01, 2020-01-02] | [2020-01-04, 2020-01-05] |
+--------------------------+--------------------------+--------------------------+--------------------------+
| [2020-01-01, 2020-01-05] | True                     | True                     | True                     |
+--------------------------+--------------------------+--------------------------+--------------------------+
| [2020-01-01, 2020-01-02] | True                     | True                     | False                    |
+--------------------------+--------------------------+--------------------------+--------------------------+
| [2020-01-04, 2020-01-05] | True                     | False                    | True                     |
+--------------------------+--------------------------+--------------------------+--------------------------+

Because the Series is fairly large, I am looking for something more performant and efficient than itertuples().

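You can use pd.IntervalIndex to get the left and right bounds easily, then use NumPy's ufunc.outer with greater_equal and less_equal. Because the intervals are closed on both sides, two intervals overlap exactly when each one's right bound is greater than or equal to the other's left bound.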

import numpy as np

# work with an IntervalIndex
idx = pd.IntervalIndex(s)
# get the right and left bounds
right = idx.right
left = idx.left

# build the l x l boolean array: entry [i, j] is True when intervals i and j overlap
arr = np.greater_equal.outer(right, left) & np.less_equal.outer(left, right)

# create the DataFrame if needed
print(pd.DataFrame(arr, index=s.values, columns=s.values))
                          [2020-01-01, 2020-01-05]  [2020-01-01, 2020-01-02]  \
[2020-01-01, 2020-01-05]                      True                      True   
[2020-01-01, 2020-01-02]                      True                      True   
[2020-01-04, 2020-01-05]                      True                     False   

                          [2020-01-04, 2020-01-05]  
[2020-01-01, 2020-01-05]                      True  
[2020-01-01, 2020-01-02]                     False  
[2020-01-04, 2020-01-05]                      True  

It seems you can also use overlaps directly on the IntervalIndex and do something like this:

np.stack([idx.overlaps(interval) for interval in idx])
# or, as a DataFrame
pd.DataFrame([idx.overlaps(interval) for interval in idx],
             index=s.values, columns=s.values)
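One caveat with the outer-comparison approach: the >= / <= test matches overlaps() only for closed='both'. As far as I can tell, pandas treats a shared endpoint as an overlap only when both touching sides are closed, so for the other closed settings the comparisons would need to be strict. Below is a minimal sketch under that assumption, using plain NumPy broadcasting; pairwise_overlaps is a hypothetical helper name, not a pandas function, and the index is rebuilt with from_arrays only to keep the snippet self-contained.

```python
import numpy as np
import pandas as pd


def pairwise_overlaps(idx: pd.IntervalIndex) -> np.ndarray:
    """Hypothetical helper: boolean l x l matrix of pairwise overlaps."""
    left = idx.left.to_numpy()
    right = idx.right.to_numpy()
    if idx.closed == "both":
        # Closed on both sides: intervals that only share an endpoint overlap.
        return (left[:, None] <= right[None, :]) & (left[None, :] <= right[:, None])
    # Otherwise at least one side is open, so a shared endpoint is not an overlap.
    return (left[:, None] < right[None, :]) & (left[None, :] < right[:, None])


# Same three intervals as in the question.
idx = pd.IntervalIndex.from_arrays(
    pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-04"]),
    pd.to_datetime(["2020-01-05", "2020-01-02", "2020-01-05"]),
    closed="both",
)
result = pairwise_overlaps(idx)
```

It is worth comparing the result against np.stack([idx.overlaps(interval) for interval in idx]) on a small sample before trusting it on the full Series. Also keep in mind that any of these approaches materialises an l x l boolean array, so memory grows quadratically with the length of the Series.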