How to find temporal overlap in a time-indexed DataFrame

I have a DataFrame whose index is a date_time and whose columns hold data that is staggered in time. Perhaps the best explanation is to show the DF:

>>> c
                     A           B          C          D
2015-01-01  0.09607408         NaN        NaN        NaN
2015-01-02         NaN  0.03582221        NaN        NaN
2015-01-03   0.2750026         NaN        NaN        NaN
2015-01-04         NaN    0.892619        NaN        NaN
2015-01-05   0.8574456         NaN        NaN        NaN
2015-01-06         NaN  0.08720886        NaN        NaN
2015-01-07   0.7091732         NaN        NaN        NaN
2015-01-08         NaN  0.09354087        NaN        NaN
2015-01-09     0.60924         NaN        NaN        NaN
2015-01-10         NaN   0.1966458        NaN        NaN
2015-01-11         NaN         NaN  0.5135616        NaN
2015-01-12         NaN         NaN        NaN  0.3015004
2015-01-13         NaN         NaN  0.5717249        NaN
2015-01-14         NaN         NaN        NaN  0.5416951
2015-01-15         NaN         NaN  0.1031428        NaN
2015-01-16         NaN         NaN        NaN  0.2944353
2015-01-17         NaN         NaN   0.642031        NaN
2015-01-18         NaN         NaN        NaN  0.2546383
2015-01-19         NaN         NaN  0.6536632        NaN
2015-01-20         NaN         NaN        NaN  0.9877289
2015-01-21         NaN         NaN        NaN        NaN

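Now, since columns A and B are interleaved over the same period with substantial overlap, I would treat them as comparable for analysis purposes.

Likewise, the C and D data both occur in time periods that overlap heavily with each other, but have zero overlap with the A/B period.

I'm trying to come up with a clever way to identify A/B and C/D as pairs. I can imagine doing this with c.A.first_valid_index() and so on, but going that way it all gets very algebraic. I wonder whether there's a clever way to do it using some built-in "overlap" function in the time series tools. I couldn't find anything like that, but I'm hoping it exists. TIA

The code to make the contrived example DF above is: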

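import numpy as np
import pandas as pd

# 21 consecutive days starting 2015-01-01 (already a DatetimeIndex)
t = pd.date_range('20150101', periods=21)

c = pd.DataFrame(index=t, columns=['A', 'B', 'C', 'D'], dtype=float)

# A/B alternate over the first ten days, C/D over the next ten
c.loc[t[0:10:2], 'A'] = np.random.rand(5)
c.loc[t[1:11:2], 'B'] = np.random.rand(5)
c.loc[t[10:20:2], 'C'] = np.random.rand(5)
c.loc[t[11:21:2], 'D'] = np.random.rand(5)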

Here's one way to do it.

Pass a pair of columns to the overlap function. Here is what it does:

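def overlap(cols):
    # True wherever either column of the pair has a value
    v = c[cols[0]].fillna(c[cols[1]]).notnull()
    # inclusive number of calendar days between the first and last value
    days = (v[v].index.max() - v[v].index.min()).days + 1
    # number of days that actually hold a value
    length = len(v[v])
    # the pair "overlaps" only if its combined data leaves no gaps
    return 'Overlap' if length == days else 'No'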

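It fills the NaN values in cols[0] with the values from cols[1] using c[cols[0]].fillna(c[cols[1]]), and then flags the non-null entries with notnull(). Indexing the result with itself, v[v], keeps only the rows that hold a value.

After that, it finds the max and min dates to get the date range, i.e. days, and then checks whether the length of the combined series matches days.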
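To make the three steps concrete, here is a minimal sketch that returns the intermediate numbers instead of the final label. The six-day frame `demo` and the helper name `explain` are made up for illustration and are not part of the answer. Note that, as written, the check reports an overlap only when the combined pair has a value on every day between its first and last dates.

```python
import numpy as np
import pandas as pd

# Hypothetical six-day frame: A and B alternate on days 1-4, C appears only on day 6.
idx = pd.date_range('2015-01-01', periods=6)
demo = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan, np.nan, np.nan],
                     'B': [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],
                     'C': [np.nan, np.nan, np.nan, np.nan, np.nan, 6.0]},
                    index=idx)

def explain(frame, first, second):
    # Step 1: fill the gaps of one column with the other, then flag non-null rows.
    v = frame[first].fillna(frame[second]).notnull()
    filled = v[v]
    # Step 2: calendar days spanned by the combined data, inclusive.
    days = (filled.index.max() - filled.index.min()).days + 1
    # Step 3: number of rows that actually hold a value.
    length = len(filled)
    return days, length

print(explain(demo, 'A', 'B'))  # (4, 4): no gaps, so overlap() would say 'Overlap'
print(explain(demo, 'A', 'C'))  # (6, 3): gaps remain, so overlap() would say 'No'
```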

Now, iterate over the column combinations with overlap(cols):

In [13]: from itertools import combinations

In [14]: for cols in combinations(c.columns, 2):
   ....:     print(cols, overlap(cols))
   ....:
('A', 'B') Overlap
('A', 'C') No
('A', 'D') No
('B', 'C') No
('B', 'D') No
('C', 'D') Overlap

Not sure what's wrong with using .first_valid_index(). It looks pretty neat to me:

periods = pd.Series([pd.date_range(c[col].first_valid_index(), 
                                   c[col].last_valid_index(), freq='D') 
                     for col in c.columns.tolist()], index=c.columns)
overlaps = periods.apply(lambda x: periods.apply(lambda y: x.isin(y).any()))
print(overlaps)

which gives an easy-to-use overlap matrix:

       A      B      C      D
A   True   True  False  False
B   True   True  False  False
C  False  False   True   True
D  False  False   True   True

where checking for an overlap is straightforward:

print(overlaps.loc['A','B'])
# True
print(overlaps.loc['A','C'])
# False
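The same kind of matrix can also be built without materialising every date range, by comparing only the first and last valid index of each column: two closed intervals overlap exactly when each one starts no later than the other ends. A minimal sketch, assuming a small hypothetical frame `demo` and a helper named `span_overlaps` (neither comes from the answer), and assuming every column has at least one non-null value:

```python
import numpy as np
import pandas as pd

# Hypothetical eight-day frame: A/B share days 1-4, C/D share days 5-8.
idx = pd.date_range('2015-01-01', periods=8)
demo = pd.DataFrame(np.nan, index=idx, columns=['A', 'B', 'C', 'D'])
demo.loc[idx[[0, 2]], 'A'] = 1.0
demo.loc[idx[[1, 3]], 'B'] = 1.0
demo.loc[idx[[4, 6]], 'C'] = 1.0
demo.loc[idx[[5, 7]], 'D'] = 1.0

def span_overlaps(frame):
    cols = list(frame.columns)
    # First and last dates holding data, per column.
    starts = {col: frame[col].first_valid_index() for col in cols}
    ends = {col: frame[col].last_valid_index() for col in cols}
    # Closed intervals [start, end] overlap when each starts before the other ends.
    rows = [[bool(starts[a] <= ends[b] and starts[b] <= ends[a]) for b in cols]
            for a in cols]
    return pd.DataFrame(rows, index=cols, columns=cols)

overlaps_demo = span_overlaps(demo)
print(overlaps_demo)
```

This compares spans only, so two columns count as overlapping even if they never hold data on the same day, which matches how the question treats A and B.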

or convert it to a Series:

overlaps = overlaps.stack()
print(overlaps)

A  A     True
   B     True
   C    False
   D    False
B  A     True
   B     True
   C    False
   D    False
C  A    False
   B    False
   C     True
   D     True
D  A    False
   B    False
   C     True
   D     True
dtype: bool

which can be accessed without .loc:

print(overlaps['A','B'])
# True
print(overlaps['A','C'])
# False
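To get from the stacked Series to the list of pairs the question asks for, one option is to keep only the True entries and drop self-matches and mirrored duplicates. A minimal sketch, with the matrix typed in by hand from the output shown earlier and `pairs` as a made-up variable name:

```python
import pandas as pd

# Overlap matrix copied from the output shown above.
cols = ['A', 'B', 'C', 'D']
overlaps = pd.DataFrame([[True, True, False, False],
                         [True, True, False, False],
                         [False, False, True, True],
                         [False, False, True, True]],
                        index=cols, columns=cols)

stacked = overlaps.stack()
# Keep each unordered pair once (left label sorts before right); this also drops self-matches.
pairs = [(a, b) for (a, b), hit in stacked.items() if hit and a < b]
print(pairs)  # [('A', 'B'), ('C', 'D')]
```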