R - Use time to restrict size of group

Question

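I have a very large data set that is indexed by time. I want to group messages together by time: the first message (at time T) starts a group, and that group lasts until time T+X, at which point a new group starts. The data set can have large gaps (>X) between observations.

Example where the max group size (X, above) is 2 time ticks. The "group" column is the desired output: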

> example=data.table(time=c(1,2,3,4,8,13,14,17), 
group=c(1,1,2,2,3,4,4,5))
> example
   time group
1:    1     1
2:    2     1
3:    3     2
4:    4     2
5:    8     3
6:   13     4
7:   14     4
8:   17     5

Another example, with X=7:

> example2=data.table(time=c(43,44,75,76,77,80,81,82,83,84), group=c(1,1,2,2,2,2,2,3,3,3))
> example2
    time group
 1:   43     1
 2:   44     1
 3:   75     2
 4:   76     2
 5:   77     2
 6:   80     2
 7:   81     2
 8:   82     3
 9:   83     3
10:   84     3

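One approach I considered is to compute the differences between times and use a cumsum function that resets to zero once it reaches the threshold (2 in this case), but I have not been able to figure out how to implement the reset. I am afraid my only solution here will end up being iterative (and, as a result, too slow in plain R).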
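For reference, the "resetting" behaviour can be sketched in base R by carrying the start time of the current group through `Reduce(..., accumulate = TRUE)`. This is only an illustrative sketch: `group_by_window` is a hypothetical name, it assumes `time` is sorted ascending and non-empty, and it assumes a group opened at T covers times t with t - T < X (which matches both examples). `Reduce` still walks the vector one element at a time, so it is not expected to be faster than an explicit loop.

```r
# Hypothetical helper (not from the original post): assign group ids so that a
# group opened at time T covers every time t with t - T < x.
# Assumes `time` is sorted ascending, non-empty, and x > 0.
group_by_window <- function(time, x) {
  # Carry the start time of the current group along the vector.
  starts <- Reduce(
    function(start, t) if (t - start >= x) t else start,
    time,
    accumulate = TRUE
  )
  # A change in the carried start time marks the first row of a new group.
  cumsum(c(1, diff(starts) != 0))
}

group_by_window(c(1, 2, 3, 4, 8, 13, 14, 17), x = 2)
group_by_window(c(43, 44, 75, 76, 77, 80, 81, 82, 83, 84), x = 7)
```

Both calls reproduce the desired "group" columns shown in the two examples.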

--- Edit: some examples of what I have tried, more specifically.

First, calculating the cumsum of the time deltas:

> example[,timeDiff:=c(NA,diff(time))]
> example[,cumulativeTime:=cumsum(c(0,diff(time)))]
> example
   time group timeDiff cumulativeTime
1:    1     1       NA              0
2:    2     1        1              1
3:    3     2        1              2
4:    4     2        1              3
5:    8     3        4              7
6:   13     4        5             12
7:   14     4        1             13
8:   17     5        3             16

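Then I considered taking the cumulative time modulo the max number of time ticks, thinking that a negative delta between consecutive modulo values would signal a new group. As you can see, though, this breaks down whenever there is a meaningful gap in the data.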

> example[,cumTimeMod := cumulativeTime %% 2]
> example
   time group timeDiff cumulativeTime cumTimeMod
1:    1     1       NA              0           0
2:    2     1        1              1           1
3:    3     2        1              2           0
4:    4     2        1              3           1
5:    8     3        4              7           1
6:   13     4        5             12           0
7:   14     4        1             13           1
8:   17     5        3             16           0
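To see where the modulo idea goes wrong, the group starts it flags can be compared with the desired column. The following base R sketch uses illustrative variable names (`mod_groups`, `mismatch`) that are not part of the original post:

```r
time    <- c(1, 2, 3, 4, 8, 13, 14, 17)
desired <- c(1, 1, 2, 2, 3, 4, 4, 5)
x <- 2

cumulative_time <- cumsum(c(0, diff(time)))
cum_time_mod    <- cumulative_time %% x

# Flag a new group wherever the modulo value drops.
new_group  <- c(FALSE, diff(cum_time_mod) < 0)
mod_groups <- cumsum(new_group) + 1

# Rows where the modulo-based grouping disagrees with the desired one.
mismatch <- which(mod_groups != desired)
time[mismatch]
```

The first disagreement is at time 8: the gap of 4 moves the cumulative time from 3 to 7, both of which are 1 modulo 2, so no drop is seen and no new group is flagged. Every later row is then off by one group.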

I also tried integer division instead of modulo, but that fails as well. A different example, with X=7 (mt1022's suggestions are also included below):

    time timeDiff cumulativeTime intDivOfCsumByX desiredGroup g1 g2 g
 1:   43        0              0               0            1  0  1 1
 2:   44        1              1               0            1  0  1 1
 3:   75       31             32               4            2 30  1 2
 4:   76        1             33               4            2 30  1 2
 5:   77        1             34               4            2 30  1 2
 6:   80        3             37               5            2 32  1 3
 7:   81        1             38               5            2 32  1 3
 8:   82        1             39               5            3 32  1 3
 9:   83        1             40               5            3 32  1 3
10:   84        1             41               5            3 32  1 3

Answer 1

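I ended up going with an Rcpp approach to get around the slowness of R plus an iterative algorithm:

library(Rcpp)

cpp.cumsumgrp = cppFunction('
NumericVector cumsumgrp(NumericVector x, int resetMax) {
  int n = x.size();
  NumericVector tmp(n);  // running sum since the current group started
  NumericVector res(n);  // zero-based group id for each element
  if (n == 0) return res;  // nothing to do for empty input
  tmp[0] = 0;
  long groupCount = 0;
  for (int i = 1; i < n; i++) {
    // double rather than long, so fractional time deltas are not truncated
    double csum = tmp[i-1] + x[i];
    // exceeding resetMax starts a new group
    if (csum > resetMax) {
      groupCount++;
    }
    // reset the running sum whenever a new group starts
    tmp[i] = csum > resetMax ? 0 : csum;
    res[i] = groupCount;
  }
  return(res);
}');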

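Usage (x-1 accounts for the inclusive/exclusive max cumsum. I also do not care about the actual numeric value of the group ID, only that all messages in the same group share the same ID):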

> x=7
> example2[,assignedGroup:=cumsumgrp(c(0,diff(time)), x-1)]
> example2
    time desiredGroup assignedGroup
 1:   43            1             0
 2:   44            1             0
 3:   75            2             1
 4:   76            2             1
 5:   77            2             1
 6:   80            2             1
 7:   81            2             1
 8:   82            3             2
 9:   83            3             2
10:   84            3             2
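The effect of passing x-1 rather than x can be checked with a plain R port of the loop above. This is a sketch for illustration only: `cumsumgrp_r` is a hypothetical name and is not part of the answer, and it will be much slower than the Rcpp version on large inputs.

```r
# Plain R port of the Rcpp loop, for illustration only (cumsumgrp_r is a
# hypothetical name, not part of the original answer).
cumsumgrp_r <- function(d, reset_max) {
  n <- length(d)
  res <- numeric(n)
  running <- 0      # sum of deltas since the current group started
  group <- 0        # zero-based group id
  for (i in seq_len(n)[-1]) {
    running <- running + d[i]
    if (running > reset_max) {
      group <- group + 1
      running <- 0
    }
    res[i] <- group
  }
  res
}

time <- c(43, 44, 75, 76, 77, 80, 81, 82, 83, 84)
d <- c(0, diff(time))
x <- 7

with_x_minus_1 <- cumsumgrp_r(d, x - 1)  # time 82 opens the third group
with_x         <- cumsumgrp_r(d, x)      # time 82 stays in the second group
```

With x-1, a group opened at time 75 ends at 81, so 82 = 75 + 7 starts the next group as desired. With x, the comparison `running > reset_max` lets 82 remain in the group that started at 75.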
