How do I prevent {data.table} foverlaps from feeding NAs into its any(...) call when executing on large data.tables?

First, a similar question:

Foverlaps error: Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop

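Story

I am counting how many times fluor emissions (measured every 1 minute) overlap with a given event. An emission is considered to overlap with an event when the emission time falls between 10 minutes before and 30 minutes after the event time. We consider three events in total: AC, CO and MT.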

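Data

Edit 1:

Below are two sample datasets that allow the code below to be executed. For these sets, the code runs just fine. Once I have data that reproduces the error, I will make a second edit. Note that event.GN in the sample below is a data.table rather than a list.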

library(data.table)
library(lubridate)
emissions.GN <- data.table(date.time = seq(ymd_hms("2016-01-01 00:00:00"), by = "min", length.out = 1000000))
event.GN <- data.table(dat = seq(ymd_hms("2016-01-01 00:00:00"), by = "15 mins", length.out = 26383))

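Edit 2: I created a csv file containing the event.GN data that generates the error. The file has 26383 rows of a single variable, dat, but only about 14000 rows are needed to trigger the error.

Edit 3: Up to dat "2017-03-26 00:25:20" the function works fine. As soon as the next record, with dat "2017-03-26 01:33:46", is added, the error occurs. I noticed that there are more than 60 minutes between these two points. This means that between these two event times, one or more emission records will have no corresponding event. That in turn would generate NAs, which somehow end up in the any() call of the foverlaps function. Am I on the right track?

The fluor emissions are stored in a large data.table (about 1 million rows) named emissions.GN. Note that only the date.time (POSIXct) variable is relevant to my problem.

Example of emissions.GN: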

         date.time     fluor hall                  period        dt
 1: 2016-01-01 00:17:04 0.3044254   GN [2016-01-01,2016-02-21] -16.07373
 2: 2016-01-01 00:17:04 0.4368381   GN [2016-01-01,2016-02-21] -16.07373
 3: 2016-01-01 00:18:04 0.5655382   GN [2016-01-01,2016-02-21] -16.07395
 4: 2016-01-01 00:19:04 0.6542259   GN [2016-01-01,2016-02-21] -16.07417
 5: 2016-01-01 00:21:04 0.6579384   GN [2016-01-01,2016-02-21] -16.07462

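The data for the three events is stored in three smaller data.tables (about 20,000 records) contained in a list named events.GN. Note that only the dat (POSIXct) variable is relevant to my problem.

Example for the AC event (CO and MT are similar):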

events.GN[["AC"]]
              dat hall numevt                                              txtevt
1: 2016-01-01 00:04:54   GN    321     PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)
2: 2016-01-01 00:09:21   GN    321     PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)
3: 2016-01-01 00:38:53   GN    321     PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)
4: 2016-01-01 02:30:33   GN    321     PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)
5: 2016-01-01 02:34:11   GN    321     PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)

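Function

I wrote a function that applies foverlaps to a given (large) x data.table and a given (small) y data.table. The function returns a data.table with two columns. The first column, yid, contains the indices of the emissions.GN observations that overlap at least once with an event. The second column, N, contains the overlap count (i.e. the number of times an overlap occurred for that particular index). Indices of emissions with zero overlaps are omitted from the result.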

# Compute the number of times an emission record falls between the start and end point of an event.
find_index_and_count <- function(hall, event, lower.margin = 10, upper.margin = 30) {
  # Each emission record is a single time point, so its interval has zero length: start and stop are both date.time.
  hall$start <- hall$date.time
  hall$stop <- hall$date.time
  # Each event interval runs from lower.margin minutes before dat to upper.margin minutes after dat (10 and 30 by default).
  event$start <- event$dat - minutes(lower.margin)
  event$stop <- event$dat + minutes(upper.margin)
  # Set the key of both data.tables to start and stop.
  setkey(hall, start, stop)
  setkey(event, start, stop)
  # Return the index of each emission record that falls within N event intervals. The call to na.omit removes the NAs introduced by x records that do not overlap any y interval.
  foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit
}
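
To make the shape of the result concrete, here is a minimal sketch of the same foverlaps pattern on made-up data: sixty one-minute emissions and two events, all values hypothetical. It uses plain second arithmetic (10 * 60 and 30 * 60) instead of lubridate so that it only needs data.table; that substitution is an assumption of this sketch, not part of the function above.

```r
library(data.table)

# Hypothetical emissions: one record per minute for the first hour of 2016.
hall <- data.table(date.time = as.POSIXct("2016-01-01 00:00:00", tz = "UTC") + 60 * (0:59))
hall[, `:=`(start = date.time, stop = date.time)]  # zero-length intervals

# Hypothetical events, five minutes apart, so their windows overlap heavily.
event <- data.table(dat = as.POSIXct(c("2016-01-01 00:15:00", "2016-01-01 00:20:00"), tz = "UTC"))
event[, `:=`(start = dat - 10 * 60, stop = dat + 30 * 60)]  # 10 min before, 30 min after

setkey(hall, start, stop)
setkey(event, start, stop)

# One row per emission index (yid) that lies in at least one event window.
res <- foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid]
print(res)
```

Emissions covered by both event windows get N = 2, those covered by one window get N = 1, and emissions outside every window do not appear in the result at all.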

Function executes successfully for events AC and CO

When called on events AC and CO, the function gives the expected result described above:

find_index_and_count(emissions.GN,events.GN[["AC"]])
   yid N
 1:       1 1
 2:       2 1
 3:       3 1
 4:       4 1
 5:       5 2
---          
find_index_and_count(emissions.GN,events.GN[["CO"]])
   yid N
 1:       3 1
 2:       4 1
 3:       5 1
 4:       6 1
 5:       7 1
---          

Function returns an error when called on the MT event

The following function call results in the error below:

find_index_and_count(emissions.GN,events.GN[["MT"]])

Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop("All entries in column ", : missing value where TRUE/FALSE needed

5. foverlaps(event, hall, nomatch = NA, which = TRUE)
4. eval(lhs, parent, parent)
3. eval(lhs, parent, parent)
2. foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit
1. find_index_and_count(emissions.GN, events.GN[["MT"]])

What I have tried so far

First, in the similar question linked above, someone suggested the following idea:

This often indicates an NA value being fed to the any function, so it returns NA and that's not a legal logical value. – Carl Witthoft May 7 '15 at 13:50

So I modified the call to foverlaps so that it returns 0 instead of NA when no overlap is found between x and y, as follows:

foverlaps(event,hall,nomatch = 0, which = TRUE)[, .N, by=yid] %>% na.omit

This did not change anything (the function works for AC and CO but not for MT).
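
The traceback hints at why: the failing check compares the two interval columns of the first argument to foverlaps, which happens regardless of how unmatched rows are reported. The sketch below, on made-up numeric intervals with one deliberately missing start value, illustrates that an NA endpoint makes foverlaps stop for both nomatch settings. The helper try_overlap is hypothetical, and the exact error message may differ between data.table versions.

```r
library(data.table)

# Hypothetical lookup table of zero-length intervals.
y <- data.table(start = c(1, 5), stop = c(1, 5))
setkey(y, start, stop)

# Hypothetical query table; the second interval has a missing start value.
x <- data.table(start = c(0, NA), stop = c(2, 6))
setkey(x, start, stop)  # setkey sorts the NA row first

# Run foverlaps and capture the error condition instead of aborting.
try_overlap <- function(nomatch) {
  tryCatch(
    foverlaps(x, y, nomatch = nomatch, which = TRUE),
    error = function(e) e
  )
}

res_na <- try_overlap(NA)
res_zero <- try_overlap(0)

# Both calls are expected to yield an error condition rather than a result.
print(inherits(res_na, "error"))
print(inherits(res_zero, "error"))
```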

Second, I made absolutely sure that none of my data.tables contain NAs.
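
Checking the input tables is not quite the same as checking what foverlaps sees, because the start and stop columns are computed inside the function. One candidate cause worth ruling out, offered here as an assumption rather than a confirmed diagnosis, is date arithmetic that lands on a clock time skipped by a daylight-saving change. A minimal sketch of screening the derived endpoints follows, on a made-up event table with one deliberately missing time stamp; the names event and bad are illustrative only.

```r
library(data.table)

# Hypothetical event table; the second time stamp is deliberately missing
# to stand in for an endpoint that could not be computed.
event <- data.table(
  dat = as.POSIXct(c("2016-01-01 00:04:54", NA, "2016-01-01 00:38:53"), tz = "UTC")
)

# Derive the interval endpoints with plain second arithmetic (10 and 30 minutes).
event[, `:=`(start = dat - 10 * 60, stop = dat + 30 * 60)]

# Rows that would break the interval check: a missing endpoint or stop before start.
bad <- event[is.na(start) | is.na(stop) | stop < start]
print(bad)

# Keep only well-formed intervals before keying and calling foverlaps.
clean <- event[!(is.na(start) | is.na(stop) | stop < start)]
```

Running the same filter on the real MT table, after the margins have been applied, would show whether any derived endpoint is NA even though dat itself is complete.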

More information

I'm trying to count how many times fluor emissions (measured every 1 minute) overlap with a given event. An emission is said to overlap with a given event when the emission time is 10 minutes before or 30 minutes after the time of the event.

Just addressing this objective (since I don't know foverlaps very well)...

event.GN[, n := 
  emissions.GN[.SD[, .(d_dn = dat - 10*60, d_up = dat + 30*60)], on=.(date.time >= d_dn, date.time <= d_up), 
    .N
  , by=.EACHI]$N
]

                       dat  n
    1: 2016-01-01 00:00:00 31
    2: 2016-01-01 00:15:00 41
    3: 2016-01-01 00:30:00 41
    4: 2016-01-01 00:45:00 41
    5: 2016-01-01 01:00:00 41
   ---                       
26379: 2016-10-01 18:30:00 41
26380: 2016-10-01 18:45:00 41
26381: 2016-10-01 19:00:00 41
26382: 2016-10-01 19:15:00 41
26383: 2016-10-01 19:30:00 41

To check/verify one of them...

> # dat from 99th event...
> my_d <- event.GN[99, {print(.SD); dat}]
                   dat  n
1: 2016-01-02 00:30:00 41
> 
> # subsetting to overlapping emissions
> emissions.GN[date.time %between% (my_d + c(-10*60, 30*60))]
              date.time
 1: 2016-01-02 00:20:00
 2: 2016-01-02 00:21:00
 3: 2016-01-02 00:22:00
 4: 2016-01-02 00:23:00
 5: 2016-01-02 00:24:00
 6: 2016-01-02 00:25:00
 7: 2016-01-02 00:26:00
 8: 2016-01-02 00:27:00
 9: 2016-01-02 00:28:00
10: 2016-01-02 00:29:00
11: 2016-01-02 00:30:00
12: 2016-01-02 00:31:00
13: 2016-01-02 00:32:00
14: 2016-01-02 00:33:00
15: 2016-01-02 00:34:00
16: 2016-01-02 00:35:00
17: 2016-01-02 00:36:00
18: 2016-01-02 00:37:00
19: 2016-01-02 00:38:00
20: 2016-01-02 00:39:00
21: 2016-01-02 00:40:00
22: 2016-01-02 00:41:00
23: 2016-01-02 00:42:00
24: 2016-01-02 00:43:00
25: 2016-01-02 00:44:00
26: 2016-01-02 00:45:00
27: 2016-01-02 00:46:00
28: 2016-01-02 00:47:00
29: 2016-01-02 00:48:00
30: 2016-01-02 00:49:00
31: 2016-01-02 00:50:00
32: 2016-01-02 00:51:00
33: 2016-01-02 00:52:00
34: 2016-01-02 00:53:00
35: 2016-01-02 00:54:00
36: 2016-01-02 00:55:00
37: 2016-01-02 00:56:00
38: 2016-01-02 00:57:00
39: 2016-01-02 00:58:00
40: 2016-01-02 00:59:00
41: 2016-01-02 01:00:00
              date.time
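
The same kind of check can be automated on a small made-up table. The sketch below reuses the non-equi join from above on two hypothetical events and 120 one-minute emissions, then compares the counts with a brute-force count. The table names emissions, events and bounds and the helper count_in_window are illustrative only.

```r
library(data.table)

# Hypothetical data: 120 one-minute emissions and two events one hour apart.
emissions <- data.table(date.time = as.POSIXct("2016-01-01 00:00:00", tz = "UTC") + 60 * (0:119))
events <- data.table(dat = as.POSIXct(c("2016-01-01 00:00:00", "2016-01-01 01:00:00"), tz = "UTC"))

# Window bounds per event: 10 minutes before to 30 minutes after.
bounds <- events[, .(d_dn = dat - 10 * 60, d_up = dat + 30 * 60)]

# Non-equi join: for each window, count the emissions that fall inside it.
events[, n := emissions[bounds, on = .(date.time >= d_dn, date.time <= d_up), .N, by = .EACHI]$N]

# Brute-force count for a single event time, used only as a cross-check.
count_in_window <- function(d) {
  lo <- d - 10 * 60
  hi <- d + 30 * 60
  sum(emissions$date.time >= lo & emissions$date.time <= hi)
}
n_check <- vapply(seq_len(nrow(events)), function(k) count_in_window(events$dat[k]), integer(1))

print(events)
print(identical(events$n, n_check))
```

The first event sits at the very start of the emission series, so only the 31 records from 00:00 to 00:30 fall in its window, while the second event has the full 41 records on both sides, matching the pattern in the output above.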