How do I prevent {data.table} foverlaps from feeding NAs into its any(...) call when running on large data.tables?
First, a similar question:
Foverlaps error: Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop
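The story
I am counting how many times fluor emissions (measured once per minute) overlap with a given event. An emission overlaps with an event when the emission time falls between 10 minutes before and 30 minutes after the event time. We consider three events in total: AC, CO and MT.
Data
Edit 1:
Below are two sample datasets that allow the code below to be run.
The code runs fine on these sets. Once I have data that produces the error, I will make a second edit. Note that event.GN in the sample below is a data.table rather than a list.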
library(data.table)
library(lubridate)  # ymd_hms(), minutes()
emissions.GN <- data.table(date.time = seq(ymd_hms("2016-01-01 00:00:00"), by = "min", length.out = 1000000))
event.GN <- data.table(dat = seq(ymd_hms("2016-01-01 00:00:00"), by = "15 mins", length.out = 26383))
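Edit 2:
I created a csv file containing the event.GN data that produces the error. The file holds 26383 rows of a single variable, but only about 14000 rows are needed to produce the error.
Edit 3:
The function works fine up to dat "2017-03-26 00:25:20". The error occurs as soon as the next record, with dat "2017-03-26 01:33:46", is added. I noticed that there are more than 60 minutes between these two points. This means that between these two event times, one or more emission records will have no corresponding event. That in turn would generate NAs, which somehow end up in the any() call inside foverlaps. Am I on the right track?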
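The gap observation in Edit 3 can be checked directly. The sketch below lists events that follow a gap longer than a threshold; the helper name event_gaps, its min_gap_mins parameter and the toy data are assumptions for illustration, not part of the original code. Note that the AC sample shown further down already contains a gap of nearly two hours (00:38:53 to 02:30:33) and AC runs fine, so a gap on its own may not be the whole explanation.

```r
library(data.table)

# Hypothetical helper: list events preceded by a gap longer than min_gap_mins.
event_gaps <- function(event, min_gap_mins = 60) {
  ev <- copy(event)          # do not modify the caller's table
  setorder(ev, dat)          # gaps only make sense in time order
  # Minutes elapsed since the previous event (NA for the first row).
  ev[, gap_mins := c(NA, diff(as.numeric(dat))) / 60]
  ev[!is.na(gap_mins) & gap_mins > min_gap_mins]
}

# Toy data: events at 0, 15 and 90 minutes after midnight.
toy_event <- data.table(
  dat = as.POSIXct("2016-01-01 00:00:00", tz = "UTC") + c(0, 15, 90) * 60
)
gaps <- event_gaps(toy_event)
print(gaps)
```

On the real data, event_gaps(events.GN[["MT"]]) and event_gaps(events.GN[["AC"]]) could be compared to see whether long gaps are specific to MT.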
The fluor emissions are stored in a large data.table (about 1 million rows) named emissions.GN. Note that only the date.time (POSIXct) variable is relevant to my problem.
Example of emissions.GN:
             date.time     fluor hall                  period        dt
1: 2016-01-01 00:17:04 0.3044254   GN [2016-01-01,2016-02-21] -16.07373
2: 2016-01-01 00:17:04 0.4368381   GN [2016-01-01,2016-02-21] -16.07373
3: 2016-01-01 00:18:04 0.5655382   GN [2016-01-01,2016-02-21] -16.07395
4: 2016-01-01 00:19:04 0.6542259   GN [2016-01-01,2016-02-21] -16.07417
5: 2016-01-01 00:21:04 0.6579384   GN [2016-01-01,2016-02-21] -16.07462
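The data for the three events are stored in three smaller data.tables (about 20,000 records each) contained in a list named events.GN. Note that only the dat (POSIXct) variable is relevant to my problem.
Example for the AC event (CO and MT are similar):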
events.GN[["AC"]]
                   dat hall numevt                                          txtevt
1: 2016-01-01 00:04:54   GN    321 PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)
2: 2016-01-01 00:09:21   GN    321 PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)
3: 2016-01-01 00:38:53   GN    321 PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)
4: 2016-01-01 02:30:33   GN    321 PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)
5: 2016-01-01 02:34:11   GN    321 PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)
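The function
I wrote a function that applies foverlaps to a given (large) emissions data.table and a given (small) event data.table. It returns a data.table with two columns. The first column, yid, contains the indices of the emissions.GN observations that overlap with an event at least once. The second column, N, contains the overlap count (that is, how many times that particular index overlaps). Emission indices with zero overlaps are omitted from the result. Note that the event table is passed to foverlaps as x and the emissions table as y, which is why the index column is yid.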
# Requires data.table, lubridate (minutes()) and magrittr (%>%).
# Count how many times each emission record falls between the start and end point of an event.
find_index_and_count <- function(hall, event, lower.margin = 10, upper.margin = 30) {
  # Give each emission record a zero-width interval: it is a single time point, not an interval.
  hall$start <- hall$date.time
  hall$stop <- hall$date.time
  # Event intervals run from lower.margin minutes before to upper.margin minutes after the event time (10 and 30 by default).
  event$start <- event$dat - minutes(lower.margin)
  event$stop <- event$dat + minutes(upper.margin)
  # Set the key of both tables to start and stop.
  setkey(hall, start, stop)
  setkey(event, start, stop)
  # Return the index of each emission record that falls within an event interval, together with the number N of such intervals.
  # na.omit removes the NAs introduced by x records (events) that do not overlap any y record (emission).
  foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit
}
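For comparison, here is a hedged variant of the function, written under the assumption that NA interval bounds are what trips the check inside foverlaps. It builds the bounds with plain second arithmetic on POSIXct instead of lubridate periods, stops early with a readable message if any bound is NA, and uses nomatch = 0L so that na.omit is not needed. The name find_index_and_count_safe and the toy data are hypothetical.

```r
library(data.table)

find_index_and_count_safe <- function(hall, event, lower.margin = 10, upper.margin = 30) {
  hall  <- copy(hall)    # work on copies so := does not touch the caller's tables
  event <- copy(event)

  # Emissions are time points, so their interval has zero width.
  hall[, `:=`(start = date.time, stop = date.time)]
  # Margins are given in minutes; POSIXct arithmetic works in seconds.
  event[, `:=`(start = dat - lower.margin * 60, stop = dat + upper.margin * 60)]

  # Fail early, with a readable message, if any bound is missing.
  if (anyNA(hall$start) || anyNA(event$start) || anyNA(event$stop)) {
    stop("NA found in interval bounds; inspect the input timestamps")
  }

  setkey(hall, start, stop)
  setkey(event, start, stop)

  # nomatch = 0L drops events without any overlapping emission.
  foverlaps(event, hall, nomatch = 0L, which = TRUE)[, .N, by = yid]
}

# Toy data: one emission per minute for an hour, two events 15 minutes apart.
toy_hall  <- data.table(date.time = as.POSIXct("2016-01-01 00:00:00", tz = "UTC") + 60 * (0:59))
toy_event <- data.table(dat = as.POSIXct("2016-01-01 00:30:30", tz = "UTC") + c(0, 15 * 60))
res <- find_index_and_count_safe(toy_hall, toy_event)
print(res)
```

Working on copies keeps the added start and stop columns and the keys out of the caller's tables; whether that matters depends on how the tables are used afterwards.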
The function succeeds for events AC and CO
When called on events AC and CO, the function gives the expected result described above:
find_index_and_count(emissions.GN,events.GN[["AC"]])
yid N
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
---
find_index_and_count(emissions.GN,events.GN[["CO"]])
yid N
1: 3 1
2: 4 1
3: 5 1
4: 6 1
5: 7 1
---
The function returns an error when called on the MT event
The following function call produces the error below:
find_index_and_count(emissions.GN,events.GN[["MT"]])
Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop("All entries in column ", : missing value where TRUE/FALSE needed
5. foverlaps(event, hall, nomatch = NA, which = TRUE)
4. eval(lhs, parent, parent)
3. eval(lhs, parent, parent)
2. foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit
1. find_index_and_count(emissions.GN, events.GN[["MT"]])
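- As long as a record in x (emissions.FN) and y (events.FN[["AC"]], etc.).
- I don't understand why the function fails on event MT while it works fine for AC and CO. Apart from the values and a slightly different number of records, the data are exactly the same.
What I have tried so far
First, in the similar question linked above, someone suggested the following idea: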
This often indicates an NA value being fed to the any function, so it returns NA and that's not a legal logical value. – Carl Witthoft May 7 '15 at 13:50
So I changed the call to foverlaps to use nomatch = 0 instead of NA, so that rows without an overlap between x and y are dropped rather than returned as NA:
foverlaps(event,hall,nomatch = 0, which = TRUE)[, .N, by=yid] %>% na.omit
This did not change anything (the function works for AC and CO but not for MT).
Second, I made absolutely sure that none of my data tables contain NA.
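One way to test the NA idea is to inspect the derived start and stop columns rather than the input tables, since inputs without NA can still yield NA bounds. A possible cause, which nothing in this thread confirms: lubridate period arithmetic such as dat + minutes(30) can return NA when the result lands on a wall-clock time that does not exist, and 2017-03-26 was a daylight-saving changeover date in much of Europe (01:33:46 plus 30 minutes is 02:03:46). Whether that applies depends on the time zone of dat, which is not stated here. The helper name na_interval_rows and the toy data below are assumptions for illustration only.

```r
library(data.table)

# Hypothetical helper (not from the question): rebuild the interval bounds
# with the supplied functions and return the rows where either bound is NA.
na_interval_rows <- function(event, make_start, make_stop) {
  chk <- copy(event)   # leave the caller's table untouched
  chk[, `:=`(start = make_start(dat), stop = make_stop(dat))]
  chk[is.na(start) | is.na(stop)]
}

# Toy data standing in for events.GN[["MT"]]: the two timestamps from Edit 3.
toy <- data.table(dat = as.POSIXct("2017-03-26 00:25:20", tz = "UTC") + c(0, 4106))

# make_stop deliberately fakes one missing bound so the helper has something to find.
bad <- na_interval_rows(
  toy,
  make_start = function(d) d - 10 * 60,
  make_stop  = function(d) d + c(30 * 60, NA)
)
print(bad)
```

On the real data the same check would be called with the expressions the function actually uses, for example na_interval_rows(events.GN[["MT"]], function(d) d - minutes(10), function(d) d + minutes(30)). If rows come back, replacing minutes() with second arithmetic (or lubridate's dminutes()) would be one thing to try.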
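More information
- If needed, I can provide the SQL code that generates the emissions.FN data and all the events.FN data. Note that because all events.FN dates come from the same source, there should be no difference between the data for events AC, CO and MT (other than the values).
- If you need anything else, feel free to ask!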
I'm trying to count how many times fluor emissions (measured every 1 minute) overlap with a given event. An emission is said to overlap with a given event when the emission time is 10 minutes before or 30 minutes after the time of the event.
Just addressing this objective (since I don't know foverlaps well)...
event.GN[, n :=
emissions.GN[.SD[, .(d_dn = dat - 10*60, d_up = dat + 30*60)], on=.(date.time >= d_dn, date.time <= d_up),
.N
, by=.EACHI]$N
]
dat n
1: 2016-01-01 00:00:00 31
2: 2016-01-01 00:15:00 41
3: 2016-01-01 00:30:00 41
4: 2016-01-01 00:45:00 41
5: 2016-01-01 01:00:00 41
---
26379: 2016-10-01 18:30:00 41
26380: 2016-10-01 18:45:00 41
26381: 2016-10-01 19:00:00 41
26382: 2016-10-01 19:15:00 41
26383: 2016-10-01 19:30:00 41
To check/verify one of them...
> # dat from 99th event...
> my_d <- event.GN[99, {print(.SD); dat}]
dat n
1: 2016-01-02 00:30:00 41
>
> # subsetting to overlapping emissions
> emissions.GN[date.time %between% (my_d + c(-10*60, 30*60))]
date.time
1: 2016-01-02 00:20:00
2: 2016-01-02 00:21:00
3: 2016-01-02 00:22:00
4: 2016-01-02 00:23:00
5: 2016-01-02 00:24:00
6: 2016-01-02 00:25:00
7: 2016-01-02 00:26:00
8: 2016-01-02 00:27:00
9: 2016-01-02 00:28:00
10: 2016-01-02 00:29:00
11: 2016-01-02 00:30:00
12: 2016-01-02 00:31:00
13: 2016-01-02 00:32:00
14: 2016-01-02 00:33:00
15: 2016-01-02 00:34:00
16: 2016-01-02 00:35:00
17: 2016-01-02 00:36:00
18: 2016-01-02 00:37:00
19: 2016-01-02 00:38:00
20: 2016-01-02 00:39:00
21: 2016-01-02 00:40:00
22: 2016-01-02 00:41:00
23: 2016-01-02 00:42:00
24: 2016-01-02 00:43:00
25: 2016-01-02 00:44:00
26: 2016-01-02 00:45:00
27: 2016-01-02 00:46:00
28: 2016-01-02 00:47:00
29: 2016-01-02 00:48:00
30: 2016-01-02 00:49:00
31: 2016-01-02 00:50:00
32: 2016-01-02 00:51:00
33: 2016-01-02 00:52:00
34: 2016-01-02 00:53:00
35: 2016-01-02 00:54:00
36: 2016-01-02 00:55:00
37: 2016-01-02 00:56:00
38: 2016-01-02 00:57:00
39: 2016-01-02 00:58:00
40: 2016-01-02 00:59:00
41: 2016-01-02 01:00:00
date.time
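The join above counts emissions per event, whereas the question asks for a count per emission. A minimal sketch of the reverse direction with the same non-equi join idea follows; the helper name count_events_per_emission, its lower and upper parameters (in seconds) and the toy data are assumptions, not part of the answer.

```r
library(data.table)

# Hypothetical helper: for each emission, count the event windows that contain it.
count_events_per_emission <- function(emissions, event, lower = 10 * 60, upper = 30 * 60) {
  # One window per event, in seconds around the event time.
  ev <- event[, .(d_dn = dat - lower, d_up = dat + upper)]
  # For every emission row (by = .EACHI), count windows with d_dn <= date.time <= d_up.
  counts <- ev[emissions, on = .(d_dn <= date.time, d_up >= date.time), .N, by = .EACHI]
  data.table(date.time = emissions$date.time, n = counts$N)
}

# Toy data: one emission per minute for an hour, two events 15 minutes apart.
toy_emissions <- data.table(date.time = as.POSIXct("2016-01-01 00:00:00", tz = "UTC") + 60 * (0:59))
toy_event     <- data.table(dat = as.POSIXct("2016-01-01 00:30:30", tz = "UTC") + c(0, 15 * 60))
res <- count_events_per_emission(toy_emissions, toy_event)
print(res[n > 0])
```

Filtering with res[n > 0] gives something close to the output of the original function, with which(res$n > 0) playing the role of yid.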