Odd behavior when joining with multiple conditions

While working with rolling joins in the data.table package, I ran into some odd behavior when using multiple conditions.

Consider the following datasets:

library(data.table)

dt <- data.table(t_id = c(1,4,2,3,5), place = c("a","a","d","a","d"), num = c(5.1, 5.1, 6.2, 5.1, 6.2), key=c("place"))
dt_lu <- data.table(f_id = c(rep(1,4),rep(2,3)), place = c("a","b","c","d","a","d","a"), num = c(6,7,8,9,6,7,8), key=c("place"))

When I join dt with dt_lu, keeping only those rows of dt_lu that have the same place and where dt_lu$num is higher than dt$num, as follows:

dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[i.num < num],
               fid = f_id),
      by = .EACHI]

I get the desired result:

    place tid tnum fnum fid
 1:     a   1  5.1    6   1
 2:     a   1  5.1    6   2
 3:     a   1  5.1    8   2
 4:     a   4  5.1    6   1
 5:     a   4  5.1    6   2
 6:     a   4  5.1    8   2
 7:     a   3  5.1    6   1
 8:     a   3  5.1    6   2
 9:     a   3  5.1    8   2
10:     d   2  6.2    9   1
11:     d   2  6.2    7   2
12:     d   5  6.2    9   1
13:     d   5  6.2    7   2

When I want to add an additional condition, I can easily get the desired result by chaining the extra condition, as follows:

dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[i.num < num],
               fid = f_id),
      by = .EACHI][fnum - tnum < 2]

This gives me:

   place tid tnum fnum fid
1:     a   1  5.1    6   1
2:     a   1  5.1    6   2
3:     a   4  5.1    6   1
4:     a   4  5.1    6   2
5:     a   3  5.1    6   1
6:     a   3  5.1    6   2
7:     d   2  6.2    7   2
8:     d   5  6.2    7   2

However, when I add the extra condition (i.e., the difference must be less than 2) inside the join itself, as follows:

dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[i.num < num & num - i.num < 2],
               fid = f_id),
      by = .EACHI]

I do not get the expected result:

    place tid tnum fnum fid
 1:     a   1  5.1    6   1
 2:     a   1  5.1    6   2
 3:     a   1  5.1    6   2
 4:     a   4  5.1    6   1
 5:     a   4  5.1    6   2
 6:     a   4  5.1    6   2
 7:     a   3  5.1    6   1
 8:     a   3  5.1    6   2
 9:     a   3  5.1    6   2
10:     d   2  6.2    7   1
11:     d   2  6.2    7   2
12:     d   5  6.2    7   1
13:     d   5  6.2    7   2

In addition, I get the following warning message:

Warning message:
In `[.data.table`(dt_lu, dt, list(tid = i.t_id, tnum = i.num, fnum = num[i.num <  :
  Column 3 of result for group 1 is length 2 but the longest column in this result is 3. Recycled leaving remainder of 1 items. This warning is once only for the first group with this issue.

The expected result is:

    place tid tnum fnum fid
 1:     a   1  5.1    6   1
 2:     a   1  5.1    6   2
 4:     a   4  5.1    6   1
 5:     a   4  5.1    6   2
 7:     a   3  5.1    6   1
 8:     a   3  5.1    6   2
11:     d   2  6.2    7   2
13:     d   5  6.2    7   2

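I deliberately kept the row numbers from the first example to show which rows must remain in the final result (this is the same as the output of the working chained solution).

As has been shown, it should be possible to use multiple conditions within a join operation.

I tried the following alternatives, but neither of them works: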

dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[(i.num < num) & (num - i.num < 2)],
               fid = f_id),
      by = .EACHI]

dt_lu[dt, {
  val = num[(i.num < num) & (num - i.num < 2)];
  list(tid = i.t_id,
       tnum = i.num,
       fnum = val,
       fid = f_id)},
  by = .EACHI]

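Can anybody explain why I do not get the desired result with multiple conditions inside the join operation?

The warning message gives the problem away. Also, using print() is very helpful here.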

dt_lu[dt, print(i.num < num & num - i.num < 2), by=.EACHI]
# [1]  TRUE  TRUE FALSE
# [1]  TRUE  TRUE FALSE
# [1]  TRUE  TRUE FALSE
# [1] FALSE  TRUE
# [1] FALSE  TRUE
# Empty data.table (0 rows) of 3 cols: place,place,num

Consider the first case, where the condition evaluates to TRUE, TRUE, FALSE. That group has 3 observations. Your j-expression contains:

.(tid = i.t_id,
  tnum = i.num,
  fnum = num[i.num < num & num - i.num < 2],
  fid = f_id)

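i.t_id and i.num have length 1 (because they come from dt). But num[..condn..] returns length 2, whereas f_id returns length 3. The length-1 and length-2 items are recycled to the length of the longest item/vector, which is 3. This leads to an incorrect result. Since 3 is not evenly divisible by 2, a warning is issued.

What you intended to do is: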
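The recycling for the first group can be reproduced by hand. Below is a minimal sketch in base R, assuming the three dt_lu rows for place "a" (num = 6, 6, 8 and f_id = 1, 2, 2) and the dt value 5.1. The variable names i_num and fnum_recycled are made up for this illustration, and rep_len() only stands in for the recycling reported in the warning; it is not how data.table implements it internally.

```r
# Values for the first group (place == "a"), copied from the example data.
num   <- c(6, 6, 8)   # dt_lu$num for place "a"
f_id  <- c(1, 2, 2)   # dt_lu$f_id for place "a"
i_num <- 5.1          # dt$num for the row being joined (i.num inside the join)

idx  <- i_num < num & num - i_num < 2   # TRUE TRUE FALSE
fnum <- num[idx]                        # length 2: only the filtered values
fid  <- f_id                            # length 3: not filtered at all

# Emulate recycling the shorter column up to the longest length (3).
fnum_recycled <- rep_len(fnum, length(fid))

print(data.frame(fnum = fnum_recycled, fid = fid))
#   fnum fid
# 1    6   1
# 2    6   2
# 3    6   2
```

The third row pairs fnum = 6 with the f_id of the row whose num is actually 8, which is the spurious third row in the unexpected output.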

.(tid = i.t_id,
  tnum = i.num,
  fnum = num[i.num < num & num - i.num < 2],
  fid = f_id[i.num < num & num - i.num < 2])

Or equivalently:

{  
  idx = i.num < num & num - i.num < 2
  .(tid  = i.t_id, tnum = i.num, fnum = num[idx], fid  = f_id[idx])
}

Putting it together:

dt_lu[dt,
      {
        idx = i.num < num & num - i.num < 2
        .(tid = i.t_id, tnum = i.num, fnum = num[idx], fid = f_id[idx])
      },
      by = .EACHI]
#    place tid tnum fnum fid
# 1:     a   1  5.1    6   1
# 2:     a   1  5.1    6   2
# 3:     a   4  5.1    6   1
# 4:     a   4  5.1    6   2
# 5:     a   3  5.1    6   1
# 6:     a   3  5.1    6   2
# 7:     d   2  6.2    7   2
# 8:     d   5  6.2    7   2
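As a sanity check, the corrected join can be compared against the chained version from the question. This is only a sketch: the names res_chain, res_fixed, and same are hypothetical, and the comparison goes through as.data.frame() with check.attributes = FALSE so that only the column values are compared, not keys or other attributes.

```r
library(data.table)

dt <- data.table(t_id = c(1, 4, 2, 3, 5),
                 place = c("a", "a", "d", "a", "d"),
                 num = c(5.1, 5.1, 6.2, 5.1, 6.2),
                 key = "place")
dt_lu <- data.table(f_id = c(rep(1, 4), rep(2, 3)),
                    place = c("a", "b", "c", "d", "a", "d", "a"),
                    num = c(6, 7, 8, 9, 6, 7, 8),
                    key = "place")

# Approach from the question: join first, then filter in a chained call.
res_chain <- dt_lu[dt,
                   list(tid = i.t_id, tnum = i.num,
                        fnum = num[i.num < num], fid = f_id),
                   by = .EACHI][fnum - tnum < 2]

# Approach from the answer: apply the same index to every column taken from dt_lu.
res_fixed <- dt_lu[dt,
                   {
                     idx = i.num < num & num - i.num < 2
                     .(tid = i.t_id, tnum = i.num, fnum = num[idx], fid = f_id[idx])
                   },
                   by = .EACHI]

# Compare values only; keys and other attributes are ignored.
same <- isTRUE(all.equal(as.data.frame(res_chain),
                         as.data.frame(res_fixed),
                         check.attributes = FALSE))
print(same)
```

Computing idx once and reusing it for both num and f_id keeps every dt_lu column the same length within a group, so only the length-1 values from dt need to be recycled.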