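Odd behavior when joining with multiple conditions
While looking at a question about rolling joins with the data.table package, I ran into some odd behavior when using multiple conditions.
Consider the following datasets: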
library(data.table)
dt <- data.table(t_id = c(1,4,2,3,5), place = c("a","a","d","a","d"), num = c(5.1, 5.1, 6.2, 5.1, 6.2), key=c("place"))
dt_lu <- data.table(f_id = c(rep(1,4),rep(2,3)), place = c("a","b","c","d","a","d","a"), num = c(6,7,8,9,6,7,8), key=c("place"))
When I join dt with dt_lu, keeping only those rows of dt_lu that have the same place and where dt_lu$num is higher than dt$num, as follows:
dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[i.num < num],
               fid = f_id),
      by = .EACHI]
I get the desired result:
place tid tnum fnum fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
3: a 1 5.1 8 2
4: a 4 5.1 6 1
5: a 4 5.1 6 2
6: a 4 5.1 8 2
7: a 3 5.1 6 1
8: a 3 5.1 6 2
9: a 3 5.1 8 2
10: d 2 6.2 9 1
11: d 2 6.2 7 2
12: d 5 6.2 9 1
13: d 5 6.2 7 2
When I want to add an extra condition, I can easily get the desired result by chaining the extra condition as follows:
dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[i.num < num],
               fid = f_id),
      by = .EACHI][fnum - tnum < 2]
This gives me:
place tid tnum fnum fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
3: a 4 5.1 6 1
4: a 4 5.1 6 2
5: a 3 5.1 6 1
6: a 3 5.1 6 2
7: d 2 6.2 7 2
8: d 5 6.2 7 2
However, when I add the extra condition (i.e. the difference must be less than 2) inside the join itself, as follows:
dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[i.num < num & num - i.num < 2],
               fid = f_id),
      by = .EACHI]
I do not get the expected result:
place tid tnum fnum fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
3: a 1 5.1 6 2
4: a 4 5.1 6 1
5: a 4 5.1 6 2
6: a 4 5.1 6 2
7: a 3 5.1 6 1
8: a 3 5.1 6 2
9: a 3 5.1 6 2
10: d 2 6.2 7 1
11: d 2 6.2 7 2
12: d 5 6.2 7 1
13: d 5 6.2 7 2
In addition, I get the following warning message:
Warning message:
In `[.data.table`(dt_lu, dt, list(tid = i.t_id, tnum = i.num, fnum = num[i.num < :
  Column 3 of result for group 1 is length 2 but the longest column in this result is 3. Recycled leaving remainder of 1 items. This warning is once only for the first group with this issue.
The expected result is:
place tid tnum fnum fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
4: a 4 5.1 6 1
5: a 4 5.1 6 2
7: a 3 5.1 6 1
8: a 3 5.1 6 2
11: d 2 6.2 7 2
13: d 5 6.2 7 2
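I deliberately kept the row numbers from the first example to show which rows must be retained in the final result (this is the same as the output of the working chained solution).
It should be possible to use multiple conditions within a join operation.
I tried the following alternatives, but neither of them works: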
dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[(i.num < num) & (num - i.num < 2)],
               fid = f_id),
      by = .EACHI]

dt_lu[dt, {
        val = num[(i.num < num) & (num - i.num < 2)];
        list(tid = i.t_id,
             tnum = i.num,
             fnum = val,
             fid = f_id)},
      by = .EACHI]
Can anybody explain why I do not get the desired result when I use multiple conditions within a join operation?
The warning message gives the problem away. Using print() is also quite helpful here.
dt_lu[dt, print(i.num < num & num - i.num < 2), by=.EACHI]
# [1] TRUE TRUE FALSE
# [1] TRUE TRUE FALSE
# [1] TRUE TRUE FALSE
# [1] FALSE TRUE
# [1] FALSE TRUE
# Empty data.table (0 rows) of 3 cols: place,place,num
Consider the first case, where the condition evaluates to TRUE, TRUE, FALSE. That group has 3 observations. Your j-expression contains:
.(tid = i.t_id,
  tnum = i.num,
  fnum = num[i.num < num & num - i.num < 2],
  fid = f_id)
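i.t_id and i.num have length 1 (since they come from dt). But num[..condn..] returns a vector of length 2, whereas f_id returns a vector of length 3. The length-1 and length-2 items are recycled to the length of the longest item, which is 3. This leads to the incorrect result. Since 3 is not perfectly divisible by 2, a warning is issued.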
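The length mismatch described above can be reproduced with plain vectors. This is only a minimal sketch in base R: the values are copied by hand from the "a" and "d" groups of dt_lu, and i_num, fnum_recycled, num_d and idx_d are hypothetical names used for illustration. It uses rep_len() to imitate the recycling, so it does not depend on how a particular data.table release reacts to mismatched lengths (I believe more recent releases may raise an error instead of a warning).

```r
# Values that the j-expression sees for the first group (place "a", t_id 1).
num   <- c(6, 6, 8)   # dt_lu$num for place "a"
f_id  <- c(1, 2, 2)   # dt_lu$f_id for place "a"
i_num <- 5.1          # stands in for i.num (hypothetical name)

idx <- i_num < num & num - i_num < 2
idx                # TRUE TRUE FALSE
length(num[idx])   # 2
length(f_id)       # 3

# Recycling the filtered values to the length of the longest column
# reproduces the wrong fnum column shown in the question.
fnum_recycled <- rep_len(num[idx], length(f_id))
fnum_recycled      # 6 6 6

# Group "d" (num 9 and 7, i.num 6.2): only one value passes the filter,
# so a length-1 vector is recycled to length 2 and the row with
# f_id = 1 ends up paired with fnum = 7.
num_d <- c(9, 7)
idx_d <- 6.2 < num_d & num_d - 6.2 < 2
rep_len(num_d[idx_d], length(num_d))   # 7 7
```

Note that the "d" group is wrong as well, even though length-1 recycling is perfectly regular, which is why the warning mentions only group 1.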
What you intended to do is:
.(tid = i.t_id,
  tnum = i.num,
  fnum = num[i.num < num & num - i.num < 2],
  fid = f_id[i.num < num & num - i.num < 2])
Or equivalently:
{
idx = i.num < num & num - i.num < 2
.(tid = i.t_id, tnum = i.num, fnum = num[idx], fid = f_id[idx])
}
Putting it together:
dt_lu[dt,
      {
        idx = i.num < num & num - i.num < 2
        .(tid = i.t_id, tnum = i.num, fnum = num[idx], fid = f_id[idx])
      },
      by = .EACHI]
# place tid tnum fnum fid
# 1: a 1 5.1 6 1
# 2: a 1 5.1 6 2
# 3: a 4 5.1 6 1
# 4: a 4 5.1 6 2
# 5: a 3 5.1 6 1
# 6: a 3 5.1 6 2
# 7: d 2 6.2 7 2
# 8: d 5 6.2 7 2
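As a sanity check, the chained version from the question and the in-j version can be compared directly. The following is a sketch under the assumption that the data are exactly the two tables defined in the question; chained, in_j and same are hypothetical names introduced here. The comparison is done column by column so that it does not depend on keys or other attributes of the two results.

```r
library(data.table)

dt <- data.table(t_id = c(1, 4, 2, 3, 5),
                 place = c("a", "a", "d", "a", "d"),
                 num = c(5.1, 5.1, 6.2, 5.1, 6.2),
                 key = "place")
dt_lu <- data.table(f_id = c(rep(1, 4), rep(2, 3)),
                    place = c("a", "b", "c", "d", "a", "d", "a"),
                    num = c(6, 7, 8, 9, 6, 7, 8),
                    key = "place")

# Approach from the question: apply the first condition in j,
# then chain the second condition as a filter on the result.
chained <- dt_lu[dt, list(tid = i.t_id,
                          tnum = i.num,
                          fnum = num[i.num < num],
                          fid = f_id),
                 by = .EACHI][fnum - tnum < 2]

# Approach from the answer: build one logical index per group and
# apply it to every column taken from dt_lu.
in_j <- dt_lu[dt,
              {
                idx = i.num < num & num - i.num < 2
                .(tid = i.t_id, tnum = i.num, fnum = num[idx], fid = f_id[idx])
              },
              by = .EACHI]

# Compare the value columns one by one.
same <- identical(chained$tid, in_j$tid) &&
  identical(chained$fnum, in_j$fnum) &&
  identical(chained$fid, in_j$fid)
print(same)
```

The chained version only works here because, for this particular data, every num in a matching group already exceeds i.num; indexing every dt_lu column with the same idx is the more robust pattern.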