"object not found" 同时使用 .SD 和 .by 中的表达式时

"object not found" when using .SD and expression in .by together

This problem is driving me crazy.

Suppose we have this sample dataset:

library(data.table)

set.seed(42)  # make the sampled columns reproducible
dt <- data.table(mydate = seq(as.Date("2009-01-01"), as.Date("2012-01-01"), by = "day"),
                 id = sample(1:5, 1096, replace = TRUE),
                 id.sub = sample(letters[1:3], 1096, replace = TRUE),
                 val = rnorm(1096))

It looks like this:

           mydate id id.sub        val
   1: 2009-01-01  4      c -0.2712793
   2: 2009-01-02  5      b  1.8967819
   3: 2009-01-03  3      b  1.0168226
   4: 2009-01-04  5      a  0.8324829
   5: 2009-01-05  1      a -1.8251198
  ---                                
1092: 2011-12-28  4      c -1.2794301
1093: 2011-12-29  2      a  0.1221805
1094: 2011-12-30  2      c -1.2370464
1095: 2011-12-31  3      c  2.2440864
1096: 2012-01-01  2      a  1.1407802

Now, for each id and each week, I want to compute the maximum date and the mean of val over the rows where id.sub equals "b". This is how far I got:

dt[,
   .(max.date = max(mydate),
     mean.val = mean(.SD[id.sub == "b", val])),
   by = list(id, wk = format(mydate, "%Y-%V"))]

However, the following error has had me banging my head against the wall:

Error in `[.data.table`(dt, , .(max.date = max(mydate, na.rm = T), mean = sum(.SD[id.sub ==  : 

  object 'mydate' not found

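If I remove either the "mean.val" line or the "max.date" line, the code works, but with both lines together it fails. I can't see what is going wrong. Can anyone help me? Many thanks.

My data.table version is v1.9.5.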

I think you are looking for mean.val = mean(val[id.sub == "b"]). This is a more standard way to write the subset. Note that .() is an alias for list() and can also be used in by.

dt[, .(
    max.date = max(mydate),
    mean.val = mean(val[id.sub == "b"])),
    by = .(id, wk = format(mydate, "%Y-%V"))
]
#      id      wk   max.date   mean.val
#   1:  5 2009-01 2009-01-04  1.9335678
#   2:  2 2009-01 2009-01-03        NaN
#   3:  4 2009-02 2009-01-10  0.1603871
#   4:  3 2009-02 2009-01-11        NaN
#   5:  1 2009-02 2009-01-08        NaN
#  ---                                 
# 619:  3 2011-51 2011-12-24        NaN
# 620:  1 2011-52 2011-12-28        NaN
# 621:  4 2011-52 2011-12-29 -0.8534370
# 622:  2 2011-52 2011-12-31 -1.2628962
# 623:  3 2012-52 2012-01-01 -1.7779465
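The NaN entries above are groups that contain no "b" rows: the subset val[id.sub == "b"] is then empty, and mean() of an empty numeric vector is NaN. A minimal sketch on made-up data (the table toy and its values are hypothetical, not taken from the question):

```r
library(data.table)

# Hypothetical toy data: id 1 has two "b" rows, id 2 has none
toy <- data.table(
  id     = c(1L, 1L, 2L, 2L),
  id.sub = c("b", "b", "a", "c"),
  val    = c(2, 4, 10, 20)
)

# Same subsetting idiom as in the answer, grouped by id only
res <- toy[, .(mean.val = mean(val[id.sub == "b"])), by = id]
print(res)
```

If you would rather drop those groups than keep NaN, one option is to filter first in i, for example toy[id.sub == "b", .(mean.val = mean(val)), by = id], at the cost of losing the non-"b" rows for any other summary computed in the same call.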

If we look at all the columns that are available after grouping, you can see why your attempt does not work.

names(dt[, .SD, by = .(id, wk = format(mydate, "%Y-%V"))])
# [1] "id"     "wk"     "id.sub" "val"   

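As we can see, mydate is no longer there. I'll stop here, because I'm not sure I can give a technical explanation of why. As akrun said, it is because it has been modified.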
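If you do want to keep the .SD form, one possible workaround is to store the week as a real column first and group by that column, so that mydate is not consumed by an expression in by. This is only a sketch on made-up data (toy and its values are hypothetical), and whether the original call still fails may depend on your data.table version:

```r
library(data.table)

# Hypothetical toy data, not the dataset from the question
toy <- data.table(
  mydate = as.Date("2009-01-05") + 0:3,
  id     = c(1L, 1L, 2L, 2L),
  id.sub = c("b", "a", "b", "b"),
  val    = c(1, 2, 3, 5)
)

# Materialize the grouping key as a column (modifies toy by reference)
toy[, wk := format(mydate, "%Y-%V")]

# by now names plain columns only, so mydate stays available in j
res <- toy[, .(max.date = max(mydate),
               mean.val = mean(.SD[id.sub == "b", val])),
           by = .(id, wk)]
print(res)
```

Subsetting .SD inside every group tends to be slower than the plain vector form val[id.sub == "b"], so I would still prefer the version shown earlier in this answer.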