"object not found" 同时使用 .SD 和 .by 中的表达式时
"object not found" when using .SD and expression in .by together
This problem is driving me crazy.
First, suppose we have this sample dataset:
library(data.table)
set.seed(42)
dt <- data.table(mydate = seq(as.Date("2009-01-01"), as.Date("2012-01-01"), by = "day"),
                 id = sample(1:5, 1096, replace = TRUE),
                 id.sub = sample(letters[1:3], 1096, replace = TRUE),
                 val = rnorm(1096))
It looks like this:
mydate id id.sub val
1: 2009-01-01 4 c -0.2712793
2: 2009-01-02 5 b 1.8967819
3: 2009-01-03 3 b 1.0168226
4: 2009-01-04 5 a 0.8324829
5: 2009-01-05 1 a -1.8251198
---
1092: 2011-12-28 4 c -1.2794301
1093: 2011-12-29 2 a 0.1221805
1094: 2011-12-30 2 c -1.2370464
1095: 2011-12-31 3 c 2.2440864
1096: 2012-01-01 2 a 1.1407802
Now I want to compute, for each id and each week, the maximum date and the mean of val over the rows where id.sub equals "b". This is how far I got:
dt[,
.(max.date = max(mydate),
mean.val = mean(.SD[id.sub == "b", val])),
by = list(id, wk = format(mydate, "%Y-%V"))]
However, the following error has me banging my head against the wall:
Error in `[.data.table`(dt, , .(max.date = max(mydate, na.rm = T), mean = sum(.SD[id.sub == :
object 'mydate' not found
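If I remove either the "mean.val" line or the "max.date" line, the code works, but with both together it fails to run. I can't see what is going wrong. Can anyone help me? Many thanks.
My data.table version is v1.9.5.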
I think you are looking for mean.val = mean(val[id.sub == "b"]). This is the more standard way to write a subset. Note that .() is an alias for list() and can also be used in by:
dt[, .(
max.date = max(mydate),
mean.val = mean(val[id.sub == "b"])),
by = .(id, wk = format(mydate, "%Y-%V"))
]
# id wk max.date mean.val
# 1: 5 2009-01 2009-01-04 1.9335678
# 2: 2 2009-01 2009-01-03 NaN
# 3: 4 2009-02 2009-01-10 0.1603871
# 4: 3 2009-02 2009-01-11 NaN
# 5: 1 2009-02 2009-01-08 NaN
# ---
# 619: 3 2011-51 2011-12-24 NaN
# 620: 1 2011-52 2011-12-28 NaN
# 621: 4 2011-52 2011-12-29 -0.8534370
# 622: 2 2011-52 2011-12-31 -1.2628962
# 623: 3 2012-52 2012-01-01 -1.7779465
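The NaN entries in this result are groups with no rows where id.sub == "b": the subset val[id.sub == "b"] is then a zero-length vector, and in R the mean() of an empty numeric vector is NaN. Below is a minimal sketch of that behaviour on made-up values, with one possible way to get NA instead. The helper name mean_or_na is hypothetical and not part of data.table.

```r
# Made-up values for a single group that has no "b" rows.
val    <- c(0.5, -1.2, 2.0)
id.sub <- c("a", "c", "a")

# Subsetting with an all-FALSE logical vector yields a zero-length vector.
b_vals <- val[id.sub == "b"]

# mean() of an empty numeric vector is NaN, which is what the result shows.
empty_mean <- mean(b_vals)

# Hypothetical helper: return NA_real_ instead of NaN for empty input.
mean_or_na <- function(x) if (length(x) == 0L) NA_real_ else mean(x)
na_result <- mean_or_na(b_vals)
```

If NA is preferred, mean_or_na(val[id.sub == "b"]) could replace the mean() call inside j.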
If we look at all the columns available after grouping, you can see why your attempt did not work.
names(dt[, .SD, by = .(id, wk = format(mydate, "%Y-%V"))])
# [1] "id" "wk" "id.sub" "val"
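As we can see, mydate is no longer there. I'll stop here, because I'm not sure I can give a technical explanation of why. As akrun said, it is because the column has been modified in the by expression.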
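One possible workaround, if you would rather keep the .SD syntax, is to create the week as a real column first and group by that column, so that mydate is no longer used inside a by expression and stays available both in j and in .SD. This is only a sketch on a small made-up table (the name small and its values are assumptions for illustration, not the dataset above):

```r
library(data.table)

# Small made-up table: six consecutive days, all in ISO week 2 of 2009.
small <- data.table(
  mydate = as.Date("2009-01-05") + 0:5,
  id     = c(1L, 1L, 1L, 2L, 2L, 2L),
  id.sub = c("b", "a", "b", "a", "c", "b"),
  val    = c(1, 2, 3, 4, 5, 6)
)

# Materialise the week as a column instead of computing it inside by.
small[, wk := format(mydate, "%Y-%V")]

# Group by plain columns; .SD still contains mydate, id.sub and val here.
res <- small[,
  .(max.date = max(mydate),
    mean.val = mean(.SD[id.sub == "b", val])),
  by = .(id, wk)]

# The vector-subsetting form from the answer, for comparison.
res2 <- small[,
  .(max.date = max(mydate),
    mean.val = mean(val[id.sub == "b"])),
  by = .(id, wk)]
```

Both forms should give the same table here. The vector form avoids building a subset of .SD for every group, which is one reason it is usually preferred.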