.SD in a data.table join to refer to an arbitrary list of columns in i
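Problem: compute a weighted mean of the value columns of one table, using weights taken from another table that is matched on a join key.
Here are the steps as a reprex: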
library(data.table)
#DT1 table of values - here just 2 columns, but may be an arbitrary number
DT1 <- data.table(k1 = c('A1','A2','A3'),
                  k2 = c('X','X','Y'),
                  v1 = c(10,11,12),
                  v2 = c(.5, .6, 1.7))
#DT2 table of weights - columns correspond to value columns in table 1
DT2 <- data.table(k2 = c('X','Y'),
                  w1 = c(5,2),
                  w2 = c(1,7))
#Vectors of corresponding column names (could be any number of columns)
vals <- c('v1','v2')
weights <- c('w1','w2')
i.weights <- paste0('i.', weights)
#1. This returns all columns
DT1[DT2, on=.(k2)]
#> k1 k2 v1 v2 w1 w2
#> 1: A1 X 10 0.5 5 1
#> 2: A2 X 11 0.6 5 1
#> 3: A3 Y 12 1.7 2 7
#2. This use of SD is standard
DT1[DT2, on=.(k2), .SD, .SDcols = vals, by=.(k1)]
#> k1 v1 v2
#> 1: A1 10 0.5
#> 2: A2 11 0.6
#> 3: A3 12 1.7
#3. But refer to the columns of i (DT2) and it fails, both without and with the i. prefix
DT1[DT2, on=.(k2), .SD, .SDcols = weights, by=.(k1)]
#> Error in `[.data.table`(DT1, DT2, on = .(k2), .SD, .SDcols = weights, : Some items of .SDcols are not column names: [w1, w2]
DT1[DT2, on=.(k2), .SD, .SDcols = i.weights, by=.(k1)]
#> Error in `[.data.table`(DT1, DT2, on = .(k2), .SD, .SDcols = i.weights, : Some items of .SDcols are not column names: [i.w1, i.w2]
#4. So following suggestion in
# turn to mget() - in one command it fails
DT1[DT2, on=.(k2), c(mget(vals), mget(weights)), by=.(k1,k2)]
#> Error: value for 'w1' not found
#5. But by exploiting 1. above and splitting into chained queries we get success!
DT1[DT2, on=.(k2),][, c(mget(vals), mget(weights)), by=.(k1,k2)]
#> k1 k2 v1 v2 w1 w2
#> 1: A1 X 10 0.5 5 1
#> 2: A2 X 11 0.6 5 1
#> 3: A3 Y 12 1.7 2 7
#6. Now we can turn to the original intention, but no luck
DT1[DT2, on=.(k2)][, .(wmean = weighted.mean(mget(vals), mget(weights))), by=.(k1,k2)]
#> Error in x * w: non-numeric argument to binary operator
#7. One more step - turn the lists returned by mget to data.tables - hurrahh!
DT1[DT2, on=.(k2)][, .(wmean = weighted.mean(setDT(mget(vals)), setDT(mget(weights)))), by=.(k1,k2)]
#> k1 k2 wmean
#> 1: A1 X 8.416667
#> 2: A2 X 9.266667
#> 3: A3 Y 3.988889
Created on 2021-11-26 by the reprex package (v2.0.0)
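Is it really this hard? Is there a more direct (and ideally more performant) way to do this?
Corollary: I actually want to use this calculation to create a new column in DT1, but because it ends up as two chained queries I cannot assign within this command. I have to create a new table and join it back to the original to add the column. Is there a solution that avoids this extra step?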
Another approach is to melt both tables from wide to long and then join them to each other.
molten_dt1 = melt(DT1, measure.vars = vals)[, variable := as.integer(substring(variable, 2))]
molten_dt2 = melt(DT2, measure.vars = weights)[, variable := as.integer(substring(variable, 2))]
molten_dt1[molten_dt2,
           on = .(k2, variable)
           ][,
             weighted.mean(value, i.value),
             by = .(k1, k2)]
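The reason this is not straightforward is that we need a parallel column lookup (i.e. v1 * w1 and v2 * w2), and complexity always increases when we have to account for that kind of relationship between columns. Melting the data simplifies the approach because the long structure lets us join on the column index as well as on k2, and because weighted.mean then receives plain vectors rather than data.frames.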
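To make the "vectors rather than data.frames" point concrete, here is a small sketch of what a single group reduces to after the melt and join. The numbers are the A1 row from the question; the variable name `a1` is only for illustration.

```r
# For k1 = "A1" (k2 = "X") the long-format join yields
#   value   = c(10, 0.5)  # v1, v2 from DT1
#   i.value = c(5, 1)     # w1, w2 from DT2
# so each group is an ordinary weighted mean of two numeric vectors.
a1 <- weighted.mean(c(10, 0.5), c(5, 1))

# Same thing written out: (10*5 + 0.5*1) / (5 + 1)
print(a1)
#> [1] 8.416667
```

This matches the wmean reported for A1 in step 7 of the question.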
One more note: if you define a new weighted.mean() method for lists, which removes the need for setDT, you can simplify the original approach.
## slight changes made to stats:::weighted.mean.default
weighted.mean.list = function (x, w, ..., na.rm = FALSE)
{
  x = unlist(x)
  if (missing(w)) {
    if (na.rm)
      x <- x[!is.na(x)]
    return(sum(x)/length(x))
  }
  w = unlist(w)
  if (length(w) != length(x))
    stop("'x' and 'w' must have the same length")
  if (na.rm) {
    i <- !is.na(x)
    w <- w[i]
    x <- x[i]
  }
  sum((x * w)[w != 0])/sum(w)
}
DT1[DT2, on=.(k2)][, .(wmean = weighted.mean(mget(vals), mget(weights))), by=.(k1,k2)]
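On the corollary about adding the result to DT1 as a new column without joining a second table back: one possible sketch, not part of the original answer, is to look up the weights in DT1's row order and do the arithmetic on matrices. It assumes every k2 value appears at most once in DT2, so the lookup returns exactly one row per DT1 row; the names `aligned`, `V` and `W` are hypothetical.

```r
library(data.table)

# Data from the question
DT1 <- data.table(k1 = c('A1','A2','A3'),
                  k2 = c('X','X','Y'),
                  v1 = c(10,11,12),
                  v2 = c(.5, .6, 1.7))
DT2 <- data.table(k2 = c('X','Y'),
                  w1 = c(5,2),
                  w2 = c(1,7))
vals    <- c('v1','v2')
weights <- c('w1','w2')

# X[Y] returns rows in the order of Y, so this gives the weights for each
# row of DT1, in DT1's order (assumption: k2 is unique in DT2).
aligned <- DT2[DT1, on = .(k2)]
stopifnot(nrow(aligned) == nrow(DT1))

# Values and weights as matrices with matching column order (v1/w1, v2/w2)
V <- as.matrix(DT1[, .SD, .SDcols = vals])
W <- as.matrix(aligned[, .SD, .SDcols = weights])

# Row-wise weighted mean, added to DT1 by reference
DT1[, wmean := rowSums(V * W) / rowSums(W)]
```

Because the computation is vectorised over all rows instead of grouped by (k1, k2), it should also avoid the per-group overhead of the chained query, though that has not been benchmarked here. A DT1 row whose k2 has no match in DT2 would get NA weights and therefore an NA wmean.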