Pre-programming components for data.table's j argument when grouping

Question

I have a large data.table that I work with programmatically, and I repeatedly perform operations like these:

d.regionOffice <- d.input[, .(sales = sum(sales)), .(region, office)]

d.region <- d.regionOffice[, .(sales = sum(sales)), .(region)]

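Besides sales = sum(sales), I have other variables that I reuse often, usually with longer names.

Is there a way to capture this common structure once and then use it inside a data.table?

I tried something naive like this: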

l.sales <- list(sales = sum(sales))

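But R gives the error "Error: object 'sales' not found". Is there a workaround?

Note that I have several common summary statistics, such as profit = sum(profit), customers = sum(customers), and so on, so a custom function that takes only the by argument is not enough.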

Answer 1

If I understand correctly, the OP is looking for a shortcut to write aggregations with less typing.

Instead of typing

library(data.table)
DT <- as.data.table(iris)

DT[, .(Sepal.Length = mean(Sepal.Length), Petal.Length = mean(Petal.Length)), by = Species]

      Species Sepal.Length Petal.Length
1:     setosa        5.006        1.462
2: versicolor        5.936        4.260
3:  virginica        6.588        5.552

we can write

cols <- c("Sepal.Length", "Petal.Length")
DT[, lapply(.SD, mean), .SDcols = cols, by = Species]

      Species Sepal.Length Petal.Length
1:     setosa        5.006        1.462
2: versicolor        5.936        4.260
3:  virginica        6.588        5.552
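Applied to the question's setting, the same pattern might look like the sketch below. The table d.input and its values are made up for illustration; only the column names region, office, sales, profit, and customers come from the question, and sum.cols is a hypothetical helper name.

```r
library(data.table)

# Hypothetical toy data; the values are invented for illustration
d.input <- data.table(
  region    = c("East", "East", "East", "West", "West"),
  office    = c("A", "A", "B", "C", "C"),
  sales     = c(10, 20, 30, 40, 50),
  profit    = c(1, 2, 3, 4, 5),
  customers = c(5, 5, 10, 20, 20)
)

# Name the summed columns once and reuse the vector in every aggregation
sum.cols <- c("sales", "profit", "customers")

# First roll up to region/office, then roll that result up to region
d.regionOffice <- d.input[, lapply(.SD, sum), .SDcols = sum.cols, by = .(region, office)]
d.region       <- d.regionOffice[, lapply(.SD, sum), .SDcols = sum.cols, by = .(region)]
```

Adding another summed column then only requires extending sum.cols, not editing each aggregation call.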

For convenience, this can be wrapped in a function:

agg <- function(dt, cols, grp, fct = sum) {
  dt[, lapply(.SD, fct), .SDcols = cols, by = grp]
}

agg(DT, cols, "Species", mean)

      Species Sepal.Length Petal.Length
1:     setosa        5.006        1.462
2: versicolor        5.936        4.260
3:  virginica        6.588        5.552

# using default aggregation function
agg(DT, cols, "Species")

      Species Sepal.Length Petal.Length
1:     setosa        250.3         73.1
2: versicolor        296.8        213.0
3:  virginica        329.4        277.6

# totals without grouping
agg(DT, cols, , mean)

   Sepal.Length Petal.Length
1:     5.843333        3.758

Or, with another data.table:

DT2 <- as.data.table(mtcars, keep.rownames = TRUE)
agg(DT2, c("wt", "hp"), "cyl", sum)

   cyl     wt   hp
1:   6 21.820  856
2:   4 25.143  909
3:   8 55.989 2929

agg(DT2, c("wt", "hp"), "cyl", length)

   cyl wt hp
1:   6  7  7
2:   4 11 11
3:   8 14 14
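If the summaries differ per column, so that a single function over .SDcols does not fit, a related option is to store the j expression unevaluated with quote() and evaluate it inside the table with eval(). This is only a sketch and not part of the approach above; l.summary, avg.profit, and the toy data are hypothetical, chosen for illustration. It also suggests why the attempt in the question failed: list(sales = sum(sales)) is evaluated immediately in the calling environment, where no object named sales exists.

```r
library(data.table)

# Hypothetical toy data; the values are invented for illustration
d.input <- data.table(
  region = c("East", "East", "West"),
  office = c("A", "B", "C"),
  sales  = c(10, 30, 90),
  profit = c(1, 3, 9)
)

# quote() stores the expression without evaluating it,
# so 'sales' and 'profit' are not looked up yet
l.summary <- quote(list(sales = sum(sales), avg.profit = mean(profit)))

# eval() in j evaluates the stored expression within each group,
# where the columns of d.input are visible
d.region <- d.input[, eval(l.summary), by = region]
```

The stored expression can be reused with any by argument, at the cost of being a little less transparent than a plain function.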

Answer 2

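Another option is simply to use code snippets. The original goal was to reduce repetitive typing, which can be done programmatically with the solution above or semi-manually with code snippets in RStudio.

In RStudio, go to: Tools > Global Options > Code > Edit Snippets (at the bottom)

Then add a snippet, for example: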

snippet gwp
    gross.written.premium = sum(gross.written.premium)

Then, while writing code, you only need to type gwp[tab] and it expands to the full expression.
