Programming with `{data.table}`: how to name a new column?

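The following question seems very basic for anyone programming with data.table, so I apologize if it is a duplicate. I spent time researching but could not find an answer.

I want to write a user-defined function that wraps a data.table wrangling procedure. Within this procedure a new column is created, and I want to let the user set the name of that new column.

Example

Consider the following code, which runs as is. I want to wrap it in a function.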

library(data.table)
library(magrittr)
library(tibble)

mtcars %>%
  as.data.table() %>%
  .[, .(max_mpg = max(mpg)), by = cyl] %>%
  as_tibble()
#> # A tibble: 3 x 2
#>     cyl max_mpg
#>   <dbl>   <dbl>
#> 1     6    21.4
#> 2     4    33.9
#> 3     8    19.2

Created on 2021-10-13 by the reprex package (v0.3.0)

All I want my function to do is let the user set the name that is currently hard-coded as new_colname_of_choice:

my_wrapper <- function(new_colname_of_choice) {
  mtcars %>%
    as.data.table() %>%
    .[, .(new_colname_of_choice = max(mpg)), by = cyl] %>%
    as_tibble()
}


my_wrapper(new_colname_of_choice = "my_lovely_colname")
#> # A tibble: 3 x 2
#>     cyl new_colname_of_choice <---------- why this isn't called "my_lovely_colname"?
#>   <dbl>                 <dbl>
#> 1     6                  21.4
#> 2     4                  33.9
#> 3     8                  19.2
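
For what it's worth, the same behaviour can be reproduced without data.table. Inside j, .() is an alias for list(), and in a list() call the text to the left of = is taken literally as the element name rather than looked up as a variable. A minimal base R sketch (nm is a hypothetical variable used only for illustration):

```r
# Hypothetical variable holding the name we would like to use.
nm <- "my_lovely_colname"

# The text left of `=` becomes the element name as written;
# the variable `nm` is never looked up.
literal <- list(nm = 1)
names(literal)
```

So the function argument is never consulted; the column simply gets the name that was typed.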

I also tried curly braces, which did not work either (it actually threw an error):

my_wrapper_2 <- function(new_colname_of_choice) {
  
  mtcars %>%
    as.data.table() %>%
    .[, .({new_colname_of_choice} = max(mpg)), by = cyl] %>%
    as_tibble()
  
}

Error: unexpected '=' in: " as.data.table() %>% .[, .({new_colname_of_choice} ="
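
The error above seems to be raised by R's parser before data.table ever sees the call: the left-hand side of = inside a function call has to be a plain name or a string, not an expression such as {...}. A small base R sketch that triggers the same kind of syntax error, using parse() on a string so the failure can be caught (the name nm is hypothetical):

```r
# Parsing a call whose argument name is a braced expression fails,
# regardless of which function is being called.
bad <- tryCatch(
  parse(text = "list({nm} = 1)"),
  error = function(e) conditionMessage(e)
)
is.character(bad)  # TRUE: the parse error message was captured

# A quoted string is accepted as an argument name, but it is still literal.
quoted <- list("nm" = 1)
names(quoted)
```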

This is surprising, because curly braces do enable the desired naming in different (but similar) code:

my_wrapper_3 <- function(new_colname_of_choice) {
  mtcars %>%
    as.data.table() %>%
    .[, {new_colname_of_choice} := max(mpg), by = cyl] %>%
    as_tibble()
}


my_wrapper_3(new_colname_of_choice = "my_lovely_colname")

## # A tibble: 32 x 12
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb my_lovely_colname <---- SUCCESS!
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>             <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4              21.4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4              21.4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1              33.9
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1              21.4
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2              19.2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1              21.4
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4              19.2
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2              33.9
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2              33.9
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4              21.4
## # ... with 22 more rows
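
The difference appears to come from how R parses the two forms. As far as the parser is concerned, := is an ordinary binary operator, so any expression may appear on its left, and data.table evaluates a non-symbol LHS to obtain the column name. The form more commonly seen in data.table code is (name) := value. A minimal sketch, where nm is a hypothetical variable:

```r
library(data.table)

# Hypothetical column name, used only for illustration.
nm <- "my_lovely_colname"

# R parses `:=` like any other binary operator, so the braced LHS is legal syntax.
e <- quote({nm} := max(mpg))
class(e)

dt <- as.data.table(mtcars)

# Wrapping the variable in parentheses makes data.table use its value
# as the name of the new column.
dt[, (nm) := max(mpg), by = cyl]

names(dt)
```

Note that := adds a column by reference instead of summarising, which is why all 32 rows come back rather than one row per group.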

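Bottom line

My conclusion is that, unlike :=, the = operator does not accept {...} on its LHS. How can I pass the name (from the argument) to the LHS in the initial my_wrapper() example?

Edit

I want to add a dplyr solution to the same problem, taken from the Programming with dplyr vignette: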

library(dplyr)

my_wrapper_dplyr <- function(new_colname_of_choice) {
  mtcars %>%
    group_by(cyl) %>%
    summarise("{new_colname_of_choice}" := max(mpg))
}

my_wrapper_dplyr("another_lovely_colname")

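It is very powerful and works in every naming situation I have come across. Is there a built-in/canonical practice in data.table similar to the {dplyr} one?

One thing you can do is separate the creation of the column from its naming, like this: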

my_wrapper <- function(new_colname_of_choice) {
  mtcars %>%
    as.data.table() %>%
    .[, .(tempcol = max(mpg)), by = cyl] %>%
    setnames(., "tempcol", new_colname_of_choice) %>%
    as_tibble()
}

my_wrapper("my_lovely_colname")

With this approach you can use either .(tempcol = max(mpg)) or tempcol := max(mpg).

Use setNames from stats:

my_wrapper <- function(new_colname_of_choice) {
  mtcars %>%
    as.data.table() %>%
    .[, setNames(list(max(mpg)), new_colname_of_choice), by = cyl] %>%
    as_tibble()
}

my_wrapper(new_colname_of_choice = "my_lovely_colname")
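
The same idea seems to extend to several summary columns, since j only needs to evaluate to a named list. A sketch under that assumption (the function name my_wrapper_multi and the max/min pair are made up for illustration):

```r
library(data.table)

# Hypothetical helper: `new_names` must supply one name per summary below.
my_wrapper_multi <- function(new_names) {
  dt <- as.data.table(mtcars)
  # Build an unnamed list of summaries, then name it from the argument.
  dt[, setNames(list(max(mpg), min(mpg)), new_names), by = cyl]
}

res <- my_wrapper_multi(c("mpg_max", "mpg_min"))
names(res)
```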

With the upcoming data.table version 1.14.3, you will be able to use the new env argument:

A new interface for programming on data.table has been added, closing #2655 and many other linked issues. It is built using base R's substitute-like interface via a new env argument to [.data.table. For details see the new vignette programming on data.table, and the new ?substitute2 manual page. Thanks to numerous users for filing requests, and Jan Gorecki for implementing.

# install dev version
install.packages("https://github.com/Rdatatable/data.table/archive/master.tar.gz", repos = NULL, type = "source")

library(tibble)
library(data.table)

my_wrapper_new <- function(new_colname_of_choice) {
  
  mtcars %>%
    as.data.table() %>%
    .[, .(new_colname_of_choice = max(mpg)), by = cyl, 
      env=list(new_colname_of_choice = new_colname_of_choice)] %>%
    as_tibble()
  
}

my_wrapper_new('test')

#> # A tibble: 3 x 2
#>     cyl  test
#>   <dbl> <dbl>
#> 1     6  21.4
#> 2     4  33.9
#> 3     8  19.2
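
Going by the description above, env can substitute more than the output name. A sketch, assuming a data.table installation that already provides env and that character values in env are turned into symbols, as described in the Programming on data.table vignette (the wrapper name summarise_by and its arguments are hypothetical):

```r
library(data.table)

# Hypothetical wrapper: every argument except `dt` is a character string.
# Assumes the installed data.table supports the `env` argument.
summarise_by <- function(dt, new_name, fun_name, col_name, grp_name) {
  # `out`, `fun`, `col` and `grp` are placeholders replaced via `env`.
  dt[, .(out = fun(col)), by = grp,
     env = list(out = new_name, fun = fun_name, col = col_name, grp = grp_name)]
}

res <- summarise_by(as.data.table(mtcars), "max_mpg", "max", "mpg", "cyl")
names(res)
```

With these inputs the call should be equivalent to writing .(max_mpg = max(mpg)) with by = cyl by hand.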