使用`{data.table}`编程:如何命名新列?
Programming with `{data.table}`: how to name a new column?
以下问题在使用 data.table
进行编程时似乎非常基础,如果重复,我深表歉意。我花了时间研究但找不到答案。
我想创建一个围绕 data.table
争论过程的“用户定义函数”。在此过程中,创建了一个新列,我想让用户设置该新列的名称。
例子
考虑以下按原样运行的代码。我想将它包装在一个函数中。
library(data.table)
library(magrittr)
library(tibble)
mtcars %>%
as.data.table() %>%
.[, .(max_mpg = max(mpg)), by = cyl] %>%
as_tibble()
#> # A tibble: 3 x 2
#> cyl max_mpg
#> <dbl> <dbl>
#> 1 6 21.4
#> 2 4 33.9
#> 3 8 19.2
由 reprex package (v0.3.0)
于 2021-10-13 创建
我想让我的函数做的就是让用户设置 new_colname_of_choice
的名称:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(new_colname_of_choice = max(mpg)), by = cyl] %>%
as_tibble()
}
my_wrapper(new_colname_of_choice = "my_lovely_colname")
#> # A tibble: 3 x 2
#> cyl new_colname_of_choice <---------- why this isn't called "my_lovely_colname"?
#> <dbl> <dbl>
#> 1 6 21.4
#> 2 4 33.9
#> 3 8 19.2
我试过使用花括号也没有用(实际上抛出了一个错误):
my_wrapper_2 <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .({new_colname_of_choice} = max(mpg)), by = cyl] %>%
as_tibble()
}
Error: unexpected '=' in:
" as.data.table() %>%
.[, .({new_colname_of_choice} ="
这令人惊讶,因为花括号确实提升了所需的命名能力,但在不同(但相似)的代码中:
my_wrapper_3 <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, {new_colname_of_choice} := max(mpg), by = cyl] %>%
as_tibble()
}
my_wrapper_3(new_colname_of_choice = "my_lovely_colname")
## # A tibble: 32 x 12
## mpg cyl disp hp drat wt qsec vs am gear carb my_lovely_colname <---- SUCCESS!
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 21.4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 21.4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 33.9
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 21.4
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 19.2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 21.4
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 19.2
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 33.9
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 33.9
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 21.4
## # ... with 22 more rows
底线
我的结论是 =
运算符对 LHS 上的 {...}
敏感。我怎样才能在初始 my_wrapper()
示例中将名称(来自参数)传递给 LHS?
编辑
我想为同一问题添加 dplyr
解决方案,取自 programming with dplyr vignette:
library(dplyr)
my_wrapper_dplyr <- function(new_colname_of_choice) {
mtcars %>%
group_by(cyl) %>%
summarise("{new_colname_of_choice}" := max(mpg))
}
my_wrapper_dplyr("another_lovely_colname")
它非常强大并且适用于我遇到的所有命名情况。 data.table
中是否有类似于 {dplyr}
的 built-in/canonical 实践?
您可以做的一件事是将列的创建和列的命名分开,如下所示:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(tempcol = max(mpg)), by = cyl] %>%
setnames(., "tempcol", new_colname_of_choice) %>%
as.tibble()
}
my_wrapper("my_lovely_colname")
使用此方法,您可以使用 .(tempcol = max(mpg))
或 tempcol := max(mpg)
使用 stats
中的 setNames
:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, setNames(list(max(mpg)), new_colname_of_choice), by = cyl] %>%
as_tibble()
}
my_wrapper(new_colname_of_choice = "my_lovely_colname")
随着即将推出的 data.table
version 1.14.3,您将能够使用新的 env
参数:
A new interface for programming on data.table has been added, closing #2655 and many other linked issues. It is built using base R's substitute-like interface via a new env argument to [.data.table. For details see the new vignette programming on data.table, and the new ?substitute2 manual page. Thanks to numerous users for filing requests, and Jan Gorecki for implementing.
# install dev version
install.packages("https://github.com/Rdatatable/data.table/archive/master.tar.gz", repo = NULL, type = "source")
library(tibble)
library(data.table)
my_wrapper_new <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(new_colname_of_choice = max(mpg)), by = cyl,
env=list(new_colname_of_choice = new_colname_of_choice)] %>%
as_tibble()
}
my_wrapper_new('test')
# A tibble: 3 x 2
cyl test
<dbl> <dbl>
1 6 21.4
2 4 33.9
3 8 19.2
以下问题在使用 data.table
进行编程时似乎非常基础,如果重复,我深表歉意。我花了时间研究但找不到答案。
我想创建一个围绕 data.table
争论过程的“用户定义函数”。在此过程中,创建了一个新列,我想让用户设置该新列的名称。
例子
考虑以下按原样运行的代码。我想将它包装在一个函数中。
library(data.table)
library(magrittr)
library(tibble)
mtcars %>%
as.data.table() %>%
.[, .(max_mpg = max(mpg)), by = cyl] %>%
as_tibble()
#> # A tibble: 3 x 2
#> cyl max_mpg
#> <dbl> <dbl>
#> 1 6 21.4
#> 2 4 33.9
#> 3 8 19.2
由 reprex package (v0.3.0)
于 2021-10-13 创建我想让我的函数做的就是让用户设置 new_colname_of_choice
的名称:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(new_colname_of_choice = max(mpg)), by = cyl] %>%
as_tibble()
}
my_wrapper(new_colname_of_choice = "my_lovely_colname")
#> # A tibble: 3 x 2
#> cyl new_colname_of_choice <---------- why this isn't called "my_lovely_colname"?
#> <dbl> <dbl>
#> 1 6 21.4
#> 2 4 33.9
#> 3 8 19.2
我试过使用花括号也没有用(实际上抛出了一个错误):
my_wrapper_2 <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .({new_colname_of_choice} = max(mpg)), by = cyl] %>%
as_tibble()
}
Error: unexpected '=' in: " as.data.table() %>% .[, .({new_colname_of_choice} ="
这令人惊讶,因为花括号确实提升了所需的命名能力,但在不同(但相似)的代码中:
my_wrapper_3 <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, {new_colname_of_choice} := max(mpg), by = cyl] %>%
as_tibble()
}
my_wrapper_3(new_colname_of_choice = "my_lovely_colname")
## # A tibble: 32 x 12
## mpg cyl disp hp drat wt qsec vs am gear carb my_lovely_colname <---- SUCCESS!
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 21.4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 21.4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 33.9
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 21.4
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 19.2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 21.4
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 19.2
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 33.9
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 33.9
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 21.4
## # ... with 22 more rows
底线
我的结论是 =
运算符对 LHS 上的 {...}
敏感。我怎样才能在初始 my_wrapper()
示例中将名称(来自参数)传递给 LHS?
编辑
我想为同一问题添加 dplyr
解决方案,取自 programming with dplyr vignette:
library(dplyr)
my_wrapper_dplyr <- function(new_colname_of_choice) {
mtcars %>%
group_by(cyl) %>%
summarise("{new_colname_of_choice}" := max(mpg))
}
my_wrapper_dplyr("another_lovely_colname")
它非常强大并且适用于我遇到的所有命名情况。 data.table
中是否有类似于 {dplyr}
的 built-in/canonical 实践?
您可以做的一件事是将列的创建和列的命名分开,如下所示:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(tempcol = max(mpg)), by = cyl] %>%
setnames(., "tempcol", new_colname_of_choice) %>%
as.tibble()
}
my_wrapper("my_lovely_colname")
使用此方法,您可以使用 .(tempcol = max(mpg))
或 tempcol := max(mpg)
使用 stats
中的 setNames
:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, setNames(list(max(mpg)), new_colname_of_choice), by = cyl] %>%
as_tibble()
}
my_wrapper(new_colname_of_choice = "my_lovely_colname")
随着即将推出的 data.table
version 1.14.3,您将能够使用新的 env
参数:
A new interface for programming on data.table has been added, closing #2655 and many other linked issues. It is built using base R's substitute-like interface via a new env argument to [.data.table. For details see the new vignette programming on data.table, and the new ?substitute2 manual page. Thanks to numerous users for filing requests, and Jan Gorecki for implementing.
# install dev version
install.packages("https://github.com/Rdatatable/data.table/archive/master.tar.gz", repo = NULL, type = "source")
library(tibble)
library(data.table)
my_wrapper_new <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(new_colname_of_choice = max(mpg)), by = cyl,
env=list(new_colname_of_choice = new_colname_of_choice)] %>%
as_tibble()
}
my_wrapper_new('test')
# A tibble: 3 x 2
cyl test
<dbl> <dbl>
1 6 21.4
2 4 33.9
3 8 19.2