Can you prevent factor coercion when summarising with `data.table` and `c()`?
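I would like to be able to use both summary functions and standard expressions in `data.table`. I have found that `c()` works well for this, but it coerces factors to their integer representation.
Is there a simple way in `data.table` to combine named-list summaries with summaries that return factor values, keeping the actual factor class instead of converting it to integer?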
library(data.table)
library(lubridate)
data <- data.table(date   = ymd("2019-07-07", "2018-05-04",
                                "2019-08-09", "2017-06-03"),
                   colour = factor(c("red", "blue", "green", "yellow")),
                   group  = factor(c("A", "B", "A", "B")),
                   value1 = c(5, 23, 3, 1),
                   value2 = c(3, 2, 4, 1))
summary_func <- function(x, var_name){
  setNames(list(mean(x), sd(x)),
           paste0(var_name, "_", c("mean", "sd")))
}
data[, c(summary_func(value1, var_name = "val1"),
         summary_func(value2, var_name = "val2"),
         first_colour = colour[1]),
     by = group]
Result:
group val1_mean val1_sd val2_mean val2_sd first_colour
1: A 4 1.414214 3.5 0.7071068 3
2: B 12 15.556349 1.5 0.7071068 1
I would like the result to be:
group val1_mean val1_sd val2_mean val2_sd first_colour
1: A 4 1.414214 3.5 0.7071068 red
2: B 12 15.556349 1.5 0.7071068 blue
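I have had some success with the approaches below, but they are quite inelegant and, I suspect, not very general. I am hoping there is a more concise `data.table` way to solve this.
Things I have tried: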
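- I found that I can get the result by wrapping the list summaries in `list()` and giving them a very specific naming convention (`"SF"`). You then need to sort the columns into list columns and non-list columns, and use `cbind`, `lapply`, and `rbindlist` to coerce the lists into data.tables. Finally, you have to rename the resulting columns.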
tmp1 <- data[, .(first_colour = colour[1],
                 SF1 = list(summary_func(value1, "val1")),
                 SF2 = list(summary_func(value2, "val2"))),
             by = group]
list_cols <- names(which(sapply(tmp1, is.list)))
grp_cols <- names(tmp1)[!names(tmp1) %in% list_cols]
tmp2 <- tmp1[, do.call(cbind,
                       c(lapply(mget(list_cols), rbindlist),
                         deparse.level = 0)), by = grp_cols]
# backslashes must be doubled inside an R string literal
setnames(tmp2, gsub("^SF\\d\\.", "", names(tmp2)))
tmp2
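- I found that a replacement for `c()` gives the desired behaviour. You need to unpack the arguments in a particular way to preserve both types and names. I suspect, though, that this is much slower than `c()` and `list()`, since both of those are primitive functions backed by compiled C code.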
c_alt <- function(...){
  blah <- list(...)
  result <- list()
  for(i in seq_along(blah)){
    len <- length(blah[[i]])
    # append each element on its own so its class (e.g. factor) is kept
    for(j in seq_len(len)){
      result[[length(result) + 1]] <- blah[[i]][[j]]
    }
    if(len > 1){
      # name exactly the len elements that were just appended
      names(result)[(length(result) - len + 1):length(result)] <- names(blah[[i]])
    }else{
      names(result)[[length(result)]] <- names(blah)[[i]]
    }
  }
  result
}
data[, c_alt(summary_func(value1, var_name = "val1"),
             summary_func(value2, var_name = "val2"),
             first_colour = colour[1]),
     by = group]
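The same unpacking idea can probably be written more compactly by wrapping every non-list argument in `list()` and handing the pieces to `c()`, which concatenates lists without altering their elements. The following is only a minimal sketch in base R: `c_wrap` is a hypothetical name (not part of `data.table`), and it assumes that list arguments carry their own names while bare values are passed as named arguments.

```r
# Hypothetical helper: wrap non-list arguments in list() so that c()
# only ever concatenates lists and leaves each element's class untouched.
c_wrap <- function(...) {
  args <- lapply(list(...), function(x) if (is.list(x)) x else list(x))
  do.call(c, args)
}

res <- c_wrap(list(val1_mean = 4, val1_sd = 1.5),
              first_colour = factor("red", levels = c("blue", "red")))
str(res)
```

In `j` it would be called the same way as `c_alt`, and because the work is delegated to `c()` it avoids the explicit R-level loops.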
One approach is to convert `colour` to character, extract the first value, and turn it back into a `factor` afterwards if needed.
library(data.table)
data[, c(summary_func(value1, var_name = "val1"),
         summary_func(value2, var_name = "val2"),
         first_colour = as.character(colour[1])),
     by = group][, first_colour := factor(first_colour)][]
# group val1_mean val1_sd val2_mean val2_sd first_colour
#1: A 4 1.414214 3.5 0.7071068 red
#2: B 12 15.556349 1.5 0.7071068 blue
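One thing to be aware of with this round trip is that `factor()` applied to a character vector only creates levels for the values that are present, so the rebuilt column no longer has the full level set of `colour`. A small base-R sketch of the difference, assuming the original levels are still available to pass in (the variable names here are illustrative only):

```r
colour <- factor(c("red", "blue", "green", "yellow"))
first_chr <- as.character(colour[1])

# Re-factoring the character value keeps only the levels that are present.
refactored <- factor(first_chr)
levels(refactored)

# Passing the original levels restores the full level set.
restored <- factor(first_chr, levels = levels(colour))
levels(restored)
```

If the full level set matters downstream, the final step could use `factor(first_colour, levels = levels(data$colour))` instead.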
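If you are happy to use the `dplyr` and `tidyr` packages, the following gives the desired output and extends easily if you want to add more summary functions: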
library(dplyr)
library(tidyr)
data %>%
  pivot_longer(-c(date, colour, group), names_to = "column", values_to = "val") %>%
  mutate(column = if_else(column == "value1", "val1", "val2")) %>%
  group_by(group, column) %>%
  summarise(mean = mean(val), sd = sd(val), colour = colour[1]) %>%
  pivot_wider(id_cols = c(group, colour), names_from = column, values_from = c(mean, sd)) %>%
  relocate(colour, .after = last_col())
`c` does not allow mixed types. From its documentation:
All arguments are coerced to a common type which is the type of the returned value
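To make the coercion concrete, here is a small base-R sketch (the values are made up for illustration). When a list is combined with a bare factor, the common type is list, so the factor is split into list elements and its attributes are dropped, leaving only the integer code. Wrapping the factor in `list()` first means `c()` only concatenates lists, and the element is carried over unchanged.

```r
f <- factor("red", levels = c("blue", "green", "red", "yellow"))
stats <- list(val1_mean = 4, val1_sd = 1.5)

# A bare factor loses its class: only the integer code survives.
dropped <- c(stats, first_colour = f)
class(dropped$first_colour)

# Wrapped in list(), the factor is kept as a factor.
kept <- c(stats, list(first_colour = f))
class(kept$first_colour)
```

On that basis, the original `j` expression would likely only need `list(first_colour = colour[1])` as its last argument to `c()`. This is a sketch and has not been checked against every `data.table` version.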
You can use `data.frame` together with `cbind`, which preserves the column types:
summary_func <- function(x, var_name){
  setNames(data.frame(mean(x), sd(x)),
           paste0(var_name, "_", c("mean", "sd")))
}
data[, cbind(summary_func(value1, var_name = "val1"),
             summary_func(value2, var_name = "val2"),
             data.frame(first_colour = colour[1])),
     by = group]
group val1_mean val1_sd val2_mean val2_sd first_colour
1: A 4 1.414214 3.5 0.7071068 red
2: B 12 15.556349 1.5 0.7071068 blue