R中嵌套巨大列表的高效汇总统计
Efficient summary statistics of nested huge list in R
我正在 运行 进行模拟研究,我的结果存储在嵌套列表结构中。列表的第一层代表模型生成的不同超参数。第二层是相同模型的复制次数(换种子)。
在下面的示例中,我列出了一个由两个超参数(hyperpar1
和 hyperpar2
)控制的模型的输出,其中两个超参数都可以采用 2 个不同的值,从而导致 4 种不同的组合结果模型。此外,4 种可能的组合中的每一种都是 运行 两次(不同的种子),导致八种可能的组合(如下面 str(res, max = 2)
所示)。最后,从模型的每个可能迭代中恢复了两个性能指标(metric1
和 metric2
)。
我的问题是,在我的真实数据中,迭代次数(列表的第二级)很大(高达 10000),并且在某些情况下超参数数量的全阶乘高达 2000。因此退市过程变得相当缓慢
下面我列出了我当前的程序和我想要的输出,但同样,它相对较慢。特别是,有一个部分,当我取消列出所有内容时,我把它放在一个大的 data.frame 中,这需要很长时间,但我还没有以更快的方式解决这个问题。
res <-list(
list(list(modeltype = "tree", time_iter = structure(0.7099, class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 0.5, metric1 = 0.4847, metric2 = 0.2576 ),
list(modeltype = "tree", time_iter = structure(0.058 , class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 0.5, metric1 = 0.4013, metric2 = 0.2569 )),
list(list(modeltype = "tree", time_iter = structure(0.046 , class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 0.5, metric1 = 0.4755, metric2 = 0.2988 ),
list(modeltype = "tree", time_iter = structure(0.0474, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 0.5, metric1 = 0.2413, metric2 = 0.2147 )),
list(list(modeltype = "tree", time_iter = structure(0.0502, class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 1 , metric1 = 0.7131, metric2 = 0.5024 ),
list(modeltype = "tree", time_iter = structure(2.9419, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 1 , metric1 = 0.4254, metric2 = 0.2824 )),
list(list(modeltype = "tree", time_iter = structure(0.041 , class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 1 , metric1 = 0.6709, metric2 = 0.4092 ),
list(modeltype = "tree", time_iter = structure(0.0396, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 1 , metric1 = 0.4585, metric2 = 0.4115 )))
hyperpar1 <- c(0.5 , 0.8 )
hyperpar2 <- c(0.5 , 1 )
expand.grid(hyperpar1 = hyperpar1, hyperpar2 = hyperpar2)
# hyperpar1 hyperpar2
# 1 0.5 0.5
# 2 0.8 0.5
# 3 0.5 1.0
# 4 0.8 1.0
#List structure:
#The 4 elements represent the 4 combinations of the hyperparams
#Inside each of the 4 combinations of the hyperparams, 2 lists represent the 2 simulations (with different seeds)
str(res, max = 1)
#Finally, inside each of the final level (level=3) there is a list of 10 objects that are the results of each simulation
str(res, max = 2)
# List of 4
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
#e.g Fist iteration of first model
t(res[[1]][[1]])
# modeltype time_iter seed nobs hyperpar1 hyperpar2 metric1 metric2
# [1,] "tree" 0.7099 1 75 0.5 0.5 0.4847 0.2576
正在解析列表
在下面的代码中,我取消嵌套列表并将所有内容放入 data.frame。
#Unlist the nested structure of the list `res`
all_in_list <- lapply(1:length(res), function(i) {
unlisting <- unlist(res[i],recursive = FALSE)
to_df <- do.call(rbind, lapply(unlisting, as.data.frame))
return(to_df)})
#Here is where averything get really really slow when the list is huge
all_in_df <- do.call(rbind, lapply(all_in_list, as.data.frame))
# modeltype time_iter seed nobs hyperpar1 hyperpar2 metric1 metric2
# 1 tree 0.7099 secs 1 75 0.5 0.5 0.4847 0.2576
# 2 tree 0.0580 secs 2 75 0.5 0.5 0.4013 0.2569
# 3 tree 0.0460 secs 1 75 0.8 0.5 0.4755 0.2988
# 4 tree 0.0474 secs 2 75 0.8 0.5 0.2413 0.2147
# 5 tree 0.0502 secs 1 75 0.5 1.0 0.7131 0.5024
# 6 tree 2.9419 secs 2 75 0.5 1.0 0.4254 0.2824
# 7 tree 0.0410 secs 1 75 0.8 1.0 0.6709 0.4092
# 8 tree 0.0396 secs 2 75 0.8 1.0 0.4585 0.4115
输出汇总统计
在下文中,我恢复了性能指标的均值和标准差,将它们的子集添加到每个 colname
以便稍后识别(绘图目的)。
#auxiliar function to compute the metrics at aggregate level.
foo_summary <- function(df ,
metrics =c("time_iter","metric1", "metric2") ,
by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
summary_function = mean)
{
#compute the aggregate metrics
out <- as.data.frame(aggregate(
x = df[metrics],
by = df[by],
FUN = summary_function,
na.rm = TRUE))
#rename conviniently the metric computed
oldnames <- colnames(out[metrics])
names(out)[match(oldnames,names(out))] <- paste(colnames(out[metrics]),
as.character(substitute(summary_function)),
sep = "_")
return(out)
}
df_mean <- foo_summary(df = all_in_df,
metrics =c("time_iter","metric1", "metric2"),
by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
summary_function = mean)
df_sd <- foo_summary(df = all_in_df,
metrics =c("time_iter","metric1", "metric2"),
by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
summary_function = sd)
final_df <- merge(df_mean,df_sd )
终于得到想要的输出了。
# nobs hyperpar1 hyperpar2 modeltype time_iter_mean metric1_mean metric2_mean time_iter_sd metric1_sd metric2_sd
# 1 75 0.5 0.5 tree 0.38395 secs 0.44300 0.25725 0.4609629107 0.05897271 0.0004949747
# 2 75 0.5 1.0 tree 1.49605 secs 0.56925 0.39240 2.0447406792 0.20343462 0.1555634919
# 3 75 0.8 0.5 tree 0.04670 secs 0.35840 0.25675 0.0009899495 0.16560441 0.0594676803
# 4 75 0.8 1.0 tree 0.04030 secs 0.56470 0.41035 0.0009899495 0.15018948 0.0016263456
你可以试试 data.table
:
library(data.table)
tmp = data.table(res)
tmp = tmp[, t(res[1]), by=1:nrow(tmp)]
tmp = tmp[, V1[[1]], by=1:nrow(tmp)]
g = function(x) list(mean = mean(x), sd = sd(x))
tmp[, unlist(lapply(.SD, g), recursive=FALSE)
, .SDcols=hyperpar1:metric2,
, by=.(nobs, hyperpar1, hyperpar2, modeltype)]
#> nobs hyperpar1 hyperpar2 modeltype hyperpar1.mean hyperpar1.sd
#> 1: 75 0.5 0.5 tree 0.5 0
#> 2: 75 0.8 0.5 tree 0.8 0
#> 3: 75 0.5 1.0 tree 0.5 0
#> 4: 75 0.8 1.0 tree 0.8 0
#> hyperpar2.mean hyperpar2.sd metric1.mean metric1.sd metric2.mean
#> 1: 0.5 0 0.44300 0.05897271 0.25725
#> 2: 0.5 0 0.35840 0.16560441 0.25675
#> 3: 1.0 0 0.56925 0.20343462 0.39240
#> 4: 1.0 0 0.56470 0.15018948 0.41035
#> metric2.sd
#> 1: 0.0004949747
#> 2: 0.0594676803
#> 3: 0.1555634919
#> 4: 0.0016263456
此代码使用列表列的连续取消嵌套,这是我在本笔记本中描述的一种策略:http://arelbundock.com/posts/datatable_nesting/
使用dplyr::bind_rows()
,直接将嵌套列表解嵌为data.frame,然后直接计算汇总统计:
library(dplyr)
bind_rows(res) %>%
group_by(modeltype, nobs, hyperpar1, hyperpar2) %>%
summarize(across(everything(), list(mean = mean, sd = sd)), .groups = "drop")
#> # A tibble: 4 x 12
#> modeltype nobs hyperpar1 hyperpar2 time_iter_mean time_iter_sd seed_mean
#> <chr> <dbl> <dbl> <dbl> <drtn> <dbl> <dbl>
#> 1 tree 75 0.5 0.5 0.38395 secs 0.461 1.5
#> 2 tree 75 0.5 1 1.49605 secs 2.04 1.5
#> 3 tree 75 0.8 0.5 0.04670 secs 0.000990 1.5
#> 4 tree 75 0.8 1 0.04030 secs 0.000990 1.5
#> # … with 5 more variables: seed_sd <dbl>, metric1_mean <dbl>, metric1_sd <dbl>,
#> # metric2_mean <dbl>, metric2_sd <dbl>
我正在 运行 进行模拟研究,我的结果存储在嵌套列表结构中。列表的第一层代表模型生成的不同超参数。第二层是相同模型的复制次数(换种子)。
在下面的示例中,我列出了一个由两个超参数(hyperpar1
和 hyperpar2
)控制的模型的输出,其中两个超参数都可以采用 2 个不同的值,从而导致 4 种不同的组合结果模型。此外,4 种可能的组合中的每一种都是 运行 两次(不同的种子),导致八种可能的组合(如下面 str(res, max = 2)
所示)。最后,从模型的每个可能迭代中恢复了两个性能指标(metric1
和 metric2
)。
我的问题是,在我的真实数据中,迭代次数(列表的第二级)很大(高达 10000),并且在某些情况下超参数数量的全阶乘高达 2000。因此退市过程变得相当缓慢
下面我列出了我当前的程序和我想要的输出,但同样,它相对较慢。特别是,有一个部分,当我取消列出所有内容时,我把它放在一个大的 data.frame 中,这需要很长时间,但我还没有以更快的方式解决这个问题。
res <-list(
list(list(modeltype = "tree", time_iter = structure(0.7099, class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 0.5, metric1 = 0.4847, metric2 = 0.2576 ),
list(modeltype = "tree", time_iter = structure(0.058 , class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 0.5, metric1 = 0.4013, metric2 = 0.2569 )),
list(list(modeltype = "tree", time_iter = structure(0.046 , class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 0.5, metric1 = 0.4755, metric2 = 0.2988 ),
list(modeltype = "tree", time_iter = structure(0.0474, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 0.5, metric1 = 0.2413, metric2 = 0.2147 )),
list(list(modeltype = "tree", time_iter = structure(0.0502, class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 1 , metric1 = 0.7131, metric2 = 0.5024 ),
list(modeltype = "tree", time_iter = structure(2.9419, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 1 , metric1 = 0.4254, metric2 = 0.2824 )),
list(list(modeltype = "tree", time_iter = structure(0.041 , class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 1 , metric1 = 0.6709, metric2 = 0.4092 ),
list(modeltype = "tree", time_iter = structure(0.0396, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 1 , metric1 = 0.4585, metric2 = 0.4115 )))
hyperpar1 <- c(0.5 , 0.8 )
hyperpar2 <- c(0.5 , 1 )
expand.grid(hyperpar1 = hyperpar1, hyperpar2 = hyperpar2)
# hyperpar1 hyperpar2
# 1 0.5 0.5
# 2 0.8 0.5
# 3 0.5 1.0
# 4 0.8 1.0
#List structure:
#The 4 elements represent the 4 combinations of the hyperparams
#Inside each of the 4 combinations of the hyperparams, 2 lists represent the 2 simulations (with different seeds)
str(res, max = 1)
#Finally, inside each of the final level (level=3) there is a list of 10 objects that are the results of each simulation
str(res, max = 2)
# List of 4
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
#e.g Fist iteration of first model
t(res[[1]][[1]])
# modeltype time_iter seed nobs hyperpar1 hyperpar2 metric1 metric2
# [1,] "tree" 0.7099 1 75 0.5 0.5 0.4847 0.2576
正在解析列表
在下面的代码中,我取消嵌套列表并将所有内容放入 data.frame。
#Unlist the nested structure of the list `res`
all_in_list <- lapply(1:length(res), function(i) {
unlisting <- unlist(res[i],recursive = FALSE)
to_df <- do.call(rbind, lapply(unlisting, as.data.frame))
return(to_df)})
#Here is where averything get really really slow when the list is huge
all_in_df <- do.call(rbind, lapply(all_in_list, as.data.frame))
# modeltype time_iter seed nobs hyperpar1 hyperpar2 metric1 metric2
# 1 tree 0.7099 secs 1 75 0.5 0.5 0.4847 0.2576
# 2 tree 0.0580 secs 2 75 0.5 0.5 0.4013 0.2569
# 3 tree 0.0460 secs 1 75 0.8 0.5 0.4755 0.2988
# 4 tree 0.0474 secs 2 75 0.8 0.5 0.2413 0.2147
# 5 tree 0.0502 secs 1 75 0.5 1.0 0.7131 0.5024
# 6 tree 2.9419 secs 2 75 0.5 1.0 0.4254 0.2824
# 7 tree 0.0410 secs 1 75 0.8 1.0 0.6709 0.4092
# 8 tree 0.0396 secs 2 75 0.8 1.0 0.4585 0.4115
输出汇总统计
在下文中,我恢复了性能指标的均值和标准差,将它们的子集添加到每个 colname
以便稍后识别(绘图目的)。
#auxiliar function to compute the metrics at aggregate level.
foo_summary <- function(df ,
metrics =c("time_iter","metric1", "metric2") ,
by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
summary_function = mean)
{
#compute the aggregate metrics
out <- as.data.frame(aggregate(
x = df[metrics],
by = df[by],
FUN = summary_function,
na.rm = TRUE))
#rename conviniently the metric computed
oldnames <- colnames(out[metrics])
names(out)[match(oldnames,names(out))] <- paste(colnames(out[metrics]),
as.character(substitute(summary_function)),
sep = "_")
return(out)
}
df_mean <- foo_summary(df = all_in_df,
metrics =c("time_iter","metric1", "metric2"),
by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
summary_function = mean)
df_sd <- foo_summary(df = all_in_df,
metrics =c("time_iter","metric1", "metric2"),
by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
summary_function = sd)
final_df <- merge(df_mean,df_sd )
终于得到想要的输出了。
# nobs hyperpar1 hyperpar2 modeltype time_iter_mean metric1_mean metric2_mean time_iter_sd metric1_sd metric2_sd
# 1 75 0.5 0.5 tree 0.38395 secs 0.44300 0.25725 0.4609629107 0.05897271 0.0004949747
# 2 75 0.5 1.0 tree 1.49605 secs 0.56925 0.39240 2.0447406792 0.20343462 0.1555634919
# 3 75 0.8 0.5 tree 0.04670 secs 0.35840 0.25675 0.0009899495 0.16560441 0.0594676803
# 4 75 0.8 1.0 tree 0.04030 secs 0.56470 0.41035 0.0009899495 0.15018948 0.0016263456
你可以试试 data.table
:
library(data.table)
tmp = data.table(res)
tmp = tmp[, t(res[1]), by=1:nrow(tmp)]
tmp = tmp[, V1[[1]], by=1:nrow(tmp)]
g = function(x) list(mean = mean(x), sd = sd(x))
tmp[, unlist(lapply(.SD, g), recursive=FALSE)
, .SDcols=hyperpar1:metric2,
, by=.(nobs, hyperpar1, hyperpar2, modeltype)]
#> nobs hyperpar1 hyperpar2 modeltype hyperpar1.mean hyperpar1.sd
#> 1: 75 0.5 0.5 tree 0.5 0
#> 2: 75 0.8 0.5 tree 0.8 0
#> 3: 75 0.5 1.0 tree 0.5 0
#> 4: 75 0.8 1.0 tree 0.8 0
#> hyperpar2.mean hyperpar2.sd metric1.mean metric1.sd metric2.mean
#> 1: 0.5 0 0.44300 0.05897271 0.25725
#> 2: 0.5 0 0.35840 0.16560441 0.25675
#> 3: 1.0 0 0.56925 0.20343462 0.39240
#> 4: 1.0 0 0.56470 0.15018948 0.41035
#> metric2.sd
#> 1: 0.0004949747
#> 2: 0.0594676803
#> 3: 0.1555634919
#> 4: 0.0016263456
此代码使用列表列的连续取消嵌套,这是我在本笔记本中描述的一种策略:http://arelbundock.com/posts/datatable_nesting/
使用dplyr::bind_rows()
,直接将嵌套列表解嵌为data.frame,然后直接计算汇总统计:
library(dplyr)
bind_rows(res) %>%
group_by(modeltype, nobs, hyperpar1, hyperpar2) %>%
summarize(across(everything(), list(mean = mean, sd = sd)), .groups = "drop")
#> # A tibble: 4 x 12
#> modeltype nobs hyperpar1 hyperpar2 time_iter_mean time_iter_sd seed_mean
#> <chr> <dbl> <dbl> <dbl> <drtn> <dbl> <dbl>
#> 1 tree 75 0.5 0.5 0.38395 secs 0.461 1.5
#> 2 tree 75 0.5 1 1.49605 secs 2.04 1.5
#> 3 tree 75 0.8 0.5 0.04670 secs 0.000990 1.5
#> 4 tree 75 0.8 1 0.04030 secs 0.000990 1.5
#> # … with 5 more variables: seed_sd <dbl>, metric1_mean <dbl>, metric1_sd <dbl>,
#> # metric2_mean <dbl>, metric2_sd <dbl>