Extract columns from big data table to small data tables and save in a list

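I get a data table from an external server (time series of different products, indexed by date). It can have at most the following columns (the date is always the first column; any of the other columns may or may not be present, so there might be only two additional columns, or some other subset):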

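library(data.table)
library(dplyr)  # provides %>%, mutate(), relocate() and mutate_if() used further down

set.seed(123)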
dt.data <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 365),
                      'DEB Cal-2019' = rnorm(365, 2, 1), 'DEB Cal-2021' = rnorm(365, 2, 1),
                      'DEB Cal-2022' = rnorm(365, 2, 1), 'DEB Cal-2023' = rnorm(365, 2, 1),
                      'ATB Cal-2019' = rnorm(365, 2, 1), 'ATB Cal-2021' = rnorm(365, 2, 1),
                      'ATB Cal-2022' = rnorm(365, 2, 1), 'ATB Cal-2023' = rnorm(365, 2, 1),
                      'TTF Cal-2019' = rnorm(365, 2, 1), 'TTF Cal-2021' = rnorm(365, 2, 1),
                      'TTF Cal-2022' = rnorm(365, 2, 1), 'TTF Cal-2023' = rnorm(365, 2, 1),
                      'NCG Cal-2019' = rnorm(365, 2, 1), 'NCG Cal-2021' = rnorm(365, 2, 1),
                      'NCG Cal-2022' = rnorm(365, 2, 1), 'NCG Cal-2023' = rnorm(365, 2, 1),
                      'AUTVTP Cal-2019' = rnorm(365, 2, 1), 'AUTVTP Cal-2021' = rnorm(365, 2, 1),
                      'AUTVTP Cal-2022' = rnorm(365, 2, 1), 'AUTVTP Cal-2023' = rnorm(365, 2, 1),
                      'ATW Cal-2019' = rnorm(365, 2, 1), 'ATW Cal-2021' = rnorm(365, 2, 1),
                      'ATW Cal-2022' = rnorm(365, 2, 1), 'ATW Cal-2023' = rnorm(365, 2, 1),
                      'BRN Cal-2019' = rnorm(365, 2, 1), 'BRN Cal-2021' = rnorm(365, 2, 1),
                      'BRN Cal-2022' = rnorm(365, 2, 1), 'BRN Cal-2023' = rnorm(365, 2, 1),
                      'FEUA MDEC1' = rnorm(365, 2, 1),
                      check.names = FALSE)

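Now I want to extract every column that occurs, together with the date column, into its own data table. Ideally, all the extracted data tables are then added to a list. I know I should do this somehow with a for loop, but I can't work it out.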

Once I have a separate data table for each product, I have to do the following with each of them (the example data table here is for AUTVTP Cal-2022):

DT <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 365),
                 'AUTVTP Cal-2022' = rnorm(365, 2, 1), check.names = FALSE)


DT <- DT %>%
  mutate(month = format(date, '%b'), 
         date = format(date, '%d')) %>%
  tidyr::pivot_wider(names_from = date, values_from = 'AUTVTP Cal-2022') %>%
  relocate(`01`, .after = month)

## Calculate monthly, quarterly and yearly mean values: ##
DT <- setDT(DT)[, monthAvg := rowMeans(.SD, na.rm = TRUE), .SDcols = -1]
DT <- DT[, quartAvg := mean(monthAvg), ceiling(seq_len(nrow(DT))/3)]
DT <- DT[, yearAvg := mean(monthAvg), ceiling(seq_len(nrow(DT))/12)]

## Round all values of the data table to 2 digits: ##
DT <- DT %>% mutate_if(is.numeric, round, 2)

How can I do this?

Reshape to long format, then split.

split(
  melt(dt.data, id.vars = "date"),
  by = "variable", keep.by = FALSE)

You can then use lapply to iterate over the list and do whatever your tidyverse code does.
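As a rough sketch of that lapply step (not the asker's full pipeline): the names `dt.small`, `dt.list` and `monthly` are made up for illustration, and the per-product step is reduced to a monthly mean. After the split, each list element holds the columns `date` and `value`.

```r
library(data.table)

# Toy stand-in for dt.data (hypothetical, much smaller than the real table)
set.seed(1)
dt.small <- data.table(date = seq(as.Date("2020-01-01"), by = "1 day", length.out = 60),
                       "DEB Cal-2021" = rnorm(60, 2, 1),
                       "TTF Cal-2021" = rnorm(60, 2, 1),
                       check.names = FALSE)

# Long format, then one data.table per product; each has columns date and value
dt.list <- split(melt(dt.small, id.vars = "date"),
                 by = "variable", keep.by = FALSE)

# Apply the same per-product step to every element, here a rounded monthly mean
monthly <- lapply(dt.list, function(d) {
  d[, .(monthAvg = round(mean(value), 2)), by = .(month = format(date, "%b"))]
})

names(monthly)
```

The list is named after the original column names, so a single product can be looked up with `monthly[["TTF Cal-2021"]]`.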

However, you usually shouldn't split a data.table. It is inefficient and usually not necessary.

Edit:

I suggest you forget about splitting. Wrap your code in a function like this:

foo <- function(DT, colname) {
  DT <- DT[, c("date", colname), with = FALSE]
  DT <- DT %>%
    mutate(month = format(date, '%b'), 
           date = format(date, '%d')) %>%
    tidyr::pivot_wider(names_from = date, values_from = all_of(colname)) %>%  # all_of(): colname is a character variable, not a column
    relocate(`01`, .after = month)
  
  ## Calculate monthly, quarterly and yearly mean values: ##
  DT <- setDT(DT)[, monthAvg := rowMeans(.SD, na.rm = TRUE), .SDcols = -1]
  DT <- DT[, quartAvg := mean(monthAvg), ceiling(seq_len(nrow(DT))/3)]
  DT <- DT[, yearAvg := mean(monthAvg), ceiling(seq_len(nrow(DT))/12)]
  
  ## Round all values of the data table to 2 digits: ##
  DT %>% mutate_if(is.numeric, round, 2)
}

Then, when you need the table for a specific column in your Shiny app, you can simply call this function:

foo(dt.data, 'DEB Cal-2019')

If you insist on pre-computing the list:

lapply(names(dt.data)[names(dt.data) != "date"], 
       foo, DT = dt.data)

Use split.default to create a list of data frames, then cbind the first (date) column onto each list element.

lapply(split.default(dt.data[, -1], names(dt.data[, -1])), cbind, dt.data[, 1])