Unexpectedly-high memory usage from data.table::frollmean()

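I have a data table with 20 million rows and 20 columns, to which I apply vectorized operations that return lists. These lists are themselves assigned by reference to additional columns in the data table.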

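Memory usage increases predictably and modestly throughout these operations, until I apply the (presumably highly efficient) frollmean() function, with an adaptive window, to a column that contains lists of length 10. Running even the much smaller RepRex below in R 4.1.2 on Windows 10 x64, with package data.table 1.14.2, memory usage spikes by ~17GB while frollmean() executes and then drops back, as seen in Windows' Task Manager (Performance tab) and in the Rprof memory profiling report.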

I understand that frollmean() uses parallelism where possible, so I set setDTthreads(threads = 1L) to make sure the memory spike is not an artifact of copying the data table for additional cores.

My question: why does frollmean() use so much memory relative to the other operations, and can I avoid it?

RepRex

library(data.table)
set.seed(1)
setDTthreads(threads = 1L)

obs   <- 10^3 # Number of rows in the data table
len   <- 10   # Length of each list to be stored in columns
width <- c(seq.int(2), rep(2, len - 2)) # Rolling mean window

# Generate representative data
DT <- data.table(
  V1 = sample(x =  1:10, size = obs, replace = TRUE),
  V2 = sample(x = 11:20, size = obs, replace = TRUE),
  V3 = sample(x = 21:30, size = obs, replace = TRUE)
)

# Apply representative vectorized operations, assigning by reference
DT[, V4 := Map(seq, from = V1, to = V2, length.out = len)] # This is a list
DT[, V5 := Map("*", V4, V3)] # This is a list
DT[, V6 := Map("*", V4, V5)] # This is a list

# Profile the memory usage
Rprof(memory.profiling = TRUE)

# Rolling mean
DT[, V7 := frollmean(x = V6, n = width, adaptive = TRUE)]

# Report the memory usage
Rprof(NULL)
summaryRprof(memory = "both")
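
As a cross-check on the Rprof figures, peak memory can also be read from base R's gc(). The sketch below is only an illustration: peak_mb() is a hypothetical helper (not part of base R or data.table), and it only sees memory managed by R's own allocator, so allocations made outside R's heap may not show up.

```r
# Hypothetical helper: approximate peak growth of R's heap (in Mb)
# while evaluating an expression.
peak_mb <- function(expr) {
  before <- gc(reset = TRUE)       # collect garbage and reset the "max used" counters
  force(expr)                      # evaluate the lazily passed expression
  after <- gc()
  baseline <- sum(before[, 2])     # column 2: Mb currently used (Ncells + Vcells)
  peak <- sum(after[, ncol(after)])  # last column: max Mb used since the reset
  peak - baseline
}

# Allocating 5e6 doubles needs roughly 38 Mb (5e6 * 8 bytes)
peak_mb(numeric(5e6))
```

With the RepRex loaded, the same helper could wrap the rolling mean, for example peak_mb(DT[, V7 := frollmean(x = V6, n = width, adaptive = TRUE)]).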

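Consider avoiding lists embedded in columns. Recall that the data.frame and data.table classes are extensions of the list type, so typeof(DT) returns "list". Therefore, consider running frollmean across atomic vector columns rather than over nested lists: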

obs   <- 10^3 # Number of rows in the data table
len   <- 10   # Length of each list to be stored in columns
width <- c(seq.int(2), rep(2, len - 2)) # Rolling mean window

# CALCULATE SEQ VECTOR (USING mapply, THE PARENT OF ITS WRAPPER Map)
set.seed(1)
V1 = sample(x =  1:10, size = obs, replace = TRUE)
V2 = sample(x = 11:20, size = obs, replace = TRUE)
V3 = sample(x = 21:30, size = obs, replace = TRUE)
seq_vec <- as.vector(mapply(seq, from = V1, to = V2, length.out = len))

# BUILD DATA.TABLE USING SEQ VECTOR FOR FLAT ATOMIC VECTOR COLUMNS
DT_ <- data.table(
  WIDTH = rep(width, obs),
  V1 = rep(V1, each=len),
  V2 = rep(V2, each=len),
  V3 = rep(V3, each=len),
  V4 = seq_vec
)[, V5 := V4*V3][,V6 := V4*V5]

DT_
       WIDTH V1 V2 V3       V4       V5       V6
    1:     1  9 20 29  9.00000 261.0000 2349.000
    2:     2  9 20 29 10.22222 296.4444 3030.321
    3:     2  9 20 29 11.44444 331.8889 3798.284
    4:     2  9 20 29 12.66667 367.3333 4652.889
    5:     2  9 20 29 13.88889 402.7778 5594.136
   ---                                          
 9996:     2  5 16 26 11.11111 288.8889 3209.877
 9997:     2  5 16 26 12.33333 320.6667 3954.889
 9998:     2  5 16 26 13.55556 352.4444 4777.580
 9999:     2  5 16 26 14.77778 384.2222 5677.951
10000:     2  5 16 26 16.00000 416.0000 6656.000
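
To illustrate the earlier point that a data.table is itself a list, and how a list column differs from an atomic one, here is a minimal sketch on a toy table. The names toy and flat and the toy values are made up for the example.

```r
library(data.table)

# Toy table with one atomic column and one list column
toy <- data.table(a = 1:3)
toy[, b := list(list(1:2, 3:4, 5:6))]   # wrap in list() to assign a list column

typeof(toy)        # the table itself is stored as a list
is.atomic(toy$a)   # plain integer vector
is.list(toy$b)     # each cell holds its own vector

# Flatten to long format: one row per element of each nested vector
flat <- toy[, .(a = rep(a, lengths(b)), b = unlist(b))]
flat
```

The long layout trades extra rows for columns that are plain vectors, which is the same idea used to build DT_ above.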

Then compute frollmean grouped by V1 and V2:

DT_[, V7 := frollmean(x = V6, n = WIDTH, adaptive = TRUE),  by=.(V1, V2)]
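
Note that (V1, V2) pairs can repeat, so a group may hold several 10-element blocks. The windows still stay inside each block here because WIDTH is 1 at every block start. If the window widths were different, an explicit key per original row would be safer. A minimal sketch, assuming a hypothetical ID column and random stand-in values for V6:

```r
library(data.table)
setDTthreads(threads = 1L)
set.seed(1)

obs   <- 5
len   <- 10
width <- c(seq.int(2), rep(2, len - 2))

DT_id <- data.table(
  ID    = rep(seq_len(obs), each = len),  # hypothetical key: one ID per original row
  WIDTH = rep(width, obs),
  V6    = runif(obs * len)                # stand-in values, not the original V6
)

# Rolling mean computed separately for each original row
DT_id[, V7 := frollmean(x = V6, n = WIDTH, adaptive = TRUE), by = ID]

# Manual check: width 1 returns the value itself, width 2 averages it with its predecessor
DT_id[, V7_check := ifelse(WIDTH == 1, V6, (V6 + shift(V6)) / 2), by = ID]
all.equal(DT_id$V7, DT_id$V7_check)
```

Grouping by a unique key creates many small groups, which may cost some speed compared with grouping by V1 and V2.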

The output should be equivalent to the nested list-valued columns:

identical(DT$V4[[1]], DT_$V4[1:len])
[1] TRUE
identical(DT$V5[[1]], DT_$V5[1:len])
[1] TRUE
identical(DT$V6[[1]], DT_$V6[1:len])
[1] TRUE
identical(DT$V7[[1]], DT_$V7[1:len])
[1] TRUE
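
The checks above compare only the first row. A sketch of a whole-column comparison follows, rebuilt on a smaller obs so it stands alone. It uses all.equal() rather than identical() as a precaution, on the assumption that results for later blocks within a group could differ in the last bits of floating-point precision.

```r
library(data.table)
setDTthreads(threads = 1L)
set.seed(1)

obs   <- 100
len   <- 10
width <- c(seq.int(2), rep(2, len - 2))

V1 <- sample(x =  1:10, size = obs, replace = TRUE)
V2 <- sample(x = 11:20, size = obs, replace = TRUE)
V3 <- sample(x = 21:30, size = obs, replace = TRUE)

# Nested list-column version
DT <- data.table(V1 = V1, V2 = V2, V3 = V3)
DT[, V4 := Map(seq, from = V1, to = V2, length.out = len)]
DT[, V5 := Map("*", V4, V3)]
DT[, V6 := Map("*", V4, V5)]
DT[, V7 := frollmean(x = V6, n = width, adaptive = TRUE)]

# Flat atomic-column version
DT_ <- data.table(
  WIDTH = rep(width, obs),
  V1 = rep(V1, each = len),
  V2 = rep(V2, each = len),
  V3 = rep(V3, each = len),
  V4 = as.vector(mapply(seq, from = V1, to = V2, length.out = len))
)[, V5 := V4 * V3][, V6 := V4 * V5]
DT_[, V7 := frollmean(x = V6, n = WIDTH, adaptive = TRUE), by = .(V1, V2)]

# Compare every value, not just the first row
all.equal(unlist(DT$V7), DT_$V7)
```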

With this layout, profiling shows fewer steps and less memory than the list-column method. The runs below use obs <- 10^5.

frollmean on the nested list column (using DT)

# Profile the memory usage
Rprof(memory.profiling = TRUE)
DT[, V7 := frollmean(x = V6, n = width, adaptive = TRUE)]
# Report the memory usage
Rprof(NULL)
summaryRprof(mem="both")
$by.self
               self.time self.pct total.time total.pct mem.total
"froll"             1.30    76.47       1.30     76.47    1584.6
"FUN"               0.14     8.24       0.30     17.65     161.3
"eval"              0.12     7.06       1.46     85.88    1670.9
"vapply"            0.10     5.88       0.40     23.53     181.3
"parent.frame"      0.04     2.35       0.04      2.35      24.8

$by.total
               total.time total.pct mem.total self.time self.pct
"[.data.table"       1.70    100.00    1765.9      0.00     0.00
"["                  1.70    100.00    1765.9      0.00     0.00
"eval"               1.46     85.88    1670.9      0.12     7.06
"froll"              1.30     76.47    1584.6      1.30    76.47
"frollmean"          1.30     76.47    1584.6      0.00     0.00
"vapply"             0.40     23.53     181.3      0.10     5.88
"%chin%"             0.40     23.53     181.3      0.00     0.00
"vapply_1c"          0.40     23.53     181.3      0.00     0.00
"which"              0.40     23.53     181.3      0.00     0.00
"FUN"                0.30     17.65     161.3      0.14     8.24
"parent.frame"       0.04      2.35      24.8      0.04     2.35

$sample.interval
[1] 0.02

$sampling.time
[1] 1.7

frollmean by group on atomic vector columns (using DT_)

# Profile the memory usage
Rprof(memory.profiling = TRUE)
DT_[, V7 := frollmean(x = V6, n = WIDTH, adaptive = TRUE),  by=.(V1, V2)]
# Report the memory usage
Rprof(NULL)
summaryRprof(mem="both")
$by.self
               self.time self.pct total.time total.pct mem.total
"[.data.table"      0.02    33.33       0.06    100.00      18.7
"forderv"           0.02    33.33       0.02     33.33       0.0
"froll"             0.02    33.33       0.02     33.33      10.6

$by.total
               total.time total.pct mem.total self.time self.pct
"[.data.table"       0.06    100.00      18.7      0.02    33.33
"["                  0.06    100.00      18.7      0.00     0.00
"forderv"            0.02     33.33       0.0      0.02    33.33
"froll"              0.02     33.33      10.6      0.02    33.33
"frollmean"          0.02     33.33      10.6      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 0.06

(Interestingly, on my Linux laptop with 8 GB RAM, at 10^6 obs, the list-column approach, but not the vector-column approach, raised Error: cannot allocate vector of size 15.3 Gb.)
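
A quick back-of-the-envelope check puts that figure in perspective. The sketch below assumes R reports "Gb" in binary units (1024^3 bytes) and that every list element holds len doubles. It only does the arithmetic and does not diagnose the cause.

```r
reported_gb <- 15.3   # size from the error message
obs <- 10^6           # number of list elements (rows)
len <- 10             # doubles per list element

# Requested allocation per list element, in bytes and KiB
bytes_per_elem <- reported_gb * 1024^3 / obs
kib_per_elem   <- bytes_per_elem / 1024

# Size of the actual values being averaged, in Mb
data_mb <- obs * len * 8 / 1024^2

c(kib_per_elem = kib_per_elem, data_mb = data_mb)
```

About 16 KiB requested per list element, against 80 bytes of actual data per element, would be consistent with a fixed per-element overhead rather than a cost that scales with the values themselves. That reading is an inference from the numbers, not something confirmed here.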