Rcpp: List <-> Matrix conversion by reference? + Optimizing memory allocation when programming with matrices

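As far as I understand it, in Rcpp a Matrix is implemented as a vector with a dimension attribute, while a List is a kind of vector of different objects. So is there a trick to convert a List of vectors of equal length (i.e. a DataFrame) to a NumericMatrix (or an arma::mat, for that matter) and vice versa by reference, without copying the data column by column into the new data structure?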
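A standalone C++ sketch may help picture the two layouts. It does not use Rcpp; ColMajorMatrix, ColumnList and to_matrix are hypothetical names made up for illustration. It assumes the layout described above: a matrix is one contiguous column-major buffer plus its dimensions, while a list holds separately allocated columns, so turning one into the other means copying each column.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an R matrix: one contiguous buffer in
// column-major order plus the dimensions (the "dim attribute").
struct ColMajorMatrix {
    std::size_t nrow, ncol;
    std::vector<double> data;  // element (i, j) lives at data[i + j * nrow]

    double& at(std::size_t i, std::size_t j) {
        assert(i < nrow && j < ncol);
        return data[i + j * nrow];
    }
};

// Hypothetical stand-in for an R list / data.frame:
// every column owns its own, separately allocated buffer.
using ColumnList = std::vector<std::vector<double>>;

// The columns of a list are not adjacent in memory, so building a matrix
// from them requires copying each column into the single matrix buffer.
ColMajorMatrix to_matrix(const ColumnList& cols) {
    const std::size_t ncol = cols.size();
    const std::size_t nrow = ncol ? cols[0].size() : 0;
    ColMajorMatrix m{nrow, ncol, std::vector<double>(nrow * ncol)};
    for (std::size_t j = 0; j < ncol; ++j) {
        assert(cols[j].size() == nrow);  // all columns must have equal length
        std::copy(cols[j].begin(), cols[j].end(),
                  m.data.begin() + static_cast<std::ptrdiff_t>(j * nrow));
    }
    return m;
}
```

In this picture a by-reference conversion would require the list columns to already sit back to back in one allocation, which nothing guarantees for independently created vectors.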

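I presume that such a by-reference conversion is not possible, as it would be extremely useful for R users and I am sure I would have come across it. Under this assumption, which means that List and Matrix methods need to be implemented separately, my question becomes how to write Rcpp functions for Lists and Matrices that are equally efficient. My experience is that Lists are more memory efficient, as they allocate memory automatically as they are filled, while matrices need to be defined with all of their memory allocated beforehand. Consider the example below: I have written two versions of a grouped sum, one for matrices and one for lists / data.frames: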

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix gsumm(NumericMatrix x, int ng = 0, IntegerVector g = 0, bool fill = false) {
  int n = x.nrow();
  int c = x.ncol();
  NumericMatrix sum(ng, c); // here memory needs to be allocated
  for(int j = c; j--; ) {
    NumericMatrix::Column column = x( _ , j);
    NumericMatrix::Column sumj = sum( _ , j);
    for(int i = n; i--; ) sumj[g[i]-1] += column[i];
  }
  if(fill) {
    NumericMatrix out(n, c); // allocating space for this matrix is the main speed-killer
    for(int j = c; j--; ) {
      NumericMatrix::Column colo = out( _ , j);
      NumericMatrix::Column sumj = sum( _ , j);
      for(int i = n; i--; ) colo[i] = sumj[g[i]-1];
    }
    return out;
  } else return sum;
}

// [[Rcpp::export]]
List gsuml(List x, int ng = 0, IntegerVector g = 0, bool fill = false) {
  int l = x.size(), n;
  List sum(l);
  for(int j = l; j--; ) {
    NumericVector column = x[j];
    n = column.size();
    NumericVector sumj(ng);
    for(int i = n; i--; ) sumj[g[i]-1] += column[i];
    sum[j] = sumj;
  }
  if(fill) for(int j = l; j--; ) {
    NumericVector sgj(n);
    NumericVector sumj = sum[j];
    for(int i = n; i--; ) sgj[i] = sumj[g[i]-1];
    sum[j] = sgj;
  }
  return sum;
}

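If fill = false, the group-aggregated data is returned, while if fill = true, data of the same dimensions as the input is returned, with every element replaced by its within-group sum. In both cases the list method is faster, especially with fill = true, where a full empty n x c matrix needs to be created before it is filled: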

library(microbenchmark)
testm = matrix(rnorm(10000000), ncol = 1000)
testl = as.data.frame(testm)
ng = 1000
g = sample.int(ng, 10000, replace = TRUE)

> microbenchmark(gsumm(testm,ng,g, fill = FALSE),gsuml(testl,ng,g, fill = FALSE))
Unit: milliseconds
                              expr      min       lq     mean   median       uq      max neval cld
 gsumm(testm, ng, g, fill = FALSE) 15.45847 16.28559 17.82400 16.67717 17.41415 63.40689   100   a
 gsuml(testl, ng, g, fill = FALSE) 13.61055 14.12062 16.06388 14.59342 15.45356 96.93972   100   a

> microbenchmark(gsumm(testm,ng,g, fill = TRUE),gsuml(testl,ng,g, fill = TRUE))
Unit: milliseconds
                             expr      min       lq     mean   median       uq      max neval cld
 gsumm(testm, ng, g, fill = TRUE) 34.45835 36.28886 51.42828 39.87513 60.51453 242.2054   100   b
 gsuml(testl, ng, g, fill = TRUE) 29.92314 30.69269 34.83283 31.33239 32.67136 115.8745   100  a 
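For reference, the two fill modes can be illustrated on a single toy column. This is a minimal standalone C++ sketch, not the Rcpp code benchmarked above; group_sum is a hypothetical name, and group ids are assumed to be 1-based and to lie in [1, ng], as in the R call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Grouped sum of one column; g holds 1-based group ids in [1, ng].
// fill = false: returns the ng group sums.
// fill = true : returns x.size() values, each element replaced by
//               the sum of the group it belongs to.
std::vector<double> group_sum(const std::vector<double>& x,
                              const std::vector<int>& g,
                              int ng, bool fill) {
    assert(x.size() == g.size());
    std::vector<double> sum(static_cast<std::size_t>(ng), 0.0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        assert(g[i] >= 1 && g[i] <= ng);
        sum[static_cast<std::size_t>(g[i] - 1)] += x[i];
    }
    if (!fill) return sum;

    // Expand the group sums back to the length of the input.
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = sum[static_cast<std::size_t>(g[i] - 1)];
    }
    return out;
}
```

With x = {1, 2, 3, 4}, g = {1, 2, 1, 2} and ng = 2, fill = false gives {4, 6} and fill = true gives {4, 6, 4, 6}.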

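It would be good if both approaches could be brought to the same speed, and possibly even made more efficient. I guess this requires a way of programming with matrices that avoids massive a-priori memory allocations. I am thankful for any comments and suggestions!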

Both approaches allocate the same amount of memory. You can see this in the mem_alloc column when benchmarking with bench::mark():

> bench::mark(gsumm(testm,ng,g, fill = FALSE),gsuml(testl,ng,g, fill = FALSE), check = FALSE)
# A tibble: 2 x 13
  expression                           min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                        <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = FALSE) 14.1ms 15.1ms      64.7    7.63MB     0       33     0      510ms <dbl …
2 gsuml(testl, ng, g, fill = FALSE) 12.5ms 15.1ms      67.0    7.68MB     4.19    32     2      478ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>

> bench::mark(gsumm(testm,ng,g, fill = TRUE),gsuml(testl,ng,g, fill = TRUE), check = FALSE)
# A tibble: 2 x 13
  expression                          min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                       <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = TRUE) 39.2ms 45.6ms      20.0    83.9MB     20.0     5     5      250ms <dbl …
2 gsuml(testl, ng, g, fill = TRUE) 30.3ms   32ms      26.7      84MB     20.0     8     6      299ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>
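The mem_alloc figures above are consistent with the size of the result buffers alone. The following back-of-the-envelope check is a sketch under two assumptions that are not stated in the output: doubles take 8 bytes, and bench reports units based on 1024. doubles_mib is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

// Size in MiB (2^20 bytes) of a rows x cols block of 8-byte doubles.
double doubles_mib(std::size_t rows, std::size_t cols) {
    assert(sizeof(double) == 8);  // assumption used in the estimate
    return static_cast<double>(rows * cols * sizeof(double)) / (1024.0 * 1024.0);
}
```

For fill = FALSE the result is ng x c = 1000 x 1000 doubles, so doubles_mib(1000, 1000) is about 7.63. For fill = TRUE the n x c = 10000 x 1000 output adds about 76.29, for a total of roughly 83.9. The slightly larger figures for the list version would plausibly come from the per-vector overhead of 1000 separate columns, but that is a guess.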

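However, the memory is not only allocated, which is fast anyway, but also initialized to zero everywhere. For buffers that are completely overwritten afterwards this is unnecessary, and it can be avoided by replacing Rcpp::NumericMatrix mat(rows, cols) with Rcpp::NumericMatrix mat = Rcpp::no_init(rows, cols) and Rcpp::NumericVector vec(length) with Rcpp::NumericVector vec = Rcpp::no_init(length). In your code this applies to out and sgj; the accumulators sum and sumj are updated with += and therefore still rely on being zero-initialized. If I do this with your code, both functions profit: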

> bench::mark(gsumm(testm,ng,g, fill = FALSE),gsuml(testl,ng,g, fill = FALSE), check = FALSE)
# A tibble: 2 x 13
  expression                           min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                        <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = FALSE)   13ms 14.7ms      67.1    7.63MB     0       34     0      507ms <dbl …
2 gsuml(testl, ng, g, fill = FALSE) 12.8ms 14.6ms      67.4    7.68MB     2.04    33     1      489ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>

> bench::mark(gsumm(testm,ng,g, fill = TRUE),gsuml(testl,ng,g, fill = TRUE), check = FALSE)
# A tibble: 2 x 13
  expression                          min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                       <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = TRUE) 27.5ms   31ms      26.6    83.9MB     10.7    10     4      375ms <dbl …
2 gsuml(testl, ng, g, fill = TRUE) 24.7ms 26.4ms      36.9      84MB     36.9     9     9      244ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>

I am not sure, though, why the matrix version profits more from not initializing the memory.
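The distinction between buffers that can skip initialization and those that cannot can be sketched in standalone C++. This is an analogy rather than Rcpp code: new double[n] leaves the values indeterminate, which is comparable in spirit to Rcpp::no_init, while std::vector<double>(n, 0.0) zero-fills. fill_group_sums is a hypothetical name, and group ids are again assumed to be 1-based and within [1, ng].

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Returns an array of x.size() values in which each element of x is
// replaced by the sum of its group.
std::unique_ptr<double[]> fill_group_sums(const std::vector<double>& x,
                                          const std::vector<int>& g,
                                          int ng) {
    assert(x.size() == g.size());
    const std::size_t n = x.size();

    // Accumulator: += reads each slot before it is ever assigned,
    // so this buffer must start out as zeros.
    std::vector<double> sum(static_cast<std::size_t>(ng), 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        assert(g[i] >= 1 && g[i] <= ng);
        sum[static_cast<std::size_t>(g[i] - 1)] += x[i];
    }

    // Output: every slot is assigned exactly once below and never read
    // before that, so zero-filling it first would be wasted work.
    std::unique_ptr<double[]> out(new double[n]);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = sum[static_cast<std::size_t>(g[i] - 1)];
    }
    return out;
}
```

Skipping the initialization of the accumulator as well would make the sums start from whatever happened to be in memory, so only the output buffer is a safe candidate.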