Rcpp: List <-> Matrix conversions by reference? + Optimizing memory allocation when programming with matrices
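As I understand it, in Rcpp a Matrix is implemented as a vector with a dimension attribute, while a List is a kind of vector of different objects. Is there a trick to convert a List of vectors of equal length (i.e. a DataFrame) to a NumericMatrix (or an arma::mat, for that matter) and vice versa by reference, without copying the data column by column into the new data structure?
I suppose this is not possible, since it would be extremely useful to R users and I am sure I would have come across it. Under this assumption (which means List and Matrix methods need to be implemented separately) my question becomes how to write Rcpp functions for Lists and Matrices that are equally efficient. My experience is that Lists are more memory efficient because they allocate memory automatically as they are filled, while matrices need to be defined with all memory allocated beforehand. Consider the example below: I wrote two versions of a grouped sum, one for matrices and one for lists / data.frames: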
#include <Rcpp.h>
using namespace Rcpp;

// Grouped sum for a matrix. g holds 1-based group ids in 1..ng, one per row.
// [[Rcpp::export]]
NumericMatrix gsumm(NumericMatrix x, int ng = 0, IntegerVector g = 0, bool fill = false) {
  int n = x.nrow();
  int c = x.ncol();
  NumericMatrix sum(ng, c); // here memory needs to be allocated
  for(int j = c; j--; ) {
    NumericMatrix::Column column = x( _ , j);
    NumericMatrix::Column sumj = sum( _ , j);
    for(int i = n; i--; ) sumj[g[i]-1] += column[i];
  }
  if(fill) {
    NumericMatrix out(n, c); // allocating space for this matrix is the main speed-killer
    for(int j = c; j--; ) {
      NumericMatrix::Column colo = out( _ , j);
      NumericMatrix::Column sumj = sum( _ , j);
      for(int i = n; i--; ) colo[i] = sumj[g[i]-1];
    }
    return out;
  } else return sum;
}

// Grouped sum for a list / data.frame of equal-length numeric columns.
// [[Rcpp::export]]
List gsuml(List x, int ng = 0, IntegerVector g = 0, bool fill = false) {
  int l = x.size(), n = 0; // n is set from the columns in the loop below
  List sum(l);
  for(int j = l; j--; ) {
    NumericVector column = x[j];
    n = column.size();
    NumericVector sumj(ng); // per-column accumulator, starts at zero
    for(int i = n; i--; ) sumj[g[i]-1] += column[i];
    sum[j] = sumj;
  }
  if(fill) for(int j = l; j--; ) {
    NumericVector sgj(n); // every element is overwritten below
    NumericVector sumj = sum[j];
    for(int i = n; i--; ) sgj[i] = sumj[g[i]-1];
    sum[j] = sgj;
  }
  return sum;
}
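If fill = false, the group-aggregated data is returned, while if fill = true, data of the same dimensions is returned in which every element is replaced by its in-group sum. In both cases the list method is faster, especially with fill = true, where a full empty n x c matrix needs to be created before it is filled: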
library(microbenchmark)
testm = matrix(rnorm(10000000), ncol = 1000)
testl = as.data.frame(testm)
ng = 1000
g = sample.int(ng, 10000, replace = TRUE)
> microbenchmark(gsumm(testm,ng,g, fill = FALSE),gsuml(testl,ng,g, fill = FALSE))
Unit: milliseconds
expr min lq mean median uq max neval cld
gsumm(testm, ng, g, fill = FALSE) 15.45847 16.28559 17.82400 16.67717 17.41415 63.40689 100 a
gsuml(testl, ng, g, fill = FALSE) 13.61055 14.12062 16.06388 14.59342 15.45356 96.93972 100 a
> microbenchmark(gsumm(testm,ng,g, fill = TRUE),gsuml(testl,ng,g, fill = TRUE))
Unit: milliseconds
expr min lq mean median uq max neval cld
gsumm(testm, ng, g, fill = TRUE) 34.45835 36.28886 51.42828 39.87513 60.51453 242.2054 100 b
gsuml(testl, ng, g, fill = TRUE) 29.92314 30.69269 34.83283 31.33239 32.67136 115.8745 100 a
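It would be nice if both approaches could be brought to the same speed, or perhaps even made more efficient. I guess that would require a way of programming with matrices that avoids large a priori memory allocations. I am thankful for any comments and suggestions!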
Both methods allocate the same amount of memory. You can see this in the mem_alloc column when benchmarking with bench::mark():
> bench::mark(gsumm(testm,ng,g, fill = FALSE),gsuml(testl,ng,g, fill = FALSE), check = FALSE)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result
<bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list>
1 gsumm(testm, ng, g, fill = FALSE) 14.1ms 15.1ms 64.7 7.63MB 0 33 0 510ms <dbl …
2 gsuml(testl, ng, g, fill = FALSE) 12.5ms 15.1ms 67.0 7.68MB 4.19 32 2 478ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>
> bench::mark(gsumm(testm,ng,g, fill = TRUE),gsuml(testl,ng,g, fill = TRUE), check = FALSE)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result
<bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list>
1 gsumm(testm, ng, g, fill = TRUE) 39.2ms 45.6ms 20.0 83.9MB 20.0 5 5 250ms <dbl …
2 gsuml(testl, ng, g, fill = TRUE) 30.3ms 32ms 26.7 84MB 20.0 8 6 299ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>
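The mem_alloc values are consistent with simple arithmetic on the buffers the two functions create. The sketch below is plain C++ rather than Rcpp, and the names payload_bytes and to_mib are made up for illustration. It counts only the 8-byte doubles, ignores R's per-object headers, and assumes that bench reports binary megabytes as "MB".

```cpp
#include <cstddef>

// Hypothetical helpers for illustration; not part of Rcpp or bench.
// Bytes of double payload in `count` separate buffers of `len` doubles each.
constexpr std::size_t payload_bytes(std::size_t len, std::size_t count) {
    return len * count * sizeof(double);
}

// Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes).
constexpr double to_mib(std::size_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// Benchmark dimensions taken from the question.
constexpr std::size_t n  = 10000;  // rows
constexpr std::size_t c  = 1000;   // columns
constexpr std::size_t ng = 1000;   // groups

// Matrix version: one contiguous ng x c buffer for the sums,
// plus one contiguous n x c buffer when fill = true.
constexpr std::size_t matrix_sums = payload_bytes(ng * c, 1);
constexpr std::size_t matrix_fill = matrix_sums + payload_bytes(n * c, 1);

// List version: c separate buffers of length ng for the sums,
// plus c separate buffers of length n when fill = true.
constexpr std::size_t list_sums = payload_bytes(ng, c);
constexpr std::size_t list_fill = list_sums + payload_bytes(n, c);

// Same number of doubles either way; only the number of allocations differs.
static_assert(matrix_sums == list_sums && matrix_fill == list_fill,
              "both layouts hold the same payload");
```

With 8-byte doubles, both layouts come to 8,000,000 bytes (about 7.63 MiB) for the sums and 88,000,000 bytes (about 83.9 MiB) with fill = true, which matches the table. The slightly larger figures in the list rows are presumably the headers of the 1000 separately allocated column vectors. Those separate allocations are also the reason a list of columns cannot simply be viewed as a matrix: a matrix needs all of its columns in one contiguous buffer, so converting between the two involves a copy.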
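However, the memory is not only allocated, which is fast anyway, but also initialized to zero everywhere. That is unnecessary for any buffer that is completely overwritten before it is read, such as out and sgj in your code (the accumulators sum and sumj are updated with += and still need to start at zero). It can be avoided by replacing Rcpp::NumericMatrix mat(rows, cols) with Rcpp::NumericMatrix mat = Rcpp::no_init(rows, cols), and Rcpp::NumericVector vec(length) with Rcpp::NumericVector vec = Rcpp::no_init(length). If I do this with your code, both functions profit: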
> bench::mark(gsumm(testm,ng,g, fill = FALSE),gsuml(testl,ng,g, fill = FALSE), check = FALSE)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result
<bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list>
1 gsumm(testm, ng, g, fill = FALSE) 13ms 14.7ms 67.1 7.63MB 0 34 0 507ms <dbl …
2 gsuml(testl, ng, g, fill = FALSE) 12.8ms 14.6ms 67.4 7.68MB 2.04 33 1 489ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>
> bench::mark(gsumm(testm,ng,g, fill = TRUE),gsuml(testl,ng,g, fill = TRUE), check = FALSE)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result
<bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list>
1 gsumm(testm, ng, g, fill = TRUE) 27.5ms 31ms 26.6 83.9MB 10.7 10 4 375ms <dbl …
2 gsuml(testl, ng, g, fill = TRUE) 24.7ms 26.4ms 36.9 84MB 36.9 9 9 244ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>
I am not sure, though, why the matrix version does not profit more from skipping the initialization.
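One caveat when applying this: only buffers that are fully overwritten can skip initialization. The following is a minimal sketch in plain C++ (not Rcpp, so that it stands alone), assuming column-major storage as R uses for matrices. The name group_fill is made up for illustration, and the comments mark which Rcpp declaration each buffer corresponds to.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Plain C++ stand-in for the fill = true path of gsumm (hypothetical name).
// x is column-major with n rows and c columns; g holds 1-based group ids
// in 1..ng, one per row, as in the question.
std::unique_ptr<double[]> group_fill(const std::vector<double>& x,
                                     std::size_t n, std::size_t c,
                                     const std::vector<int>& g, std::size_t ng) {
    assert(x.size() == n * c && g.size() == n);

    // Accumulator: updated with +=, so it must start at zero.
    // Rcpp analogue: NumericMatrix sum(ng, c);
    std::vector<double> sum(ng * c, 0.0);
    for (std::size_t j = 0; j < c; ++j)
        for (std::size_t i = 0; i < n; ++i)
            sum[static_cast<std::size_t>(g[i] - 1) + j * ng] += x[i + j * n];

    // Output: every element is assigned below before it is read, so zeroing
    // it first would be wasted work. new double[...] leaves the values
    // uninitialized. Rcpp analogue: NumericMatrix out = no_init(n, c);
    std::unique_ptr<double[]> out(new double[n * c]);
    for (std::size_t j = 0; j < c; ++j)
        for (std::size_t i = 0; i < n; ++i)
            out[i + j * n] = sum[static_cast<std::size_t>(g[i] - 1) + j * ng];
    return out;
}
```

The same reasoning would apply to gsuml: sgj can be created with no_init(n), while sumj should keep the zero-initializing NumericVector sumj(ng).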