"These samplers cannot be used in parallelized code"

Question

我正在阅读 vignette for the rgen 包，它提供 headers 用于从一些常见分布中抽样。在第一段中，它说：

Please note, these samplers, just like the ones in armadillo cannot be used in parallelized code as the underlying generation routines rely upon R calls that are single-threaded.

这对我来说是个新闻，我使用 RcppArmadillo 已经有一段时间了。我想知道是否有人可以详细说明这一点（或提供我可以在哪里阅读该问题的参考）。我特别想了解 "cannot be used" 在这里的含义；结果会出错，还是不会并行化？

Answer 1

这些函数使用 R 的随机数生成器，不能在并行代码中使用，因为这会导致未定义的行为。未定义的行为几乎可以导致任何事情。从我的角度来看，如果程序崩溃，你很幸运，因为这清楚地告诉你出了问题。

HPC task view lists some RNGs that are suitable for parallel computation. But you cannot use them easily with the distributions provided by rgen or RcppDist。相反，可以执行以下操作：

从 rgen 复制多元正态分布函数并调整其签名，使其采用 std::function<double()> 作为 N(0, 1) 分布随机数的来源。
使用快速 RNG 而不是 R 的 RNG。
在并行模式下使用相同的快速 RNG。

在代码中作为快速破解：

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(dqrng)]]
#include <xoshiro.h>
#include <dqrng_distribution.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>

inline arma::mat rmvnorm(unsigned int n, const arma::vec& mu, const arma::mat& S,
                         std::function<double()> rnorm = norm_rand){
  unsigned int ncols = S.n_cols;
  arma::mat Y(n, ncols);
  Y.imbue( rnorm ) ;
  return arma::repmat(mu, 1, n).t() + Y * arma::chol(S);
}

// [[Rcpp::export]]
arma::mat defaultRNG(unsigned int n, const arma::vec& mu, const arma::mat& S) {
  return rmvnorm(n, mu, S);
}

// [[Rcpp::export]]
arma::mat serial(unsigned int n, const arma::vec& mu, const arma::mat& S) {
  dqrng::normal_distribution dist(0.0, 1.0);
  dqrng::xoshiro256plus rng(42);
  return rmvnorm(n, mu, S, [&](){return dist(rng);});
}

// [[Rcpp::export]]
std::vector<arma::mat> parallel(unsigned int n, const arma::vec& mu, const arma::mat& S, unsigned int ncores = 1) {
  dqrng::normal_distribution dist(0.0, 1.0);
  dqrng::xoshiro256plus rng(42);
  std::vector<arma::mat> res(ncores);

  #pragma omp parallel num_threads(ncores)
  {
    dqrng::xoshiro256plus lrng(rng);      // make thread local copy of rng 
    lrng.jump(omp_get_thread_num() + 1);  // advance rng by 1 ... ncores jumps 
    res[omp_get_thread_num()] = rmvnorm(n, mu, S, [&](){return dist(lrng);});
  }
  return res;
}


/*** R
set.seed(42)
N <- 1000000
M <- 100
mu <- rnorm(M)
S <- matrix(rnorm(M*M), M, M)
S <- S %*% t(S)
system.time(defaultRNG(N, mu, S))
system.time(serial(N, mu, S))
system.time(parallel(N/2, mu, S, 2))
*/

结果：

> system.time(defaultRNG(N, mu, S))
   user  system elapsed 
  6.984   1.380   6.881 

> system.time(serial(N, mu, S))
   user  system elapsed 
  4.008   1.448   3.971 

> system.time(parallel(N/2, mu, S, 2))
   user  system elapsed 
  4.824   2.096   3.080

这里真正的性能提升来自使用更快的 RNG，这是可以理解的，因为这里的重点在于许多随机数而不是矩阵运算。如果我通过使用 N <- 100000 和 M <- 1000 更多地转向矩阵运算，我得到：

> system.time(defaultRNG(N, mu, S))
   user  system elapsed 
 16.740   1.768   9.725 

> system.time(serial(N, mu, S))
   user  system elapsed 
 13.792   1.864   6.792 

> system.time(parallel(N/2, mu, S, 2))
   user  system elapsed 
 14.112   3.900   5.859

这里我们清楚地看到，在所有情况下，用户时间都大于经过的时间。这样做的原因是我正在使用的并行 BLAS 实现 (OpenBLAS)。所以在决定方法之前有很多因素需要考虑。

"These samplers cannot be used in parallelized code"

"These samplers cannot be used in parallelized code"

r

rcpp