STL random_shuffle generates highly correlated sequences

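Note: the accepted answer states that the problem is reseeding. Reseeding is not the cause. Tests without reseeding produced high correlations before I posted. See Note 1.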

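I generated 1,000,000 uniform random numbers in R, sorted the sequence, and called std::random_shuffle() to permute a copy of that sequence 100 times. The 100 permuted sequences turned out to be extremely correlated. However, if I do not sort the uniform numbers first, the 100 permuted sequences are more or less uncorrelated. The code is below.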

// [[Rcpp::export]]
IntegerVector testRandomShuffle(IntegerVector x, int rd) // rd is the seed
{
  IntegerVector y(x.begin(), x.end()); // copy
  std::srand(rd); // seeding
  std::random_shuffle(y.begin(), y.end());
  return y;
}


/***R
v = runif(1000000)
vSorted = sort(v)
sqc = 1L : length(v) # indexes
rd = sample.int(length(v), 100) # random seeds


# Compute correlation matrices
corMatForUnsorted = cor(as.data.frame(lapply(rd, function(x) 
  v[testRandomShuffle(sqc, x)])))
corMatForSorted = cor(as.data.frame(lapply(rd, function(x) 
  vSorted[testRandomShuffle(sqc, x)])))


# plot histograms
par(mfrow = c(1, 2)) 
hist(corMatForUnsorted[abs(corMatForUnsorted) < 1], breaks = 200, xlab = 
  "Correlation for unsorted")
hist(corMatForSorted[abs(corMatForSorted) < 1], breaks = 200, xlab = 
  "Correlation for sorted")
*/

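Am I doing something wrong? I simply expected that shuffling the sorted and the unsorted sequences would yield more or less the same distribution of correlations. How small those correlations should be is another story. The same experiment using R's native function sample.int() for the permutation produced low correlations in both cases.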

Thanks!

Note 1: the problem is that I am using Rtools 3.4 on Windows, which ships with g++ 4.9.3. The shuffle function in this version of the C++ library does not work properly.

Note 2: confirmed that Rcpp::sample() works in a multithreaded setting. A small test case:

// [[Rcpp::depends(RcppParallel)]]
# include <RcppParallel.h>
# include <Rcpp.h>
using namespace Rcpp;


struct testSampleInPara: public RcppParallel::Worker
{
  IntegerVector tmp;
  List rst;


  void operator() (std::size_t st, std::size_t end)
  {
    if(st == 0)
    {
      // is tmp / rst a copy or a reference ?
      std::cout << std::to_string((std::size_t)&tmp[0]) + "\n";
      IntegerVector rst0 = Rcpp::sample(tmp, 5);
      rst[0] = rst0; // assume rst not a copy
    }
    else // if(st == 1)
    {
      std::cout << std::to_string((std::size_t)&tmp[0]) + "\n";
      IntegerVector rst1 = Rcpp::sample(tmp, 10);
      rst[1] = rst1;
    }
  }


  testSampleInPara(IntegerVector tmp, List rst):
    tmp(tmp), rst(rst)
  {
    RcppParallel::parallelFor(0, 2, *this);
  }
};


// [[Rcpp::export]]
List testIfSampleCopy(IntegerVector tmp)
{
  List rst(2);
  testSampleInPara(tmp, rst);
  return rst;
}

/***R
testIfSampleCopy(1L : 10L)
# printout:
# 356036792
# 356036792
# [[1]]
# [1] 10  5  9  7  8
# 
# [[2]]
#  [1] 10  3  7  6  2  1  8  4  9  5
*/

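My experience with Rcpp containers has been poor in terms of multithreaded performance. I usually create pointers, or arrays of pointers, to the first elements of the Rcpp containers and share those pointers and the container sizes among the threads. Note that Rcpp::sample() takes and returns Rcpp containers.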

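Note 3: from reading the Rcpp source code, the best solution is to write a custom sample() in native C++. The core component of Rcpp::sample() is unif_rand(). Integrate unif_rand() into the modern version of the Fisher-Yates shuffle. Problem solved.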

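Note 4: using unif_rand() in a multithreaded environment slows the threads down substantially. I have not had time to read the documentation as Dirk Eddelbuettel suggested, but my guess is that R's source synchronizes unif_rand() in a way that is invisible to us, like malloc() in C. The final solution is to include // [[Rcpp::plugins("cpp11")]] and use the C++11 <random> header.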
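A minimal sketch of what Note 4 amounts to, assuming each worker thread is handed a distinct index (for example, derived from the range passed to a RcppParallel worker's operator()). The helper name shuffleForWorker and the seeding scheme are assumptions for illustration. The point is only that each worker owns its engine, so no generator state is shared and no locking is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: shuffle 0..n-1 with an engine owned by one worker.
// The engine is local to the call, so concurrent workers never touch the
// same generator state.
std::vector<int> shuffleForWorker(std::size_t n,
                                  std::uint32_t baseSeed,
                                  std::uint32_t workerIndex) {
  // Mix the shared base seed with the worker index so that each worker
  // gets its own reproducible stream.
  std::seed_seq seq{baseSeed, workerIndex};
  std::mt19937 engine(seq);

  std::vector<int> v(n);
  std::iota(v.begin(), v.end(), 0);
  std::shuffle(v.begin(), v.end(), engine);
  return v;
}
```

The result is an ordinary std::vector, so it would still have to be copied into an Rcpp container on the main thread.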

std::random_shuffle(begin, end) often uses std::rand, which is known to be a poor random number generator. From cppreference:

rand() is not recommended for serious random-number generation needs. It is recommended to use C++11's random number generation facilities to replace rand().

Use std::shuffle instead:

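#include <algorithm> // std::shuffle
#include <random>    // std::mt19937, std::random_device
#include <Rcpp.h>
using namespace Rcpp;

// Note the lack of `int rd`. `std::random_device` is better for
// seeding purposes, but it is non-deterministic.
IntegerVector testShuffle(IntegerVector x)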
{
  IntegerVector y(x.begin(), x.end()); // copy

  // std::mt19937 is a rather heavy type. As such, it's often recommended
  // to make it a static variable. If you will be calling this function
  // from multiple threads, you'd want to make it `thread_local` instead
  // of `static` (or otherwise avoid the data race on `engine`).
  static std::mt19937 engine = [] {
    // Using the Immediately Invoked Lambda Expression (IILE) idiom to
    // initialize the static variable.

    // Seed the RNG.
    std::random_device rd;

    // Note that there are better ways to seed the mersenne twister.
    // This way is flawed, as it can't possibly initialize all of the
    // mersenne twister's state, but it's the simplest way for
    // demonstration purposes
    std::mt19937 engine(rd());

    return engine;
  }();

  // You should be able to just use y.begin(), y.end()
  std::shuffle(y.begin(), y.end(), engine);
  return y;
}
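The comment above mentions that there are better ways to seed the Mersenne Twister. One commonly used idiom, sketched here under the assumption that std::random_device is usable on the target platform, is to draw as many words as the engine has state and pass them through std::seed_seq. The name makeSeededEngine is illustrative only, and std::seed_seq has limitations of its own, so treat this as an improvement rather than a perfect fix.

```cpp
#include <algorithm>
#include <array>
#include <functional>
#include <random>

// Hypothetical helper: build a std::mt19937 seeded with state_size (624)
// words from std::random_device instead of a single 32-bit value.
std::mt19937 makeSeededEngine() {
  std::random_device rd;
  std::array<std::random_device::result_type, std::mt19937::state_size> seedData;
  // std::ref avoids copying rd, which is not copyable.
  std::generate(seedData.begin(), seedData.end(), std::ref(rd));
  std::seed_seq seq(seedData.begin(), seedData.end());
  return std::mt19937(seq);
}
```

The static initializer in testShuffle could then read `static std::mt19937 engine = makeSeededEngine();`.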

If you want a deterministic seed, note that a single int does not carry enough information to fully seed a std::mt19937, but you can still use it:

IntegerVector testShuffle(IntegerVector x, int seed)
{
  IntegerVector y(x.begin(), x.end());

  static std::mt19937 engine;

  // Not thread-friendly, but simple.
  // Also, note that you'll get bad results if you seed a mersenne twister
  // (or a lot of RNGs) with 0, so avoid that
  engine.seed(seed);

  std::shuffle(y.begin(), y.end(), engine);
  return y;
}

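Your statistical intuition and your use of random number generators are not quite right. If I take your code, add the missing include for Rcpp.h and the namespace directive, and then simply comment out the reseeding, the two histograms overlap as you expected.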

Code below.

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector testRandomShuffle(IntegerVector x, int rd) { // rd is the seed
  IntegerVector y(x.begin(), x.end()); // copy
  //std::srand(rd); // seeding
  std::random_shuffle(&y[0], &*y.end());
  return y;
}


/***R
#v = runif(1000000)
v = runif(10000)
vSorted = sort(v)
sqc = 1L : length(v) # indexes
rd = sample.int(length(v), 100) # random seeds


# Compute correlation matrices
corMatForUnsorted = cor(as.data.frame(lapply(rd, function(x) 
  v[testRandomShuffle(sqc, x)])))
corMatForSorted = cor(as.data.frame(lapply(rd, function(x) 
  vSorted[testRandomShuffle(sqc, x)])))


# plot histograms
par(mfrow = c(1, 2)) 
hist(corMatForUnsorted[abs(corMatForUnsorted) < 1], breaks = 200, xlab = 
  "Correlation for unsorted")
hist(corMatForSorted[abs(corMatForSorted) < 1], breaks = 200, xlab = 
  "Correlation for sorted")
*/

I also lowered N by two orders of magnitude. That is sufficient.
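As a cross-check that does not depend on R or on std::rand, here is a minimal standard-library sketch of the same experiment for a single pair of shuffles. It seeds one engine once, shuffles the sorted sequence twice, and computes the Pearson correlation of the two results. The names pearson and correlationOfTwoShuffles are illustrative, and std::mt19937 is swapped in for whatever generator std::random_shuffle uses on a given platform.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Pearson correlation of two equally sized, non-constant sequences.
double pearson(const std::vector<double>& a, const std::vector<double>& b) {
  const double n = static_cast<double>(a.size());
  const double meanA = std::accumulate(a.begin(), a.end(), 0.0) / n;
  const double meanB = std::accumulate(b.begin(), b.end(), 0.0) / n;
  double sab = 0.0, saa = 0.0, sbb = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double da = a[i] - meanA;
    const double db = b[i] - meanB;
    sab += da * db;
    saa += da * da;
    sbb += db * db;
  }
  return sab / std::sqrt(saa * sbb);
}

// Correlation between two shuffles of the sorted sequence 0..n-1,
// both drawn from ONE engine that is seeded exactly once.
double correlationOfTwoShuffles(std::size_t n, unsigned seed) {
  std::vector<double> sorted(n);
  std::iota(sorted.begin(), sorted.end(), 0.0);

  std::mt19937 engine(seed);
  std::vector<double> first = sorted;
  std::vector<double> second = sorted;
  std::shuffle(first.begin(), first.end(), engine);
  std::shuffle(second.begin(), second.end(), engine);
  return pearson(first, second);
}
```

For independent uniform permutations of n elements, the correlation should be on the order of 1/sqrt(n), so values far above that for n = 10000 would point to a problem with the shuffle or the generator.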

Edit: For completeness, here is a pure Rcpp version that uses only one RNG and works wherever Rcpp works, including Windows with g++-4.9.3:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector testRandomShuffle(IntegerVector x, int rd) { // rd is the seed
  IntegerVector y(x.begin(), x.end()); // copy
  std::random_shuffle(&y[0], &*y.end());
  return y;
}

// [[Rcpp::export]]
IntegerVector testRandomSample(IntegerVector x) { // uses R's RNG; seed it with set.seed()
  IntegerVector y(x.begin(), x.end()); // copy
  return sample(y, y.size());
}

/***R
set.seed(123)  # now we're reproducible
v <- runif(10000)
vSorted <- sort(v)
sqc <- 1L : length(v) # indexes
rd <- sample.int(length(v), 100) # random seeds

# Compute correlation matrices
corMatForUnsorted = cor(as.data.frame(lapply(rd, function(x) 
  v[testRandomSample(sqc)])))
corMatForSorted = cor(as.data.frame(lapply(rd, function(x) 
  vSorted[testRandomSample(sqc)])))

# plot histograms
par(mfrow = c(1, 2)) 
hist(corMatForUnsorted[abs(corMatForUnsorted) < 1], breaks = 200, 
     xlab = "Correlation for unsorted", main="Unsorted")
hist(corMatForSorted[abs(corMatForSorted) < 1], breaks = 200, 
     xlab = "Correlation for sorted", main="Sorted")
*/

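It still contains the old version, which is unused. The resulting plot is now:

[Figure: histograms of the pairwise correlations, panels "Unsorted" and "Sorted"]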

For completeness, Rcpp's sample() routine is also faster in a benchmark:

R> library(rbenchmark)
R> benchmark(testRandomShuffle(x, 1), testRandomSample(x))[,1:4]
                     test replications elapsed relative
2     testRandomSample(x)          100   1.402    1.000
1 testRandomShuffle(x, 1)          100   1.868    1.332
R>