Why am I getting a bad random distribution with Perl's List::Util::shuffle?

Question

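I have a collection of several hundred vinyl records, organized alphanumerically by catalog ID string. I wrote a script that picks 20 records from the collection at random by sampling from a shuffled list of catalog IDs. However, the records it selects often do not look well distributed to me. Quite often it picks 2 records with consecutive catalog IDs, and/or several records that sit close to each other. When picking 20 records out of 800, that should happen only rarely.

I store the list of catalog IDs in the array @selection. To draw 20 random items from it, I shuffle the array and keep the first 20 elements: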

@selection = (shuffle @selection)[0 .. 19];

In desperation, I tried this ugly technique to try to force better randomness, but it does not seem to make any difference:

@selection = shuffle @selection; sleep 1;
@selection = reverse @selection; sleep 1;
@selection = (shuffle @selection)[0 .. 19];

Answer 1

There are C(800, 20) = 3.73 × 10³⁹ ways to choose 20 titles out of 800.

There are C(781, 20) = 2.29 × 10³⁹ ways to choose 20 titles out of 800 such that no two of them are adjacent.^[1]
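The count C(781, 20) is an instance of the standard result that k items can be chosen from n items in a row, with no two adjacent, in C(n - k + 1, k) ways (here 800 - 20 + 1 = 781). A minimal sketch that checks this formula by brute force on a small case follows; the helper names choose and count_nonadjacent are made up for this illustration, and the enumeration is only practical for small n.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Binomial coefficient C(n, k); every intermediate value is itself a
# binomial coefficient, so the division is exact for small inputs.
sub choose {
    my ($n, $k) = @_;
    return 0 if $k < 0 || $k > $n;
    my $c = 1;
    $c = $c * ($n - $k + $_) / $_ for 1 .. $k;
    return $c;
}

# Hypothetical helper: count the k-element subsets of n items in a row that
# contain no two neighboring items, by trying every subset as a bitmask.
sub count_nonadjacent {
    my ($n, $k) = @_;
    my $count = 0;
    for my $mask (0 .. 2**$n - 1) {
        next if $mask & ($mask << 1);        # two neighboring bits are set
        my $binary = sprintf '%b', $mask;
        my $bits   = ($binary =~ tr/1//);    # number of chosen items
        ++$count if $bits == $k;
    }
    return $count;
}

my ($n, $k) = (10, 3);
printf "brute force: %d, C(%d, %d): %d\n",
    count_nonadjacent($n, $k), $n - $k + 1, $k, choose($n - $k + 1, $k);
```

If the two printed numbers agree for a handful of small (n, k) pairs, that is reasonable evidence the same formula can be trusted for n = 800, k = 20, where enumeration is out of the question.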

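So there is a (2.29 × 10³⁹) / (3.73 × 10³⁹) = 61.4% chance of picking a set that contains no adjacent titles.

That leaves a 1 - 61.4% = 38.6% chance of picking a set that does contain adjacent titles.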
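The same probability can be computed without handling 40-digit numbers, because C(781, 20) / C(800, 20) equals the product of (781 - i) / (800 - i) for i = 0 .. 19. The sketch below does that; the function name p_no_close_pair and its optional third argument are assumptions added for illustration. The third argument generalizes "adjacent" to "within d positions", using the standard count C(n - (k - 1)d, k) for selections whose members are all more than d apart, which speaks to the "several records close to each other" part of the question. The printed adjacency figure may differ slightly in the last digit from the one above, since that one divides two rounded counts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Probability that a uniformly random k-subset of n items in a row has no
# two members within $d positions of each other:
#     C(n - (k-1)*d, k) / C(n, k)
# evaluated as a product of ratios so no huge intermediate value appears.
sub p_no_close_pair {
    my ($n, $k, $d) = @_;
    $d //= 1;                        # d = 1 means "no adjacent pair"
    my $m = $n - ($k - 1) * $d;      # size of the reduced pool
    return 0 if $m < $k;             # impossible to spread k items that far
    my $p = 1;
    $p *= ($m - $_) / ($n - $_) for 0 .. $k - 1;
    return $p;
}

printf "P(at least one adjacent pair)        = %.1f%%\n",
    100 * (1 - p_no_close_pair(800, 20));
printf "P(at least one pair within 5 places) = %.1f%%\n",
    100 * (1 - p_no_close_pair(800, 20, 5));
```

Widening d can only make a "close" pair more likely, so seeing a few near neighbors in a 20-record draw is consistent with an unbiased shuffle.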

Now that we know what to expect, let's test shuffle.

Test:

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw( shuffle );

my $num_tests = 100_000;
my $N = 800;
my @titles = 0..($N-1);
my $has_adjacent_titles = 0;
for (1..$num_tests) {
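   # Take the first 20 titles of a fresh shuffle, then sort them numerically.
   my @shuffled_selection = ( shuffle(@titles) )[0..19];
   my @ordered = sort { $a <=> $b } @shuffled_selection;
   # Count this trial if any two neighbors in sorted order differ by exactly 1.
   ++$has_adjacent_titles if grep { $ordered[$_-1]+1 == $ordered[$_] } 1..$#ordered;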
}

printf "%.1f%%\n", $has_adjacent_titles / $num_tests * 100;

Output:

>a.pl
38.6%

>a.pl
38.8%

>a.pl
38.5%

It looks like shuffle is working just fine.

[1] See "Combinatorial restriction on choosing adjacent objects".

Tags: arrays, perl, shuffle