为什么这两个执行相同操作的脚本之间存在如此大的性能差异？

Question

这是 Euler Project 的 problem36。将所有以 2 为底和以 10 为底的回文数字求和。

我最初尝试以更实用的方式解决它。

运行不到 6 秒。

[1..1_000_000]
    .grep( * !%% 2 )
    .grep( -> $x { $x == $x.flip } )
    .grep( -> $y { $y.base(2) == $y.base(2).flip } )
    .sum.say

令人惊讶的是，这花了 12 秒，即使我只生成奇数，因此 even 跳过测试。

(1,3 ... 1_000_000)
    .grep( -> $x { $x == $x.flip } )
    .grep( -> $y { $y.base(2) == $y.base(2).flip } )
    .sum.say

这将在大约 3 秒内运行。

my @pals;
for (1,3 ... 1_000_000) -> $x {
    next unless $x == $x.flip;
    next unless $x.base(2) == $x.base(2).flip;
    @pals.push($x);
}

say [+] @pals;

我还注意到使用

之间存在显着差异

for (1,3 ... 1_000_000) -> $x { ...

和

for [1,3 ... 1_000_000] -> $x { ...

有人知道为什么流式版本比迭代版本慢这么多吗？而且，为什么这两个 for 循环在性能上会有如此不同？

Answer 1

构造 [...] 是一个数组组合器。它急切地迭代其中找到的可迭代对象，并将每个值存储到数组中。只有这样我们才能继续进行迭代。这会导致更多的内存分配并且对缓存的友好性降低。相比之下，括号什么都不做（除了分组，但除此之外它们不添加任何语义）。因此：

[1..1_000_000]
    .grep( * !%% 2 )
    .grep( -> $x { $x == $x.flip } )
    .grep( -> $y { $y.base(2) == $y.base(2).flip } )
    .sum.say

将分配并建立一个百万元数组并对其进行迭代，同时：

(1..1_000_000)
    .grep( * !%% 2 )
    .grep( -> $x { $x == $x.flip } )
    .grep( -> $y { $y.base(2) == $y.base(2).flip } )
    .sum.say

运行得更快，因为它不需要那样做。

此外，... 运算符目前比 .. 运算符慢得多。它并非注定永远如此，只是到目前为止受到的关注要少得多。由于 .grep 也得到了很好的优化，事实证明，过滤掉范围内的元素会更快 - 无论如何，现在。

最后，使用 == 比较 base 和 flip 的（字符串）结果效率不高，因为它会将它们解析回整数，而我们可以使用 eq 并比较字符串：

(1 .. 1_000_000)
    .grep(* !%% 2)
    .grep( -> $x { $x eq $x.flip } )
    .grep( -> $y { $y.base(2) eq $y.base(2).flip } )
    .sum.say

Answer 2

如果你想要更快的东西，你可以编写自己的序列生成器。

gather {
  loop (my int $i = 1; $i < 1_000_000; $i += 2) {
    take $i
  }
}
.grep( -> $x { $x eq $x.flip } )
.grep( -> $y { $y.base(2) eq $y.base(2).flip } )
.sum.say

这大约需要 4 秒。

或者为了更快，您可以自己创建 Iterator 对象。

class Odd does Iterator {
    has uint $!count = 1;

    method pull-one () {
        if ($!count += 2) < 1_000_000 {
            $!count
        } else {
            IterationEnd
        }
    }
}

Seq.new(Odd.new)
.grep( -> $x { $x == $x.flip } )
.grep( -> $y { $y.base(2) == $y.base(2).flip } )
.sum.say

这只需要大约 2 秒。

当然，如果你想尽可能快，完全摆脱序列迭代。

也使用原生 ints.

同时缓存以 10 为基数的字符串。 (my $s = ~$x)

my int $acc = 0;
loop ( my int $x = 1; $x < 1_000_000; $x += 2) {
  next unless (my $s = ~$x) eq $s.flip;
  next unless $x.base(2) eq $x.base(2).flip;
  $acc += $x
}
say $acc;

这会缩短到大约 0.45 秒。

（缓存 .base(2) 似乎没有任何作用。）

如果不直接使用 nqp 操作，这可能接近最小值。

我试着写了一个原生的 int bit flipper，但是它变慢了。 0.5 秒。
（我没有想出这个算法，我只是把它适配到 Raku。我还添加了 +> $in.msb 来适应这个问题。）

我猜 spesh 将离开不需要在那里的操作。
或者它可能 JIT 不是很好。

对于大于 1_000_000 的值，它的性能可能更高。
（.base(2).flip 是 O(log n) 而这是 O(1)。）

sub flip-bits ( int $in --> int ) {
  my int $n =
       ((($in +& (my int $ = 0xaaaaaaaa)) +> 1) +| (($in +& (my int $ = 0x55555555)) +< 1));
  $n = ((($n  +& (my int $ = 0xcccccccc)) +> 2) +| (($n  +& (my int $ = 0x33333333)) +< 2));
  $n = ((($n  +& (my int $ = 0xf0f0f0f0)) +> 4) +| (($n  +& (my int $ = 0x0f0f0f0f)) +< 4));
  $n = ((($n  +& (my int $ = 0xff00ff00)) +> 8) +| (($n  +& (my int $ = 0x00ff00ff)) +< 8));
  ((($n +> 16) +| ($n+< 16)) +> (32 - 1 - $in.msb)) +& (my int $ = 0xffffffff);
}

…

  # next unless (my $s = ~$x) eq $s.flip;
  next unless $x == flip-bits($x);

您甚至可以尝试使用多线程。

请注意，此工作量太小，无法发挥作用。
使用线程的开销抵消了任何好处。

my atomicint $total = 0;

sub process ( int $s, int $e ) {
  # these are so the block lambda works properly
  # (works around what I think is a bug)
  my int $ = $s;
  my int $ = $e;

  start {
    my int $acc = 0;
    loop ( my int $x = $s; $x < $e; $x += 2) {
      next unless (my $s = ~$x) eq $s.flip;
      next unless $x.base(2) eq $x.base(2).flip;
      $acc += $x;
    }
    $total ⚛+= $acc;
  }
}


my int $cores = (Kernel.cpu-cores * 2.2).Int;

my int $per = 1_000_000 div $cores;
++$per if $per * $cores < 1_000_000;

my @promises;

my int $start = 1;
for ^$cores {
  my int $end = $start + $per - 2;
  $end = 1_000_000 if $end > 1_000_000;

  push @promises, process $start, $end;

#say $start, "\t", $end;

  $start = $end + 2;
}

await @promises;
say $total;

运行时间约为 0.63 秒。
（我弄乱了 2.2 值以在我的计算机上找到接近最短的时间。）

为什么这两个执行相同操作的脚本之间存在如此大的性能差异？

Why is there such a large performance difference between these two scrips that do the same thing?

raku