clang performance drop when using uniform_real_distribution

Question

The following code, which uses uniform_real_distribution, gives very different timings with g++ and clang++.

#include <iostream>
#include <sstream>
#include <fstream>

#include <chrono>
#include <random>


std::mt19937::result_type seed = 0;
std::mt19937 gen(seed);
// std::uniform_int_distribution<size_t> distr(0, 1);
std::uniform_real_distribution<double> distr(0.0,1.0);

int main()
{
    auto t_start = std::chrono::steady_clock::now();
    for (auto i = 1; i <= 1000000; ++i)
    {
        distr(gen);
    }
    auto t_end = std::chrono::steady_clock::now();
    std::cout << "elapsed time: " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_end - t_start).count()  << " ns\n" << std::endl;

    return 0;
}

It was compiled with the following commands:

clang++ -std=c++17 -O3 -flto -march=native -mllvm -inline-threshold=10000000 rng.cpp -o rng
g++ -std=c++17 -O3 -march=native rng.cpp -o rng

This results in the following timings:

clang:  272929774 ns

gcc:    12054635 ns

When the commented-out distribution (uniform_int_distribution) is used instead, the timings are:

clang:  48155862 ns

gcc:    50226810 ns

I found a rather old question that deals with the same problem, but none of the solutions proposed there work in my case:

Clang performance drop for specific C++ random number generation

Does anyone know what is going on here?

Answer 1

Take a look at the generated assembly on godbolt.

The gcc compiler has optimized away distr(gen); entirely!

.L27:
        dec     esi
        je      .L25

This is a for loop that does nothing!

With clang, the compiler is not that smart:

.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        mov     edi, offset gen
        call    double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&)
        dec     ebx
        jne     .LBB0_1

Here generate_canonical is actually called on every iteration.

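Basically, you have to use the result of distr(gen); for something that affects the observable outcome of the program; otherwise the compiler is allowed to remove that code.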

The simplest way to fix it is to accumulate the results of distr(gen); and print the sum.

Once the result is used, the assembly shows that clang calls the function std::generate_canonical<double, 53ul, std::mersenne_twister_engine< .... >> while gcc simply inlines the corresponding code.

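This difference is most likely caused by how the standard libraries are organized. Clang uses a version that is built into the standard library, whereas with gcc the templates in the headers are used to generate the code directly in the assembly being produced. When the compiler calls external code from a library, it cannot tell exactly what that code does, so it cannot optimize the call away (some side effects might be hidden in the library).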
