Stack allocation feature (performance)

Question

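While investigating a small performance problem, I noticed an interesting stack allocation behavior. Here is the template I use to measure time: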

#include <chrono>
#include <iostream>

using namespace std;
using namespace std::chrono;

int x; //for simple optimization suppression
void foo();

int main()
{   
    const size_t n = 10000000; //ten million
    auto start = high_resolution_clock::now();

    for (size_t i = 0; i < n; i++)
    {
        foo();
    }

    auto finish = high_resolution_clock::now();
    cout << duration_cast<milliseconds>(finish - start).count() << endl;
}

Now the foo() implementations. Each one allocates 500000 ints in total:

Allocated in one chunk:

void foo()
{
    const int size = 500000;
    int a1[size];

    x = a1[size - 1];
}

Result: 7.3 seconds;

Allocated in two chunks:

void foo()
{
    const int size = 250000;
    int a1[size];
    int a2[size];

    x = a1[size - 1] + a2[size - 1];
}

Result: 3.5 seconds;

Allocated in four chunks:

void foo()
{
    const int size = 125000;
    int a1[size];
    int a2[size];
    int a3[size];
    int a4[size];

    x = a1[size - 1] + a2[size - 1] +
        a3[size - 1] + a4[size - 1];
}

Result: 1.8 seconds.

And so on... I split it into 16 chunks and got a result time of 0.38 seconds.

Please explain to me why and how this happens.
I use MSVC 2013 (v120), Release build.

UPD:
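My machine is an x64 platform, and I compiled for the Win32 platform.
When I compile for the x64 platform, it takes about 40 ms in all cases.
Why does the platform choice have such a large effect?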

Answer 1

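Looking at the disassembly from VS2015 Update 3, in the 2- and 4-array versions of foo the compiler optimizes away the unused arrays, so it reserves stack space for only one array in each of those functions. Since the later functions have smaller arrays, they take less time. The assignment to x reads the same memory location for both (or all 4) arrays. (Since the arrays are uninitialized, reading from them is undefined behavior.) Without optimization, 2 or 4 distinct arrays are read.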
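To take the undefined behavior out of the picture, each array can be written before it is read. The following is only a minimal sketch, not code from the question: the function name `foo_two_chunks_defined` is hypothetical, the array size is deliberately reduced so it fits in any default stack, and going through `volatile` pointers is just one assumed way to keep the compiler from dropping or merging the accesses.

```cpp
#include <cstddef>

int x; // global sink, as in the question

// Hypothetical variant of the two-chunk foo(): every element that is read
// has been written first, so no uninitialized value is ever used.
void foo_two_chunks_defined()
{
    const std::size_t size = 25000; // assumption: smaller than in the question
    int a1[size];
    int a2[size];

    // Volatile accesses must actually be performed, so the compiler
    // has to keep both stores and both loads.
    volatile int* p1 = a1;
    volatile int* p2 = a2;
    p1[size - 1] = 1;
    p2[size - 1] = 2;

    x = p1[size - 1] + p2[size - 1];
}
```

With a variant like this, the disassembly should show two distinct arrays being addressed, which makes timing comparisons between the versions more meaningful.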

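The long running time of these functions is due to the stack probing performed by __chkstk as part of stack overflow detection (this is needed whenever the compiler requires more than 1 page of space to hold all the local variables).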
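The arithmetic behind this can be sketched as a rough model. This is not the real __chkstk (which lives in the runtime library); it only counts how many whole pages a frame spans, assuming 4 KiB pages, a 4-byte int, and, as described above, that only one array is kept per function. The names `kPageSize`, `pages_to_probe` and `frame_bytes_for_chunks` are made up for illustration.

```cpp
#include <cstddef>

// Assumption: 4 KiB pages, the usual page size on x86/x64 Windows.
constexpr std::size_t kPageSize = 4096;

// Hypothetical model: number of whole pages spanned by a frame, i.e. how
// many pages a probe loop stepping one page at a time would touch.
constexpr std::size_t pages_to_probe(std::size_t frame_bytes)
{
    return frame_bytes / kPageSize;
}

// Frame size when 500000 ints are split into `chunks` arrays and only
// one of those arrays actually gets stack space.
constexpr std::size_t frame_bytes_for_chunks(std::size_t chunks)
{
    return (500000 / chunks) * sizeof(int);
}
```

Under these assumptions the model gives 488, 244, 122 and 30 pages for 1, 2, 4 and 16 chunks. That is roughly the same proportion as the measured 7.3 s, 3.5 s, 1.8 s and 0.38 s, which is consistent with the probing cost growing with the size of the reserved frame.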

Answer 2

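You should look at the generated assembly code to see what your compiler really does with this code. For gcc/clang/icc you can use Matt Godbolt's Compiler Explorer.

clang optimizes everything away because of the UB; the result is (foo is the first version, foo2 the second):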

foo:                                    # @foo
        retq

foo2:                                   # @foo2
        retq

icc treats both versions very similarly:

foo:
        pushq     %rbp                                          #4.1
        movq      %rsp, %rbp                                    #4.1
        subq      $2000000, %rsp                                #4.1
        movl      -4(%rbp), %eax                                #8.9
        movl      %eax, x(%rip)                                 #8.5
        leave                                                   #10.1
        ret                                                     #10.1

foo2:
        pushq     %rbp                                          #13.1
        movq      %rsp, %rbp                                    #13.1
        subq      $2000000, %rsp                                #13.1
        movl      -1000004(%rbp), %eax                          #18.9
        addl      -4(%rbp), %eax                                #18.24
        movl      %eax, x(%rip)                                 #18.5
        leave                                                   #19.1
        ret

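And gcc generates different assembly code for the different versions. The code produced by version 6.1 would show behavior similar to your experiment: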

foo:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $2000016, %rsp
        movl    1999996(%rsp), %eax
        movl    %eax, x(%rip)
        leave
        ret
foo2:
        pushq   %rbp
        movl    $1000016, %edx  #only the first array is allocated
        movq    %rsp, %rbp
        subq    %rdx, %rsp
        leaq    3(%rsp), %rax
        subq    %rdx, %rsp
        shrq    $2, %rax
        movl    999996(,%rax,4), %eax
        addl    999996(%rsp), %eax
        movl    %eax, x(%rip)
        leave
        ret

So the only way to understand the difference is to look at the assembly code your compiler generates; everything else is just guessing.
