Why does re-initializing a register inside an unrolled ADD loop make it run faster even with more instructions inside the loop?

Question

I have the following code:

#include <iostream>
#include <chrono>

#define ITERATIONS "10000"

int main()
{
    /*
    ======================================
    The first case: the MOV is outside the loop.
    ======================================
    */

    auto t1 = std::chrono::high_resolution_clock::now();

    asm("mov $0, %eax\n"
        "mov $0, %ebx\n"
        "mov $" ITERATIONS ", %ecx\n"
        "lp_test_time1:\n"
        "   add %eax, %ebx\n" // 1
        "   add %eax, %ebx\n" // 2
        "   add %eax, %ebx\n" // 3
        "   add %eax, %ebx\n" // 4
        "   add %eax, %ebx\n" // 5
        "loop lp_test_time1\n");

    auto t2 = std::chrono::high_resolution_clock::now();
    auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();

    std::cout << time;

    /*
    ======================================
    The second case: the MOV is inside the loop (faster).
    ======================================
    */

    t1 = std::chrono::high_resolution_clock::now();

    asm("mov $0, %eax\n"
        "mov $" ITERATIONS ", %ecx\n"
        "lp_test_time2:\n"
        "   mov $0, %ebx\n"
        "   add %eax, %ebx\n" // 1
        "   add %eax, %ebx\n" // 2
        "   add %eax, %ebx\n" // 3
        "   add %eax, %ebx\n" // 4
        "   add %eax, %ebx\n" // 5
        "loop lp_test_time2\n");

    t2 = std::chrono::high_resolution_clock::now();
    time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    std::cout << '\n' << time << '\n';
}

I compiled it with:

gcc version 9.2.0 (GCC)
Target: x86_64-pc-linux-gnu

g++ -Wall -Wextra -pedantic -O0 -o proc proc.cpp

Its output is:

14474
5837

I also compiled it with Clang, and the result is the same.

So why is the second case faster (almost a 3x speedup)? Is it actually related to some microarchitectural detail? If it matters, I have an AMD CPU: "AMD A9-9410 RADEON R5, 5 COMPUTE CORES 2C+3G".

Answer 1

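The mov $0, %ebx inside the loop breaks the loop-carried dependency chain through ebx, allowing out-of-order execution to overlap the chains of 5 add instructions across multiple iterations.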

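Without it, the chain of add instructions bottlenecks the loop on the latency of the add critical path (1 cycle per add), instead of on add throughput (4 per cycle on Excavator, improved from 2 per cycle on Steamroller). Your CPU is an Excavator core.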

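AMD CPUs since Bulldozer have an efficient loop instruction (only 1 uop), unlike Intel CPUs, where loop would bottleneck either loop at 1 iteration per 7 cycles. (See https://agner.org/optimize/ for instruction tables, a microarchitecture guide, and more details on everything in this answer.)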

With loop and mov taking slots in the front-end (and in the back-end execution units) away from add, a 3x instead of 4x speedup looks about right.
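As a rough sanity check of the "3x instead of 4x" estimate, here is a back-of-the-envelope model in C++. It is only a sketch: it assumes a 1-cycle add latency, a sustained rate of 4 instructions per cycle applied to all 7 instructions of the second loop body (a simplification of the throughput figure quoted above), and that mov and loop each cost a single slot. The function names are hypothetical and not part of the original post.

```cpp
// Back-of-the-envelope model. All figures are assumptions based on the
// numbers quoted in the answer, not measurements.

// Latency-bound case: every add must wait for the previous add's result,
// so one iteration costs chain_length * latency cycles.
double latency_bound_cycles(int chain_length, double latency_cycles) {
    return chain_length * latency_cycles;
}

// Throughput-bound case: independent iterations overlap, so one iteration
// costs roughly (instructions in the loop body) / (instructions per cycle).
double throughput_bound_cycles(int instructions, double per_cycle) {
    return instructions / per_cycle;
}

// First loop: 5 dependent adds. Second loop: mov + 5 adds + loop = 7
// instructions, assumed to issue at 4 per cycle.
double estimated_speedup() {
    const double first = latency_bound_cycles(5, 1.0);      // 5 cycles per iteration
    const double second = throughput_bound_cycles(7, 4.0);  // 1.75 cycles per iteration
    return first / second;                                  // about 2.86
}
```

Under these assumptions the model predicts a speedup a little under 3x, whereas counting only the 5 adds (5 / 4 = 1.25 cycles per iteration) would predict 4x.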

For an introduction to how CPUs find and exploit instruction-level parallelism (ILP), see this answer.

For some in-depth details about overlapping independent dep chains, see the linked question.

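By the way, 10k iterations is not many. Your CPU might not even ramp up out of idle speed in that time. Or it might jump to max speed for most of the second loop but for none of the first. So be careful with microbenchmarks like this.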
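One common way to reduce the clock-ramp effect described above is to run the code untimed a few times first and then keep the best of several timed runs. The sketch below assumes that approach. best_of_ns and its parameters are hypothetical names, and it uses std::chrono::steady_clock rather than the question's high_resolution_clock.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: calls work() warmup_runs times without timing it,
// then timed_runs more times, and returns the shortest observed duration
// in nanoseconds.
template <typename F>
std::int64_t best_of_ns(F&& work, int warmup_runs, int timed_runs) {
    using clock = std::chrono::steady_clock;

    // Untimed warm-up: gives the CPU a chance to leave its idle clock speed.
    for (int i = 0; i < warmup_runs; ++i) {
        work();
    }

    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int i = 0; i < timed_runs; ++i) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        best = std::min(best, ns);  // keep the least-disturbed run
    }
    return best;
}
```

Each asm loop could be wrapped in a lambda and passed as work. With only 10k iterations per call, a larger iteration count or more warm-up runs may still be needed before the numbers become stable.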

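Also, your inline asm is unsafe because you forgot to declare clobbers on EAX, EBX, and ECX. You step on the compiler's registers without telling it. Normally you should always compile with optimization enabled, but your code would probably break if you did that.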
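A minimal sketch of what declaring the clobbers could look like with GNU extended asm is shown below. It is an illustration under stated assumptions, not the asker's code: add_chain is a hypothetical helper, EAX is set to 1 instead of the question's value so the result can be checked, the loop counter is passed as an operand in RCX, and the asm path assumes GCC or Clang on x86-64 (other targets take a plain C++ fallback).

```cpp
#include <cstdint>

// Hypothetical helper: runs the dependent 5-add chain from the first case
// and returns the final value of EBX. iterations must be at least 1,
// because LOOP decrements the counter before testing it.
std::uint32_t add_chain(std::uint32_t iterations) {
#if defined(__x86_64__) && defined(__GNUC__)
    std::uint32_t result;
    std::uintptr_t counter = iterations;  // full-width value, since LOOP uses RCX
    asm volatile(
        "mov $1, %%eax\n"          // %% is required in extended asm
        "mov $0, %%ebx\n"
        "1:\n"                     // local label: safe if the asm is duplicated
        "   add %%eax, %%ebx\n"    // 1
        "   add %%eax, %%ebx\n"    // 2
        "   add %%eax, %%ebx\n"    // 3
        "   add %%eax, %%ebx\n"    // 4
        "   add %%eax, %%ebx\n"    // 5
        "   loop 1b\n"
        "mov %%ebx, %[out]\n"
        : [out] "=r"(result), "+c"(counter)  // RCX is read and modified
        :
        : "eax", "ebx", "cc");               // registers and flags we overwrite
    return result;
#else
    // Assumed fallback so the sketch still builds on other targets.
    return 5u * iterations;
#endif
}
```

Because the compiler is now told which registers and flags the block overwrites, it can keep its own values out of them, so the code no longer depends on being built at -O0.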
