同一个C程序的三个版本，为什么第一个这么快？

Question

这是一个非常简单的 C 程序：

int main()
{
    int n = 0;

    while(n != 1000000000){
        n += 1;
    }

return n;

}

我是用Clang编译的，并计时的。它运行 4.711095243692398e-06 秒，或 0.000004711095243692398 秒。

接下来，我使用 Godbolt 编译器资源管理器 (https://godbolt.org) 将 C 程序输出为英特尔语法汇编语言，以删除 .cfi 指令：

.file   "Svx.c"
.intel_syntax noprefix
.text
.globl  main
.type   main, @function
main:
    push    rbp
    mov rbp, rsp
    mov DWORD PTR -4[rbp], 0
    jmp .L2
.L3:
    add DWORD PTR -4[rbp], 1
.L2:
    cmp DWORD PTR -4[rbp], 1000000000
    jne .L3
    mov eax, DWORD PTR -4[rbp]
    pop rbp
    ret
    .size   main, .-main
    .ident  "GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0"
    .section    .note.GNU-stack,"",@progbits

我是用GCC编译的，然后计时。结果是 1.96 秒——比 Clang 版本慢得多。

最后，我创建了自己的汇编版本：

[BITS 64]
[default rel]

global main:function

section .data align=16

section .text

main:
xor rax,rax
l_01:
cmp rax,1000000000
je l_02
add rax,5
jmp l_01
l_02:
ret

我用nasm编译它并用ld链接它：

sudo nasm -felf64 Svx.asm

sudo ld -shared Svx.o -o Svx.so

并计时。它运行 0.14707629615440965 秒。

如果反向编译版本运行慢得多（0.0000047 秒 vs 1.96 秒）而我的 NASM 版本运行s 在 0.147 秒内？我感觉 C 版本在 0.0000047 秒的结果是错误的；这似乎快得不可思议。这是它对汇编语言的 Clang 输出：

    .text
    .intel_syntax noprefix
    .file   "Svx.c"
    .globl  main                     # -- Begin function main
    .p2align    4, 0x90
    .type   main,@function
main:                                    # @main
    .cfi_startproc
# %bb.0:
    push    rbp
    .cfi_def_cfa_offset 16
    .cfi_offset rbp, -16
    mov rbp, rsp
    .cfi_def_cfa_register rbp
    mov dword ptr [rbp - 4], 0
.LBB0_1:                                # =>This Inner Loop Header:     Depth=1
    cmp dword ptr [rbp - 4], 1000000000
    je  .LBB0_3
# %bb.2:                                #   in Loop: Header=BB0_1   Depth=1
    mov eax, dword ptr [rbp - 4]
    add eax, 1
    mov dword ptr [rbp - 4], eax
    jmp .LBB0_1
.LBB0_3:
    mov eax, dword ptr [rbp - 4]
    pop rbp
    .cfi_def_cfa rsp, 8
    ret
.Lfunc_end0:
    .size   main, .Lfunc_end0-main
    .cfi_endproc
                                        # -- End function
    .ident  "clang version 8.0.0-3~ubuntu18.04.1 (tags/RELEASE_800/final)"
    .section    ".note.GNU-stack","",@progbits
    .addrsig

列表显示他们使用变量堆栈，而不是寄存器，这（通常）速度较慢。

速度 0.0000047 秒，数到十亿似乎快得不可思议。如果这个速度是正确的，它的秘密是什么？逆向工程没有揭示任何东西，事实上 Godbolt 版本要慢得多。

Answer 1

Clang 只是意识到这个循环是运行 1000000000 次，并且相当于 return 1000000000;.

我使用 -O3 的输出，正如您指定的那样：

        .text
        .file   "test.c"
        .globl  main                    # -- Begin function main
        .p2align        4, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# %bb.0:
        movl    00000000, %eax       # imm = 0x3B9ACA00
        retq
.Lfunc_end0:
        .size   main, .Lfunc_end0-main
        .cfi_endproc
                                        # -- End function

        .ident  "clang version 8.0.0 (tags/RELEASE_800/final)"
        .section        ".note.GNU-stack","",@progbits
        .addrsig

注意main的内容：

# %bb.0:
        movl    00000000, %eax       # imm = 0x3B9ACA00
        retq

这将完全删除循环，只删除 returns 1000000000。

解决这个问题的一个技巧是使用 volatile:

int main(void)
{
    volatile int n = 0;

    while(n != 1000000000) {
        n += 1;
    }

    return n;
}

输出（再次使用 -O3）：

        .text
        .file   "test.c"
        .globl  main                    # -- Begin function main
        .p2align        4, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# %bb.0:
        movl    [=13=], -4(%rsp)
        movl    -4(%rsp), %ecx
        movl    -4(%rsp), %eax
        cmpl    00000000, %ecx       # imm = 0x3B9ACA00
        je      .LBB0_3
        .p2align        4, 0x90
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        addl    , %eax
        movl    %eax, -4(%rsp)
        movl    -4(%rsp), %ecx
        movl    -4(%rsp), %eax
        cmpl    00000000, %ecx       # imm = 0x3B9ACA00
        jne     .LBB0_1
.LBB0_3:
        retq
.Lfunc_end0:
        .size   main, .Lfunc_end0-main
        .cfi_endproc
                                        # -- End function

        .ident  "clang version 8.0.0 (tags/RELEASE_800/final)"
        .section        ".note.GNU-stack","",@progbits
        .addrsig

Answer 2

您有 3 个案例：

启用优化的 C：用其结果替换循环：mov eax, 1000000000 / ret。你测量的时间都是开销。编译器可以很容易地证明 n 必须具有该值才能退出循环，并且循环不是无限的。
C with optimization disabled: 将 C 变量保存在内存中，包括循环计数器。在现代 x86 CPU 上，存储转发延迟的循环瓶颈是每次迭代大约 6 个周期。 (https://agner.org/optimize/)
循环（低效）但至少将值保存在寄存器中的手写 asm。所以循环携带的依赖链（递增n）只有1个循环延迟。

并且由于您使用 add rax,5，您只执行了 n++ 循环迭代的 1/5。您可以将其视为按 5 展开，然后将 5x n++ 优化为 n+=5。您可以根据需要将此因子设置得尽可能大，并使运行ning 时间任意小，直到您像编译器那样达到 mov eax, 1000000000。

参见 the first 2 on Godbolt，其中我使用了 clang -O3 和 gcc -O0。请注意，int n 是标准 x86-64 ABI 中的 32 位变量，因此无需为 64 位操作数大小花费额外的代码大小（REX 前缀）。

请参阅了解为什么高效的简单循环在底部有条件分支而没有无条件分支。请注意，这就是 gcc 甚至在 -O0 处所做的事情（以 jmp 进入循环到底部的循环条件）。

Clang 在 -O0 处编写了更简单的代码，它与 C 具有相同的结构，在顶部有一个中断条件，在底部有一个无条件的 jmp，就像你手写的 asm .

所以你的 asm 应该比反优化的 C 编译器输出快大约 6 * 5 倍，如果它不能达到运行，或者 的一半您的 NASM 循环每次迭代 1 个时钟周期。实际上，您测得的系数为 13.333，非常接近 15。

所以您可能在 Haswell 之前有 Intel，在 Ryzen 之前有 AMD。如果至少有一个分支未被采用，则较新的 CPU 具有每个时钟 2 个分支的吞吐量。

或者 Skylake（循环缓冲区被微码更新禁用）和一些前端效果（比如循环跨越 64 字节边界）阻止它以 1 迭代/时钟发布，因为它不能'从 uop 缓存中读取的速度不够快。

同一个C程序的三个版本，为什么第一个这么快？

Three versions of the same C program, why is the first one so fast?

c

x86

assembly

reverse-engineering