为什么 32 字节的循环对齐使代码更快？

Question

看这段代码：

one.cpp:

bool test(int a, int b, int c, int d);

int main() {
        volatile int va = 1;
        volatile int vb = 2;
        volatile int vc = 3;
        volatile int vd = 4;

        int a = va;
        int b = vb;
        int c = vc;
        int d = vd;

        int s = 0;
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        for (int i=0; i<2000000000; i++) {
                s += test(a, b, c, d);
        }

        return s;
}

two.cpp:

bool test(int a, int b, int c, int d) {
        // return a == d || b == d || c == d;
        return false;
}

one.cpp 中有 16 个 nop。您可以 comment/decomment 它们在 16 和 32 之间更改循环入口点的对齐方式。我用 g++ one.cpp two.cpp -O3 -mtune=native.

编译了它们

这是我的问题：

32 对齐版本比 16 对齐版本更快。在 Sandy Bridge 上，差异为 20%；在哈斯韦尔，8%。为什么不同？
对于 32 对齐版本，代码在 Sandy Bridge 上运行的速度相同，与 two.cpp 中的哪个 return 语句无关。我认为 return false 版本至少应该快一点。但是不，完全一样的速度！
如果我从 one.cpp 中删除 volatiles，代码会变慢（Haswell：之前：~2.17 秒，之后：~2.38 秒）。这是为什么？但是，当循环对齐到 32.

仅

32 位对齐版本更快这一事实对我来说很奇怪，因为 Intel® 64 和 IA-32 架构优化参考手册说（第 3-9 页）：

Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16- byte aligned.

另一个小问题：是否有任何技巧可以使此循环仅 32 位对齐（因此其余代码可以继续使用 16 字节对齐）？

注意：我试过编译器 gcc 6、gcc 7 和 clang 3.9，结果相同。

这是带volatile的代码（16/32对齐的代码是一样的，只是地址不同）：

0000000000000560 <main>:
 560:   41 57                   push   r15
 562:   41 56                   push   r14
 564:   41 55                   push   r13
 566:   41 54                   push   r12
 568:   55                      push   rbp
 569:   31 ed                   xor    ebp,ebp
 56b:   53                      push   rbx
 56c:   bb 00 94 35 77          mov    ebx,0x77359400
 571:   48 83 ec 18             sub    rsp,0x18
 575:   c7 04 24 01 00 00 00    mov    DWORD PTR [rsp],0x1
 57c:   c7 44 24 04 02 00 00    mov    DWORD PTR [rsp+0x4],0x2
 583:   00 
 584:   c7 44 24 08 03 00 00    mov    DWORD PTR [rsp+0x8],0x3
 58b:   00 
 58c:   c7 44 24 0c 04 00 00    mov    DWORD PTR [rsp+0xc],0x4
 593:   00 
 594:   44 8b 3c 24             mov    r15d,DWORD PTR [rsp]
 598:   44 8b 74 24 04          mov    r14d,DWORD PTR [rsp+0x4]
 59d:   44 8b 6c 24 08          mov    r13d,DWORD PTR [rsp+0x8]
 5a2:   44 8b 64 24 0c          mov    r12d,DWORD PTR [rsp+0xc]
 5a7:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
 5ac:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
 5b3:   00 00 00 
 5b6:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
 5bd:   00 00 00 
 5c0:   44 89 e1                mov    ecx,r12d
 5c3:   44 89 ea                mov    edx,r13d
 5c6:   44 89 f6                mov    esi,r14d
 5c9:   44 89 ff                mov    edi,r15d
 5cc:   e8 4f 01 00 00          call   720 <test(int, int, int, int)>
 5d1:   0f b6 c0                movzx  eax,al
 5d4:   01 c5                   add    ebp,eax
 5d6:   83 eb 01                sub    ebx,0x1
 5d9:   75 e5                   jne    5c0 <main+0x60>
 5db:   48 83 c4 18             add    rsp,0x18
 5df:   89 e8                   mov    eax,ebp
 5e1:   5b                      pop    rbx
 5e2:   5d                      pop    rbp
 5e3:   41 5c                   pop    r12
 5e5:   41 5d                   pop    r13
 5e7:   41 5e                   pop    r14
 5e9:   41 5f                   pop    r15
 5eb:   c3                      ret    
 5ec:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

没有挥发性：

0000000000000560 <main>:
 560:   55                      push   rbp
 561:   31 ed                   xor    ebp,ebp
 563:   53                      push   rbx
 564:   bb 00 94 35 77          mov    ebx,0x77359400
 569:   48 83 ec 08             sub    rsp,0x8
 56d:   66 0f 1f 84 00 00 00    nop    WORD PTR [rax+rax*1+0x0]
 574:   00 00 
 576:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
 57d:   00 00 00 
 580:   b9 04 00 00 00          mov    ecx,0x4
 585:   ba 03 00 00 00          mov    edx,0x3
 58a:   be 02 00 00 00          mov    esi,0x2
 58f:   bf 01 00 00 00          mov    edi,0x1
 594:   e8 47 01 00 00          call   6e0 <test(int, int, int, int)>
 599:   0f b6 c0                movzx  eax,al
 59c:   01 c5                   add    ebp,eax
 59e:   83 eb 01                sub    ebx,0x1
 5a1:   75 dd                   jne    580 <main+0x20>
 5a3:   48 83 c4 08             add    rsp,0x8
 5a7:   89 e8                   mov    eax,ebp
 5a9:   5b                      pop    rbx
 5aa:   5d                      pop    rbp
 5ab:   c3                      ret    
 5ac:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

Answer 1

这没有回答第 2 点（return a == d || b == d || c == d; 与 return false 的速度相同）。这仍然是一个可能有趣的问题，因为它必须编译多行 uop-cache 指令。

The fact that 32-aligned version is faster, is strange to me, because [Intel's manual says to align to 32]

该优化指南建议是一个非常通用的指南，绝对并不意味着更大的建议永远无济于事。通常情况下不会，填充到 32 弊大于利。（I-缓存未命中、ITLB 未命中以及要从磁盘加载的更多代码字节）。

事实上，很少需要16B 对齐，尤其是在具有uop 缓存的CPU 上。对于可以从循环缓冲区中运行的小循环，它的对齐通常是完全无关紧要的。

16B 作为一个广泛的建议仍然不错，但它并没有告诉您了解几个特定 CPU 上的特定情况所需了解的一切。

编译器通常默认对齐循环分支和函数入口点，但通常不对齐其他分支目标。执行 NOP（和代码膨胀）的成本通常大于未对齐的非循环分支目标的可能成本。

代码对齐有一些直接的和一些间接的影响。直接影响包括 Intel SnB 系列上的 uop 缓存。例如，参见 Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs。

Intel's optimization manual 的另一部分详细介绍了 uop 缓存的工作原理：

2.3.2.2 Decoded ICache:

All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region. (I think this means an instruction that extends past the boundary goes in the uop cache for the block containing its start, rather than end. Spanning instructions have to go somewhere, and the branch target address that would run the instruction is the start of the insn, so it's most useful to put it in a line for that block).

A multi micro-op instruction cannot be split across Ways.

An instruction which turns on the MSROM consumes an entire Way.

Up to two branches are allowed per Way.

A pair of macro-fused instructions is kept as one micro-op.

另见 Agner Fog's microarch guide。他补充道：

An unconditional jump or call always ends a μop cache line

lots of other stuff that that probably isn't relevant here.

此外，如果您的代码不适合 uop 缓存，它就不能运行来自循环缓冲区。

对齐的间接影响包括：

larger/smaller 代码大小（L1I 高速缓存未命中，TLB）。与您的测试无关
BTB（Branch Target Buffer）中哪些分支互为别名。

If I remove volatiles from one.cpp, code becomes slower. Why is that?

较大的指令将最后一条指令跨过 32B 边界推入循环：

 59e:   83 eb 01                sub    ebx,0x1
 5a1:   75 dd                   jne    580 <main+0x20>

因此，如果您不是运行从循环缓冲区 (LSD) 中获取，那么如果没有 volatile，其中一个 uop 缓存获取周期只会获得 1 个 uop。

如果 sub/jne 宏融合，这可能不适用。而且我认为只有跨越 64B 边界才会破坏宏融合。

此外，这些不是真实地址。你检查过 linking 之后的地址了吗？如果文本部分的对齐小于 64B，则 linking 之后可能会有 64B 边界。

抱歉，我还没有实际测试过这个来详细说明这个具体案例。关键是，当你在前端遇到诸如 call/ret 之类的东西在一个紧密的循环中时， 对齐变得很重要并且可以变得非常复杂 .所有未来指令的边界交叉与否都会受到影响。不要期望它很简单。如果你读过我的其他答案，你就会知道我通常不是那种会说 "it's too complicated to fully explain" 的人，但可以这样对齐。

另见 Code alignment in one object file is affecting the performance of a function in another object file

在你的情况下，确保内联微型函数。 如果您的代码库在单独的 .c 文件中而不是在可以内联的 .h 中有任何重要的小函数，请使用 link 时间优化。 或者更改您的代码以将它们放入 .h.

为什么 32 字节的循环对齐使代码更快？

Why does loop alignment on 32 byte make code faster?

performance

benchmarking

gcc

x86-64

clang