Why is inline changing the assembly code in this way?

Question

I wrote a very simple C++ program to understand how "inline" works:

inline int square(int x) {
    return x*x;
}

int main() {
    int y = square(1234);
    return y;
}

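I compiled it to assembly, both without and with "inline". Oddly, a function was generated in both cases, but the output differs. Without inline, the code looks like this (most comments removed):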

_Z6squarei:                             # @_Z6squarei
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -4(%rbp)
    movl    -4(%rbp), %edi
    imull   -4(%rbp), %edi
    movl    %edi, %eax
    popq    %rbp
    retq
.Lfunc_end0:

main:                                   # @main
    pushq   %rbp
    movq    %rsp, %rbp
    subq    $16, %rsp
    movl    $1234, %edi             # imm = 0x4D2
    movl    $0, -4(%rbp)
    callq   _Z6squarei
    movl    %eax, -8(%rbp)
    movl    -8(%rbp), %eax
    addq    $16, %rsp
    popq    %rbp
    retq
.Lfunc_end1:

With inline, it looks like this:

main:                                   # @main
    .cfi_startproc
    pushq   %rbp
.Lcfi0:
    .cfi_def_cfa_offset 16
.Lcfi1:
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
.Lcfi2:
    .cfi_def_cfa_register %rbp
    subq    $16, %rsp
    movl    $1234, %edi             # imm = 0x4D2
    movl    $0, -4(%rbp)
    callq   _Z6squarei
    movl    %eax, -8(%rbp)
    movl    -8(%rbp), %eax
    addq    $16, %rsp
    popq    %rbp
    retq
.Lfunc_end0:

_Z6squarei:                             # @_Z6squarei
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -4(%rbp)
    movl    -4(%rbp), %edi
    imull   -4(%rbp), %edi
    movl    %edi, %eax
    popq    %rbp
    retq
.Lfunc_end1:

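It is very similar, except for the new "cfi" directives. Why do they appear only when I use "inline"?

Second question: is there a way to tell the compiler to really inline this function? (I'm using clang++-5.0.)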

Answer 1

unsigned int fun0 ( unsigned int );

static unsigned int fun1 ( unsigned int x )
{
    return(x+1);
}

unsigned int fun2 ( unsigned int x )
{
    return(x+2);
}

inline unsigned int fun3 ( unsigned int x )
{
    return(x+3);
}

unsigned int hello ( unsigned int x )
{
    unsigned int y;
    y=fun0(x);
    y=fun1(y);
    y=fun2(y);
    y=fun3(y);
    return(y);
}

Intentionally using a different instruction set:

Disassembly of section .text:

00000000 <fun2>:
   0:   e2800002    add r0, r0, #2
   4:   e12fff1e    bx  lr

00000008 <hello>:
   8:   e92d4010    push    {r4, lr}
   c:   ebfffffe    bl  0 <fun0>
  10:   e8bd4010    pop {r4, lr}
  14:   e2800006    add r0, r0, #6
  18:   e12fff1e    bx  lr

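fun0() is external; the compiler cannot see into it, so it has to set up a call and take the return value.

fun1() is marked static, so we have indicated that we want this function to be local to this object/file/scope. The compiler therefore has no reason to create a function there for others to reach from outside, and since the optimizer can see that the function is in the same file, it chooses to inline it.

fun2() has no special marking, so it is assumed to be global; the compiler therefore has to provide code implementing the function for others to use. At the same time the optimizer can see the function, since it is in the same file, so it inlines it just like fun1.

With fun3() we have told the compiler it may inline this, which somewhat implies it is meant for use within this scope. So, as with static, the compiler did not generate code for global use, and it optimized (inlined) it.

Functionally, hello sends x to fun0(), which turns it into y. Then we add 1+2+3 = 6 to it. So inlining fun1, fun2 and fun3 amounts to simply adding 6 to the output of fun0(), and that is what we see: fun1(), fun2() and fun3() are inlined.

Perhaps the confusion here is over what inline means. Inline means exactly that: in line. Do not call the function; include its functionality in line with the caller.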
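To make the "+6" concrete, here is a minimal sketch of the same four functions next to the folded form the optimizer effectively produces. The body of fun0() is a made-up stand-in (in the answer it is external and unseen), and hello_folded() and check_folding() are hypothetical names added only for this illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for the external fun0(); in the answer it lives
// in another file, so the compiler cannot see its body.
unsigned int fun0(unsigned int x) { return x * 2u; }

static unsigned int fun1(unsigned int x) { return x + 1; }
unsigned int fun2(unsigned int x) { return x + 2; }
inline unsigned int fun3(unsigned int x) { return x + 3; }

// The chain of calls as written in the source.
unsigned int hello(unsigned int x) {
    unsigned int y = fun0(x);
    y = fun1(y);
    y = fun2(y);
    y = fun3(y);
    return y;
}

// What the optimized code amounts to: one real call, then a single add of 6.
unsigned int hello_folded(unsigned int x) { return fun0(x) + 6; }

// Both forms agree, including on unsigned wrap-around.
void check_folding() {
    const unsigned int samples[] = {0u, 1u, 41u, 0xFFFFFFFFu};
    for (unsigned int x : samples) {
        assert(hello(x) == hello_folded(x));
    }
}
```

Calling check_folding() from a small main() is enough to confirm the two forms match for the sampled inputs.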

unsigned int fun2 ( unsigned int x )
{
    return(x+2);
}

unsigned int hello ( unsigned int x )
{
    return(fun2(x));
}

With the tools I am using, I don't actually need to ask for it to be inlined:

00000000 <fun2>:
   0:   e2800002    add r0, r0, #2
   4:   e12fff1e    bx  lr

00000008 <hello>:
   8:   e2800002    add r0, r0, #2
   c:   e12fff1e    bx  lr

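The optimizer did it anyway: it did not set up a call to fun2. Instead it took what fun2 does, adding 2 to the operand, and simply did that IN LINE in hello.

With your tools, note that the global function is created either way, but when you asked for inlining it does not appear to have actually done anything. Examine the disassembly as well as the assembly; the disassembly is often easier to read and less confusing.

Note, this is my first example built with a C++ compiler, so that I don't get "hey you didnt use a C++ compiler":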

0000000000000000 <_Z4fun2j>:
   0:   8d 47 02                lea    0x2(%rdi),%eax
   3:   c3                      retq   
   4:   66 90                   xchg   %ax,%ax
   6:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
   d:   00 00 00 

0000000000000010 <_Z5helloj>:
  10:   48 83 ec 08             sub    $0x8,%rsp
  14:   e8 00 00 00 00          callq  19 <_Z5helloj+0x9>
  19:   48 83 c4 08             add    $0x8,%rsp
  1d:   83 c0 06                add    $0x6,%eax
  20:   c3                      retq

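Same story: inline and static did not produce global functions for others to use, and the compiler generated a call to the external function and then added 6 to the result.

Note, without optimization: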

00000000 <fun1>:
   0:   e52db004    push    {r11}       ; (str r11, [sp, #-4]!)
   4:   e28db000    add r11, sp, #0
   8:   e24dd00c    sub sp, sp, #12
   c:   e50b0008    str r0, [r11, #-8]
  10:   e51b3008    ldr r3, [r11, #-8]
  14:   e2833001    add r3, r3, #1
  18:   e1a00003    mov r0, r3
  1c:   e28bd000    add sp, r11, #0
  20:   e49db004    pop {r11}       ; (ldr r11, [sp], #4)
  24:   e12fff1e    bx  lr

00000028 <fun2>:
  28:   e52db004    push    {r11}       ; (str r11, [sp, #-4]!)
  2c:   e28db000    add r11, sp, #0
  30:   e24dd00c    sub sp, sp, #12
  34:   e50b0008    str r0, [r11, #-8]
  38:   e51b3008    ldr r3, [r11, #-8]
  3c:   e2833002    add r3, r3, #2
  40:   e1a00003    mov r0, r3
  44:   e28bd000    add sp, r11, #0
  48:   e49db004    pop {r11}       ; (ldr r11, [sp], #4)
  4c:   e12fff1e    bx  lr

00000050 <hello>:
  50:   e92d4800    push    {r11, lr}
  54:   e28db004    add r11, sp, #4
  58:   e24dd010    sub sp, sp, #16
  5c:   e50b0010    str r0, [r11, #-16]
  60:   e51b0010    ldr r0, [r11, #-16]
  64:   ebfffffe    bl  0 <fun0>
  68:   e50b0008    str r0, [r11, #-8]
  6c:   e51b0008    ldr r0, [r11, #-8]
  70:   ebffffe2    bl  0 <fun1>
  74:   e50b0008    str r0, [r11, #-8]
  78:   e51b0008    ldr r0, [r11, #-8]
  7c:   ebfffffe    bl  28 <fun2>
  80:   e50b0008    str r0, [r11, #-8]
  84:   e51b0008    ldr r0, [r11, #-8]
  88:   ebfffffe    bl  0 <fun3>
  8c:   e50b0008    str r0, [r11, #-8]
  90:   e51b3008    ldr r3, [r11, #-8]
  94:   e1a00003    mov r0, r3
  98:   e24bd004    sub sp, r11, #4
  9c:   e8bd4800    pop {r11, lr}
  a0:   e12fff1e    bx  lr

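None of them are inlined... What optimization did you use in your test? What happens if you try optimizing? (llvm/clang gives you more opportunities to optimize than gnu does.)

Edit: using llvm with optimization.

Two separate files: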
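On the second question, one possible way to insist on inlining is sketched below. It assumes Clang or GCC: `__attribute__((always_inline))` is a compiler extension, not standard C++, and compute() is a hypothetical name used here in place of main() so the function can be called from a test.

```cpp
// Assumption: Clang or GCC. always_inline is a compiler extension that
// asks for inlining even when optimization is off; plain `inline` alone
// does not force it.
__attribute__((always_inline)) inline int square(int x) {
    return x * x;
}

// Same body as main() in the question, under a hypothetical name.
int compute() {
    int y = square(1234);
    return y;
}
```

Comparing the output of `clang++ -O0 -S` and `clang++ -O2 -S` on the original program should also show whether the optimization level alone was the difference, since the listings in the question look like an unoptimized build.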

unsigned int fun0 ( unsigned int x )
{
    return(x+7);
}

and this:

unsigned int fun0 ( unsigned int );

inline unsigned int fun3 ( unsigned int x )
{
    return(x+3);
}

unsigned int hello ( unsigned int x )
{
    unsigned int y;
    y=fun0(x);
    y=fun3(y);
    return(y);
}

Unoptimized build:

0000000000000000 <fun0>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   89 7d fc                mov    %edi,-0x4(%rbp)
   7:   8d 47 07                lea    0x7(%rdi),%eax
   a:   5d                      pop    %rbp
   b:   c3                      retq

and

0000000000000000 <hello>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 83 ec 10             sub    $0x10,%rsp
   8:   89 7d fc                mov    %edi,-0x4(%rbp)
   b:   e8 00 00 00 00          callq  10 <hello+0x10>
  10:   89 45 f8                mov    %eax,-0x8(%rbp)
  13:   89 c7                   mov    %eax,%edi
  15:   e8 00 00 00 00          callq  1a <hello+0x1a>
  1a:   89 45 f8                mov    %eax,-0x8(%rbp)
  1d:   48 83 c4 10             add    $0x10,%rsp
  21:   5d                      pop    %rbp
  22:   c3                      retq

Optimized post-compile. I was hoping fun0 would get inlined; oh well, it did optimize hello:

0000000000000000 <fun0>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   8d 47 07                lea    0x7(%rdi),%eax
   7:   5d                      pop    %rbp
   8:   c3                      retq   
   9:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)

0000000000000010 <hello>:
  10:   55                      push   %rbp
  11:   48 89 e5                mov    %rsp,%rbp
  14:   83 c7 07                add    $0x7,%edi
  17:   e8 00 00 00 00          callq  1c <hello+0xc>
  1c:   5d                      pop    %rbp
  1d:   c3                      retq

Compiled with optimization:

0000000000000000 <fun0>:
   0:   8d 47 07                lea    0x7(%rdi),%eax
   3:   c3                      retq   

0000000000000000 <hello>:
   0:   50                      push   %rax
   1:   e8 00 00 00 00          callq  6 <hello+0x6>
   6:   83 c0 03                add    $0x3,%eax
   9:   59                      pop    %rcx
   a:   c3                      retq

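clang gives you different opportunities to optimize.

As the number of files grows, the number of optimization combinations with the llvm tools grows almost exponentially. For larger projects I have found that compiling unoptimized gives the later optimizer more to work with, but of course that depends on many factors and, unfortunately, the number of combinations becomes staggering. If I compile with optimization first, then combine and optimize, I get what I was hoping for: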

0000000000000000 <fun0>:
   0:   8d 47 07                lea    0x7(%rdi),%eax
   3:   c3                      retq   

0000000000000010 <hello>:
  10:   8d 47 0a                lea    0xa(%rdi),%eax
  13:   c3                      retq

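fun3 adds 3 and fun0 adds 7. The call to fun0 was inlined, so from two files, with one function external and one internal and inline, I end up with simply adding 10.

I used C here, but llvm/clang, like gnu, is just a front end to what happens in the middle. As shown above with gnu, it should behave the same regardless of C versus C++ (with respect to automatic optimization or suggested inlining).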
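The final `lea 0xa(%rdi),%eax` can be read as the sketch below. For the sake of a self-contained example both files are merged into one translation unit, which is a simplification of the two-file setup in the answer; hello_folded() and check_sum() are hypothetical names added for illustration.

```cpp
#include <cassert>

// First file in the answer: a plain external function.
unsigned int fun0(unsigned int x) { return x + 7; }

// Second file in the answer.
inline unsigned int fun3(unsigned int x) { return x + 3; }

unsigned int hello(unsigned int x) {
    unsigned int y = fun0(x);
    y = fun3(y);
    return y;
}

// What the combined, optimized hello() reduces to: 7 + 3 = 10 (0xa).
unsigned int hello_folded(unsigned int x) { return x + 10; }

// Both forms agree, including on unsigned wrap-around.
void check_sum() {
    const unsigned int samples[] = {0u, 5u, 0xFFFFFFF9u, 0xFFFFFFFFu};
    for (unsigned int x : samples) {
        assert(hello(x) == hello_folded(x));
    }
}
```

In the real two-file case the compiler cannot do this folding on its own when building each file separately, which is why the answer combines the files before the final optimization step.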

c++

assembly

inline