Race condition when piping through x86-64 assembly program

Question

I wrote the following simplified implementation of cat in assembly. It uses Linux system calls, because I'm running Linux. Here is the code:

.section .data
.set MAX_READ_BYTES, 0xffff

.section .text
.globl _start

_start:
    movq (%rsp), %r10 # save the value of argc somewhere else
    movq 16(%rsp), %r9 # save the value of argv[1] somewhere else

    movl $12, %eax # syscall 12 is brk. see brk(2)
    xorq %rdi, %rdi # call with 0 as first arg to get current end of memory
    syscall
    movq %rax, %r8 # this is the address of the current end of memory

    leaq MAX_READ_BYTES(%rax), %rdi # let this be the new end of memory
    movl $12, %eax # syscall 12, brk
    syscall
    cmp %r8, %rax # compare the two; if the allocation failed, these will be equal
    je exit

    leaq -MAX_READ_BYTES(%rax), %r13 # store the start of the free area in %r13

    movq %r10, %rdi # retrieve the value of argc
    cmpq $0x01, %rdi # if there are no cli args, process stdin instead
    je stdin

    # open the file
    movl $0x02, %eax # syscall #2 = open.
    movq %r9, %rdi
    movl $0, %esi # second argument: flags. 0 means read-only.
    xorq %rdx, %rdx # this argument isn't used here, but zero it out for peace of mind.
    syscall # returns the file descriptor number in %rax
    movl %eax, %edi
    movl %edi, %r12d # first argument: file descriptor.
    call read_and_write
    jmp cleanup

stdin:
    movl $0x0000, %edi # first argument: file descriptor.
    movl %edi, %r12d # first argument: file descriptor.
    call read_and_write
    jmp cleanup

read_and_write:
    # read the file.
    movl $0, %eax # syscall #0 = read.
    movl %r12d, %edi
    movq %r13 /* pointer to allocated memory */, %rsi # second argument: address of a writeable buffer.
    movl $MAX_READ_BYTES, %edx # third argument: number of bytes to write.
    syscall # num bytes read in %rax
    movl %eax, %r15d

    # print the file
    movl $1, %eax # syscall #1 = write.
    movl $1, %edi # first argument: file descriptor. 1 is stdout.
    movq %r13, %rsi # second argument: address of data to write.
    movl %r15d, %edx # third argument: number of bytes to write.
    syscall # result ignored.
    cmpq $MAX_READ_BYTES, %r15
    je read_and_write
    ret

cleanup:
    # close the file
    movl $0x03, %eax # syscall #3 = close.
    movl %r14d, %edi # first arg: file descriptor number.
    syscall # result ignored.

exit:
    # set the exit code
    movl $60, %eax # syscall #60 = exit.
    movq $0, %rdi # exit 0 = success.
    syscall

I've assembled this into an ELF binary called asmcat. To test the program, I have the file /tmp/random:

$ wc -c /tmp/random
94870 /tmp/random

When I run the following, the result is consistent:

$ ./asmcat /tmp/random | wc -c
94870

These are two different runs of the same command:

$ cat /tmp/random | ./asmcat | wc -c
65536

$ cat /tmp/random | ./asmcat | wc -c
94870

Redirecting the output to a file always produces files of the same size:

for i in {0..25}; do
    cat /tmp/random | ./asmcat > /tmp/asmcat-output-$i
done
for i in {0..25}; do
    wc -c /tmp/asmcat-output-$i
done

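All of the resulting files have the same size, 94870. This leads me to believe that the pipe into wc is what causes the inconsistent behaviour. All my program should do is read stdin 65535 bytes at a time and write it to stdout. There may be a bug in the program, but then why would it consistently produce files of the same size when redirected? So my strong feeling is that something about piping is making the measured size of my assembly program's output inconsistent.

Any feedback is welcome, including on the approach taken in the assembly program (which I just wrote for fun/practice).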

Answer 1

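TL:DR: If your program does two reads before cat can refill the pipe buffer, the second read gets only 1 byte. That makes your program decide to exit prematurely.

That's the real bug. The other design choices that make this possible are performance problems, not correctness problems.

Your program stops after any short read (a return value less than the requested size), instead of waiting for EOF (read() == 0). That simplification is sometimes safe for regular files, but not for anything else: especially not a TTY (terminal input), and not pipes or sockets either. For example, try running ./asmcat with no arguments; it exits after you press return on the first line, instead of waiting for control-D (EOF).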
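
As an illustration of the EOF-terminated loop described above, here is a minimal C sketch (C rather than assembly so the control flow is easy to follow). The function name copy_until_eof and the buffer size are made up for this example and are not part of the original program; to keep it short, a short write is simply treated as an error.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: copy everything from in_fd to out_fd.
 * The loop ends only when read() returns 0 (EOF) or fails. A short read
 * (fewer bytes than requested) only means "that is all that was available
 * right now" and is NOT treated as the end of the input.
 * Returns the total number of bytes copied, or -1 on error. */
long copy_until_eof(int in_fd, int out_fd)
{
    static char buf[1 << 16];          /* 64 KiB, an illustrative choice */
    long total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            return total;              /* EOF: the only normal way out */
        if (n < 0) {
            if (errno == EINTR)
                continue;              /* interrupted by a signal: retry */
            return -1;
        }
        if (write(out_fd, buf, (size_t)n) != n)
            return -1;                 /* short write or error: not handled in this sketch */
        total += n;
    }
}
```

Fed from a pipe that delivers its data in several small pieces, this keeps reading until the writer closes its end, which is the behaviour the original loop lacks.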

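Linux pipe buffers are only 64kiB by default (see the pipe(7) man page), 1 byte larger than the weird odd-sized buffer you're using. After a write from cat fills the pipe buffer, your 65535-byte read leaves 1 byte behind. If your program wins the race to read the pipe again before cat can write again, it reads only that 1 byte.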
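
The leftover byte can be reproduced deterministically inside a single process. The following C sketch is only an illustration under stated assumptions: the function name second_read_size is made up, and it asks the kernel for the pipe capacity with the Linux-specific fcntl(F_GETPIPE_SZ) instead of hard-coding 65536, so it also works where the capacity differs.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* exposes F_GETPIPE_SZ in <fcntl.h> */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef F_GETPIPE_SZ             /* fallback if the macro above came too late */
#define F_GETPIPE_SZ 1032        /* Linux value: F_LINUX_SPECIFIC_BASE + 8 */
#endif

/* Hypothetical demo: fill a fresh pipe to capacity with one write (the role
 * of cat), then read it with a buffer one byte smaller than the capacity
 * (the role of the 65535-byte reads).
 * Returns the size of the second read, or -1 if the setup fails. */
long second_read_size(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    long cap = fcntl(fds[1], F_GETPIPE_SZ);   /* 65536 by default on Linux */
    char *buf = cap >= 2 ? calloc(1, (size_t)cap) : NULL;
    long second = -1;

    /* Non-blocking write end, so an unexpectedly full pipe cannot hang the demo. */
    if (buf != NULL &&
        fcntl(fds[1], F_SETFL, O_NONBLOCK) == 0 &&
        write(fds[1], buf, (size_t)cap) == cap &&
        read(fds[0], buf, (size_t)cap - 1) == cap - 1) {
        /* The writer has not run again, so only the leftover byte is available. */
        second = read(fds[0], buf, (size_t)cap - 1);
    }

    free(buf);
    close(fds[0]);
    close(fds[1]);
    return second;
}
```

In the real pipeline the writer is a separate process, so whether the second read sees 1 byte or a refilled buffer depends on scheduling; that is the race.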

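Unfortunately, running under strace ./asmcat slows down the reads too much to observe a short read, unless you also slow down cat (or whatever other program feeds the pipe) to rate-limit the write side of the input pipe.

Rate-limited input for testing:

pv(1) (pipe viewer) is handy for this, with its -L rate-limit option and -B buffer-size limit, so you can make sure its writes are smaller than 64k. (Infrequent full 64k writes might not always result in short reads.) But if we just want short reads every time, running strace ./asmcat and reading interactively from a terminal is even easier.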

$ pv -L8K -B16K /tmp/random | strace ./orig_asmcat | wc -c
execve("./orig_asmcat", ["./orig_asmcat"], 0x7ffcd441f750 /* 55 vars */) = 0
brk(NULL)                               = 0x61c000
brk(0x62bfff)                           = 0x62bfff
read(0, "=head1 NAME\n\n=for comment  Gener"..., 65535) = 819
write(1, "=head1 NAME\n\n=for comment  Gener"..., 819) = 819
close(0)                                = 0
exit(0)                                 = ?
+++ exited with 0 +++   # end of strace output
819                     # wc output
 819 B 0:00:00 [4.43KiB/s] [>              ]  0%        # pv's progress bar

By contrast, with a bug-fixed asmcat we get the expected sequence of short reads and equal-length writes. (See below for my version.)

execve("./asmcat", ["./asmcat"], 0x7ffd8c58f600 /* 55 vars */) = 0
read(0, "=head1 NAME\n\n=for comment  Gener"..., 65536) = 819
write(1, "=head1 NAME\n\n=for comment  Gener"..., 819) = 819
read(0, "check if a\nnamed variable exists"..., 65536) = 819
write(1, "check if a\nnamed variable exists"..., 819) = 819

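Code review

There are multiple wasted instructions, e.g. a mov that writes a register you never read again, like setting EDI before a call when the function takes its arg in R12D instead of following the standard calling convention.

Reading argc and argv early, instead of leaving them on the stack until they're needed, is similarly redundant.

.section .data is pointless: .set defines an assemble-time constant, and it doesn't matter what the current section is when you define it. You could also write it as MAX_READ_BYTES = 0xffff, the more natural syntax for assemble-time constants.

You could allocate your buffer on the stack instead of with brk (it's only 64K - 1, and x86-64 Linux allows 8MiB stacks by default), in which case loading argc and argv early could make sense. Or just use the BSS, e.g. .lcomm buf, 1<<16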

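It's best to make your buffer a power of 2, or at least a multiple of the page size (4k), for efficiency. If you use it to copy a file, every read after the first will start near the end of a page instead of copying whole 4k pages, so the kernel's copy_to_user (read) and copy_from_user (write) will touch 17 pages of kernel memory per read/write instead of 16. The page cache for file data might not be in contiguous kernel addresses, so each separate 4k page takes some overhead to find and to start a separate memcpy for (rep movsb on modern CPUs with the ERMSB feature). Also, for disk I/O, the kernel will have to buffer your writes back into aligned chunks of some multiple of the hardware sector size and/or filesystem block size.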
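
The 16-versus-17 page count is simple arithmetic on file offsets. A small C sketch of it follows; the name pages_touched and the PAGE_BYTES constant are made up for illustration, and 4 KiB pages are assumed as in the text.

```c
#define PAGE_BYTES 4096UL   /* assumed page size (4 KiB) */

/* Number of distinct pages overlapped by the byte range
 * [offset, offset + len) of a file. */
unsigned long pages_touched(unsigned long offset, unsigned long len)
{
    if (len == 0)
        return 0;
    unsigned long first = offset / PAGE_BYTES;
    unsigned long last  = (offset + len - 1) / PAGE_BYTES;
    return last - first + 1;
}
```

With a 65535-byte buffer, the first read covers pages_touched(0, 65535) = 16 pages, but the second starts at offset 65535, the last byte of page 15, so pages_touched(65535, 65535) = 17. With a 65536-byte buffer every read starts on a page boundary and covers exactly 16 pages.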

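64KiB is clearly a good choice when reading from pipes, for the same reason that this race was possible. Leaving 1 byte behind is obviously inefficient. Also, 64k is smaller than the L2 cache size, so the copy to/from user-space (inside the kernel during your system calls) can re-read from L2 cache when you write the data out again. But smaller sizes mean more system calls, and each system call has significant overhead (especially with Meltdown and Spectre mitigations in modern kernels).

64KiB to 128KiB is about the sweet spot for buffer size, since 256KiB L2 caches are typical. (Related: code golf: Fastest yes in the West tunes a program that only makes write system calls, using x86-64 Linux, with profiling / benchmark results on my Skylake desktop.)

Nothing in machine code benefits from a size fitting in uint16_t, like 0xFFFF; only int8_t and int32_t are relevant for immediate-operand size in 64-bit code. (Or uint32_t if you're zero-extending into RDX with something like mov $imm32, %edx.)

Don't close stdin; you run close unconditionally. Closing stdin doesn't affect the parent process's stdin, so it shouldn't be a problem in this program, but the whole point of the close seems to be to make this more like a function you could use from a larger program. So you should separate copying an fd to stdout from the file handling.

Use #include <asm/unistd.h> to get the call numbers instead of hard-coding them. They're guaranteed stable, but using named constants is more human-readable / self-documenting and avoids any risk of copying errors. (Build with gcc -nostdlib -static asmcat.S -o asmcat; GCC runs .S files through the C preprocessor before assembling, unlike .s)

Style: I like to indent operands to a consistent column so they don't crowd the mnemonics. Similarly, comments should sit comfortably to the right of the operands, so you can scan down the column looking for instructions that access any given register without being distracted by comments on shorter instructions.

Comment content: the instruction itself already says what it does; the comment should describe the semantic meaning. (I don't need comments to remind me of the calling convention, such as system calls leaving their result in RAX, but even if you do, summarizing the system call with its C version is a good reminder of which arg is which. Like open(argv[1], O_RDONLY).)

I also like to remove redundant operand-size suffixes; the register sizes imply the operand-size (just like in Intel syntax). Note that zeroing a 64-bit register only needs xorl; writing a 32-bit register implicitly zero-extends to 64-bit. Your code is sometimes inconsistent about whether things should be 32 or 64-bit. In my rewrite, I used 32-bit wherever I could. (Except for the cmp %rax, %rdx on the write return value, which seemed like a good idea to make 64-bit, although I don't think there's any real reason.)

My rewrite:

I removed the call/ret stuff and just let it fall through into cleanup/exit instead of trying to separate it into "functions".

I also changed the buffer size to exactly 64KiB, allocated on the stack with 4k page alignment, and rearranged things to simplify and save instructions everywhere.

I also added a # TODO comment about short writes. That doesn't seem to happen for pipe writes up to 64k; Linux just blocks the write until there's room in the buffer. But it could be a problem when writing to a socket? Or maybe only with larger sizes, or if a signal like SIGTSTP or SIGSTOP interrupts write().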

#include <asm/unistd.h>
BUFSIZE = 1<<16

.section .text
.globl _start
_start:
    pop  %rax      # argc
    pop  %rdi
    pop  %rdi      # argv[1]
     # you'd only ever want to read args this way in _start, which isn't a function

    and  $-4096, %rsp           # round RSP down to a page boundary.
    sub  $BUFSIZE, %rsp         # reserve 64K buffer aligned by 4k

    dec  %eax      # if argc == 1,  then run with input fd = 0   (stdin)
    jz  .Luse_stdin

    # open argv[1]
    mov     $__NR_open, %eax 
    xor     %esi, %esi     # flags: 0 means read-only.
    xor     %edx, %edx     # mode unused without O_CREAT, but zero it out for peace of mind.
    syscall       # fd = open(argv[1], O_RDONLY)

.Luse_stdin:           # don't use stdin as a symbol name; stdio.h / libc also has one of type FILE*
    mov  %eax, %ebx     # save FD
    mov  %rsp, %rsi     # always read and write the same buffer
    jmp  .Lentry        # start with a read then EOF-check as loop condition
              # since we're now error-checking the write,
              # rotating the loop maybe wasn't helpful after all
              # and perhaps just read at the top so we can fall into it would work equally well

read_and_write:              # do {
    # print the file
    mov     %eax, %edx             # size = read_size
    mov     $__NR_write, %eax      # syscall #1 = write.
    mov     $1, %edi               # output fd always stdout
    #mov     %rsp, %rsi             # buf, done once outside loop
    syscall                        # write(1, buf, read_size)

    cmp     %rax, %rdx             # written size should match request
    jne     cleanup                 # TODO: handle short writes by calling again for the unwritten part of the buffer, e.g. add %rax, %rsi
                                    # but also check for write errors.
.Lentry:
     # read the file.
    mov    $__NR_read, %eax     # xor  %eax, %eax
    mov    %ebx, %edi           # input FD
   # mov    %rsp, %rsi           # done once outside loop
    mov    $BUFSIZE, %edx
    syscall                     # size = read(fd, buf, BUFSIZE)

    test   %eax, %eax
    jg     read_and_write    # }while(read_size > 0);   // until EOF or error
# any negative can be assumed to be an error, since we pass a size smaller than INT_MAX

cleanup:
# fd might be stdin which we don't want to close.
# just exit and let kernel take care of it, or check for fd==0
#    movl $__NR_close, %eax
#    movl %ebx, %edi 
#    syscall          # close (fd)  // return value ignored

exit:
    mov  %eax, %edi             # exit status = last syscall return value. read() = 0 means EOF, success.
    mov  $__NR_exit_group, %eax
    syscall                     # exit_group(status);
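
The # TODO about short writes in the loop above could be handled along the following lines. This is a C sketch of the idea rather than the assembly itself; write_all is a hypothetical helper name, and the pointer advance corresponds to the add %rax, %rsi mentioned in the comment.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes have been
 * written. A short write just means "call again for the rest".
 * Returns 0 on success, -1 on error (errno is left as set by write()). */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted before writing anything: retry */
            return -1;           /* real error, e.g. ENOSPC or EBADF */
        }
        buf += n;                /* skip the part already written */
        len -= (size_t)n;
    }
    return 0;
}
```

In the assembly version this would cost an extra inner loop and a couple of registers to track the remaining length, which is presumably why it was left as a TODO.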

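For instruction counts, perf stat --all-user ./asmcat /tmp/random > /dev/null shows that it runs about 47 instructions in user space, vs. 57 for yours. (IIRC, perf over-counts by 1, so I subtracted that from the measured results.) And that's with more error checking, e.g. for short writes.

This is only 84 bytes of machine code in the .text section (vs. 174 bytes for your original), and I didn't optimize for size over speed by using things like lea 1(%rsi), %eax (after zeroing RSI) instead of mov $1, %eax. (Or mov %eax, %edi to take advantage of __NR_write == STDOUT_FILENO.)

I mostly avoided R8..R15 because they need a REX prefix to access in machine code.

Error-handling test: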

$ gcc -nostdlib -static asmcat.S -o asmcat            # build
$ cat /tmp/random | strace ./asmcat > /dev/full

execve("./asmcat", ["./asmcat"], 0x7ffde5e369d0 /* 55 vars */) = 0
read(0, "=head1 NAME\n\n=for comment  Gener"..., 65536) = 65536
write(1, "=head1 NAME\n\n=for comment  Gener"..., 65536) = -1 ENOSPC (No space left on device)
exit_group(-28)                         = ?
+++ exited with 228 +++

$ strace ./asmcat <&-      # close stdin
execve("./asmcat", ["./asmcat"], 0x7ffd0f5048c0 /* 55 vars */) = 0
read(0, 0x7ffc1b3ca000, 65536)          = -1 EBADF (Bad file descriptor)
exit_group(-9)                          = ?
+++ exited with 247 +++

$ strace ./asmcat /noexist
execve("./asmcat", ["./asmcat", "/noexist"], 0x7ffd429f1158 /* 55 vars */) = 0
open("/noexist", O_RDONLY)              = -1 ENOENT (No such file or directory)
read(-2, 0x7ffd4f296000, 65536)         = -1 EBADF (Bad file descriptor)
exit_group(-9)                          = ?
+++ exited with 247 +++
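
The exit statuses 228 and 247 in these traces follow from the fact that only the low 8 bits of the value passed to exit_group() are reported to the parent. A tiny C sketch of that arithmetic (the function name is made up for illustration):

```c
/* Exit status as the parent process sees it: only the low 8 bits of the
 * value passed to exit_group() survive. */
unsigned visible_exit_status(int status)
{
    return (unsigned)status & 0xffu;
}
```

Here visible_exit_status(-28) is 228 (ENOSPC is errno 28) and visible_exit_status(-9) is 247 (EBADF is errno 9), matching the strace output.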

Hmm, if you want error handling, you should probably test/jl on the fd after open.
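
In C terms, that check after open might look like the following sketch. The helper name open_input and its "negative errno" return convention are assumptions made for this example; they mirror what the raw system call returns in RAX.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

/* Hypothetical helper: choose the input fd the way the program does.
 * With no path, use stdin (fd 0); otherwise open the file read-only.
 * Returns the fd, or a negative errno value if open() fails, so the
 * caller can bail out before calling read() on a bad descriptor. */
int open_input(const char *path)
{
    if (path == NULL)
        return 0;                  /* no argument: read from stdin */
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;             /* the equivalent of test/jl after the syscall */
    return fd;
}
```

With this check, the /noexist case would exit with the ENOENT code straight away instead of going on to call read(-2, ...).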
