How can I multiply two hex 128 bit numbers in assembly

Question

I have two 128-bit numbers in memory, shown here as hexadecimal bytes (little endian):

x:0x12 0x45 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
y:0x36 0xa1 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

I have to perform an unsigned multiplication of these two numbers, so my new number will be:

z:0xcc 0xe3 0x7e 0x2b 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

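Now, I know that I can move half of the x and y numbers into the rax and rbx registers, do the mul operation, for example, and then do the same with the other half. The problem is that by doing so I would lose the carry, and I have no idea how to avoid that. I have been stuck on this problem for about 4 hours, and the only solution I can see is a conversion to binary (and <-> shl,1).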

Can you give me some input on this problem?
I think the best solution is to process one byte at a time.

Answer 1

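Let μ = 2⁶⁴. Then we can decompose your 128-bit numbers a and b into a = a₁μ + a₂ and b = b₁μ + b₂. We can then compute c = ab using 64 · 64 → 128 bit multiplications by first computing the partial products: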

q₁μ + q₂ = a₂b₂
r₁μ + r₂ = a₁b₂
s₁μ + s₂ = a₂b₁
t₁μ + t₂ = a₁b₁

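and then accumulating them into a 256-bit result (watch for overflow when doing the additions: each sum of 64-bit terms can carry into the next higher term!):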

c = t₁μ³ + (t₂ + s₁ + r₁) μ² + (s₂ + r₂ + q₁) μ + q₂
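The accumulation above can be sketched in C as follows. This is only an illustration, assuming a compiler with the GNU C `unsigned __int128` extension for the 64 · 64 → 128 bit multiplications. The type `u256` and the function `mul_128x128` are hypothetical names, not part of any library. Each column sum is done in a 128-bit temporary so that the carry into the next column is not lost.

```c
#include <stdint.h>

/* Hypothetical 256-bit result type: four 64-bit limbs, least significant first. */
typedef struct { uint64_t w[4]; } u256;

/* Hypothetical helper: a = a1*mu + a2, b = b1*mu + b2, with mu = 2^64.
   Relies on the GNU C extension unsigned __int128. */
static u256 mul_128x128(uint64_t a1, uint64_t a2, uint64_t b1, uint64_t b2)
{
    /* The four 64 x 64 -> 128 bit partial products. */
    unsigned __int128 q = (unsigned __int128)a2 * b2;   /* q1*mu + q2 */
    unsigned __int128 r = (unsigned __int128)a1 * b2;   /* r1*mu + r2 */
    unsigned __int128 s = (unsigned __int128)a2 * b1;   /* s1*mu + s2 */
    unsigned __int128 t = (unsigned __int128)a1 * b1;   /* t1*mu + t2 */

    u256 c;

    /* mu^0 term: q2 */
    c.w[0] = (uint64_t)q;

    /* mu^1 term: s2 + r2 + q1, summed in 128 bits so the carry survives. */
    unsigned __int128 col1 = (unsigned __int128)(uint64_t)s
                           + (uint64_t)r
                           + (uint64_t)(q >> 64);
    c.w[1] = (uint64_t)col1;

    /* mu^2 term: t2 + s1 + r1, plus the carry out of the mu^1 term. */
    unsigned __int128 col2 = (unsigned __int128)(uint64_t)t
                           + (uint64_t)(s >> 64)
                           + (uint64_t)(r >> 64)
                           + (uint64_t)(col1 >> 64);
    c.w[2] = (uint64_t)col2;

    /* mu^3 term: t1 plus the carry out of the mu^2 term.
       This cannot overflow because the full product is below 2^256. */
    c.w[3] = (uint64_t)(t >> 64) + (uint64_t)(col2 >> 64);

    return c;
}
```

With the numbers from the question, `mul_128x128(0, 0x4512, 0, 0xa136)` gives a lowest limb of `0x2b7ee3cc` and zeros elsewhere, which matches the bytes `cc e3 7e 2b` of z. In assembly, the same column sums are usually done with `add` followed by `adc` rather than with a wider temporary.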

Answer 2

As usual, ask a compiler how to do something efficiently: GNU C on 64-bit platforms supports __int128_t and __uint128_t.

__uint128_t mul128(__uint128_t a, __uint128_t b) { return a*b; }

compiles to (gcc6.2 -O3 on Godbolt)

   imul    rsi, rdx        # a_hi * b_lo
   mov     rax, rdi
   imul    rcx, rdi        # b_hi * a_lo
   mul     rdx             # a_lo * b_lo  widening multiply
   add     rcx, rsi        # add the cross products ...
   add     rdx, rcx        # ... into the high 64 bits.
   ret

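Since this targets the x86-64 System V calling convention, a is in RSI:RDI and b is in RCX:RDX. The result is returned in RDX:RAX. Note that this is a 128 × 128 → 128 bit multiply: the result is truncated to the low 128 bits, as for any C integer type.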

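It's pretty nifty that this takes only a single MOV instruction. gcc doesn't need the high-half result of a_upper * b_lower or vice versa, so it can use the faster 2-operand form of IMUL and destroy the high halves of the inputs, since they're only used once.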
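To make the role of each instruction explicit, here is a rough C sketch of the same sequence written in terms of 64-bit halves. The function name `mul128_by_halves` and its parameter order are made up for illustration, and `unsigned __int128` is again the GNU C extension. The cross products only need their low 64 bits because anything above that would land at bit 128 or higher and be truncated anyway.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the gcc output above: a truncating
   128 x 128 -> 128 bit multiply built from 64-bit halves. */
static unsigned __int128 mul128_by_halves(uint64_t a_lo, uint64_t a_hi,
                                          uint64_t b_lo, uint64_t b_hi)
{
    /* imul rsi, rdx / imul rcx, rdi: only the low 64 bits of each
       cross product are kept; uint64_t arithmetic wraps modulo 2^64. */
    uint64_t cross1 = a_hi * b_lo;
    uint64_t cross2 = b_hi * a_lo;

    /* mul rdx: the one widening 64 x 64 -> 128 bit multiply. */
    unsigned __int128 low = (unsigned __int128)a_lo * b_lo;

    /* The two adds: fold the cross products into the high 64 bits. */
    uint64_t hi = (uint64_t)(low >> 64) + cross1 + cross2;

    return ((unsigned __int128)hi << 64) | (uint64_t)low;
}
```

For any inputs this should agree with a plain `a * b` on `unsigned __int128` values built from the same halves. If you need the full 256-bit product instead, all four partial products have to be widening multiplies.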

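With -march=haswell to enable BMI2, gcc uses MULX to avoid even the one MOV.

Sometimes compiler output isn't perfect, but very often the general strategy is a good starting point for optimizing by hand.

Of course, if what you really wanted in the first place was 128-bit multiplication in C, just use the compiler's built-in support for it. That lets the optimizer do its job, often giving better results than if you'd written a couple of parts in inline asm (https://gcc.gnu.org/wiki/DontUseInlineAsm).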

Is there a 128 bit integer in gcc? for GNU C unsigned __int128
https://docs.microsoft.com/en-us/cpp/intrinsics/umul128?view=msvc-170 MSVC's _umul128 does 64x64 => 128-bit multiplication (only on 64-bit CPUs). It takes its args as 64-bit halves and returns the result as two halves.
- including with MSVC intrinsics, but still only for 64-bit CPUs.
An efficient way to do basic 128 bit integer calculations in C++?
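As a rough sketch of how the two compiler families mentioned above could be wrapped behind one interface: the function `mul_64x64` below is a hypothetical name, and the preprocessor test is an assumption about how you detect 64-bit MSVC. It uses `_umul128` from `<intrin.h>` on MSVC and `unsigned __int128` elsewhere.

```c
#include <stdint.h>
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>   /* declares _umul128 on 64-bit MSVC */
#endif

/* Hypothetical wrapper: 64 x 64 -> 128 bit unsigned multiply.
   Returns the low 64 bits and stores the high 64 bits in *hi. */
static uint64_t mul_64x64(uint64_t a, uint64_t b, uint64_t *hi)
{
#if defined(_MSC_VER) && defined(_M_X64)
    return _umul128(a, b, hi);                 /* MSVC intrinsic */
#else
    unsigned __int128 p = (unsigned __int128)a * b;   /* GNU C extension */
    *hi = (uint64_t)(p >> 64);
    return (uint64_t)p;
#endif
}
```

A building block like this is enough to write the four partial products of a full 128 × 128 multiply without any inline asm.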
