_umul128 在 Windows 32 位

Question

在 Visual C++ 中，当面向 32 位 Windows 时，_umul128 未定义。

针对 Win32 时，如何将两个无符号 64 位整数相乘？该解决方案只需要在面向 32 位 Windows.

的 Visual C++ 2017 上运行

Answer 1

我找到了以下代码（来自 xmrrig），它似乎可以很好地完成工作：

static inline uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand, 
    uint64_t *product_hi) 
{
    // multiplier   = ab = a * 2^32 + b
    // multiplicand = cd = c * 2^32 + d
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d
    uint64_t a = multiplier >> 32;
    uint64_t b = multiplier & 0xFFFFFFFF;
    uint64_t c = multiplicand >> 32;
    uint64_t d = multiplicand & 0xFFFFFFFF;

    //uint64_t ac = a * c;
    uint64_t ad = a * d;
    //uint64_t bc = b * c;
    uint64_t bd = b * d;

    uint64_t adbc = ad + (b * c);
    uint64_t adbc_carry = adbc < ad ? 1 : 0;

    // multiplier * multiplicand = product_hi * 2^64 + product_lo
    uint64_t product_lo = bd + (adbc << 32);
    uint64_t product_lo_carry = product_lo < bd ? 1 : 0;
    *product_hi = (a * c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry;

    return product_lo;
}

Answer 2

这个答案有另一个答案的 xmrrig function 版本，针对 MSVC 32 位模式进行了优化。 原始版本适用于其他编译器，尤其是 clang。

我查看了@Augusto 函数的 MSVC 输出，它真的很糟糕。 对 32x32 => 64b 乘法使用 __emulu 显着改善了它（因为 MSVC 是愚蠢的并且不会针对输入已知的情况优化 uint64_t * uint64_t = uint64_t只能是 32 位，上半部分为零）。其他编译器（gcc 和 clang）生成单个 mul 指令而不是调用辅助函数。 MSVC 的代码生成还有其他问题，但我不知道如何通过调整源代码来解决。我认为如果你想在该编译器上获得良好的性能，你将不得不使用内联 asm（或单独编译的 asm 函数）。

如果您需要更灵活的任意精度（更大的数字），请参阅具有 asm 实现的 GMPlib's low-level functions，而不是尝试从这个 __umul128 构建 256b 乘法。但是，如果您确实需要这个，那么值得尝试。坚持使用 C++ 可以实现常量传播和 CSE 优化，这是 asm 无法实现的。

clang 编译这个没有大问题，实际上使用 adc 进行所有加法运算（除了用 setc 指令保存的指令）。 MSVC 对进位检查进行分支，只会生成令人讨厌的代码。 GCC 也做得不好，有一些分支。（因为 gcc 不知道如何将 carry = sum < a 变成 adc、gcc bug 79173。）

IDK 如果 MSVC 或 gcc 支持 32 位模式下的 64 位整数的任何 add-with-carry 内在函数。 _addcarry_u64 generates poor code with gcc anyway (in 64-bit mode)，但 ICC 可能还行。关于 MSVC 的 IDK。

如果您想要一个 asm 实现，我建议您使用此函数的 clang 5.0 输出。您可能会手动找到一些优化，但它肯定比 MSVC 更好。但是当然 https://gcc.gnu.org/wiki/DontUseInlineAsm 中的大部分论点都适用：如果你乘以内联变成常量的东西，或者乘以已知上半部分为零的数字，那么阻止常量传播是一个主要问题。

Source + asm output for MSVC 32-bit and clang5.0 32-bit on Godbolt

漂亮的代码，有 clang。 MSVC 的代码有点糟糕，但比以前更好。 gcc 也有点糟糕（与其他答案相比没有变化）。

#include <stdint.h>

#ifdef _MSC_VER
#  include <intrin.h>
#else
// MSVC doesn't optimize 32x32 => 64b multiplication without its intrinsic
// But good compilers can just use this to get a single mul instruction
static inline
uint64_t __emulu(uint32_t x, uint32_t y) {
     return x * (uint64_t)y;
}
#endif

// This is still pretty ugly with MSVC, branching on the carry
//  and using XMM store / integer reload to zero a register!
// But at least it inlines 4 mul instructions
//  instead of calling a generic 64x64 => 64b multiply helper function
uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand, 
    uint64_t *product_hi) 
{
    // multiplier   = ab = a * 2^32 + b
    // multiplicand = cd = c * 2^32 + d
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d
    uint64_t a = multiplier >> 32;
    uint64_t b = (uint32_t)multiplier; // & 0xFFFFFFFF;
    uint64_t c = multiplicand >> 32;
    uint64_t d = (uint32_t)multiplicand; // & 0xFFFFFFFF;

    //uint64_t ac = __emulu(a, c);
    uint64_t ad = __emulu(a, d);
    //uint64_t bc = __emulu(b, c);
    uint64_t bd = __emulu(b, d);

    uint64_t adbc = ad + __emulu(b , c);
    uint64_t adbc_carry = (adbc < ad); // ? 1 : 0;
    // MSVC gets confused by the ternary and makes worse code than using a boolean in an integer context for 1 : 0

    // multiplier * multiplicand = product_hi * 2^64 + product_lo
    uint64_t product_lo = bd + (adbc << 32);
    uint64_t product_lo_carry = (product_lo < bd); // ? 1 : 0;
    *product_hi = __emulu(a , c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry;

    return product_lo;
}

确保只在 32 位代码中使用它。在 64 位代码中，它无法优化为单个 64 位 mul 指令（产生完整结果的两个 64 位一半）。实现 GNU C++ 扩展（clang、gcc、ICC）的编译器可以使用 unsigned __int128 并获得好的代码。例如a * (unsigned __int128)b 产生 128b 结果。（Godbolt 上的示例）。

_umul128 在 Windows 32 位

_umul128 on Windows 32 bits

x86

biginteger

multiplication

intrinsics

visual-c++