Compute the doubleword product (signed) of two words given the lower word product

Question

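There is an algorithm to compute the double word product of two (signed) words.

The function muldws1 uses four multiplications and five additions to compute the double word from two words.

Toward the end of that code there is a line commented out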

/* w[1] = u*v;                  // Alternative. */

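This alternative uses five multiplications and four additions, i.e. it trades one addition for one multiplication.

But I think this alternative can be improved. I have not said anything about hardware yet. Let's assume a hypothetical CPU which can compute the low word of the product of two words but not the high word (e.g. for 32-bit words, 32x32 to lower 32). In that case it seems to me that the algorithm can be improved. Here is what I came up with, assuming 32-bit words (the same concept applies to 64-bit words).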

void muldws1_improved(int w[], int32_t x, int32_t y) {
    uint16_t xl = x; int16_t xh = x >> 16;
    uint16_t yl = y; int16_t yh = y >> 16;

    uint32_t lo = x*y;
    int32_t t = xl*yh + xh*yl;

    uint16_t tl = t; int16_t th = t >>16;
    uint16_t loh = lo >> 16;

    int32_t cy = loh<tl; //carry
    int32_t hi = xh*yh + th + cy;
    w[0] = hi; w[1] = lo;
}

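This uses four multiplications, three additions, and one comparison. That is a smaller improvement than I had hoped for.

Can this be improved? Is there a better way to determine the carry? I should point out that I am also assuming the hardware has no carry flag (e.g. no ADDC instruction) but that words can be compared (e.g. word1<word2).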
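The comparison trick referred to here (recovering a carry without an ADDC-style instruction) can be sketched as follows. This is only a minimal illustration, assuming unsigned 32-bit words whose addition wraps around; the helper name add_carry_out is hypothetical and does not come from the code in the question.

```c
#include <stdint.h>

/* Hypothetical helper: add two unsigned words and report the carry-out
   using only a compare, as a CPU without a carry flag would have to. */
static uint32_t add_carry_out(uint32_t a, uint32_t b, uint32_t *carry) {
    uint32_t sum = a + b;   /* wraps modulo 2^32 */
    *carry = sum < a;       /* the sum wrapped iff it is smaller than an operand */
    return sum;
}
```

The same test works on half words, which is what the loh<tl comparison in the function above relies on.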

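Edit: as Sander De Dycker pointed out, my function fails the unit tests. Here is a version which passes the unit tests but is less efficient. I think it can still be improved.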

void muldws1_improved_v2(int w[], int32_t x, int32_t y) {
    uint16_t xl = x; int16_t xh = x >> 16;
    uint16_t yl = y; int16_t yh = y >> 16;

    uint32_t lo = x*y;
    int32_t  t2 = xl*yh;
    int32_t  t3 = xh*yl;
    int32_t  t4 = xh*yh;

    uint16_t t2l = t2; int16_t t2h = t2 >>16;
    uint16_t t3l = t3; int16_t t3h = t3 >>16;
    uint16_t loh = lo >> 16;

    uint16_t t = t2l + t3l;
    int32_t carry = (t<t2l) + (loh<t);
    int32_t hi = t4 + t2h + t3h + carry;
    w[0] = hi; w[1] = lo;
}

This uses four multiplications, five additions, and two comparisons, which is worse than the original function.
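For checking candidate functions like the two above, one option is a reference built on a 64-bit integer type. The sketch below is not from Hacker's Delight; the name muldws_ref is hypothetical, and it assumes the compiler provides int64_t and that converting an out-of-range uint32_t to int32_t wraps (implementation-defined in C, but the usual behavior).

```c
#include <stdint.h>

/* Reference: signed 32x32 -> 64 product split into w[0] = high word,
   w[1] = low word. Assumes a 64-bit integer type is available. */
static void muldws_ref(int32_t w[2], int32_t x, int32_t y) {
    int64_t p = (int64_t)x * (int64_t)y;      /* exact, cannot overflow */
    uint64_t bits = (uint64_t)p;              /* two's complement bit pattern */
    w[0] = (int32_t)(uint32_t)(bits >> 32);   /* high word */
    w[1] = (int32_t)(uint32_t)bits;           /* low word  */
}
```

Comparing w[0] and w[1] from a candidate against this reference over edge cases such as INT32_MIN, -1, and 0x7FFFFFFF is a quick way to catch a missed carry.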

Answer 1

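The muldws1_improved function in my question has two problems. One is that it misses a carry when I compute xl*yh + xh*yl. That is why it failed the unit tests. ~~But the other is that there are signed*unsigned products which require a lot more machine logic than what is seen in the C code.~~ (See my edit below.) I found a better solution, which is to first optimize the unsigned product function muldwu1 and then do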

muldwu1(w,x,y);
w[0] -= ((x<0) ? y : 0)  + ((y<0) ? x : 0);

to correct for the sign.
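The correction works because a negative x has the unsigned bit pattern x + 2^32, so the unsigned product contains an extra 2^32 * y that has to be removed from the high word (and likewise for a negative y). A minimal sketch of that step follows, assuming a 64-bit type is available to stand in for muldwu1; the names mulhu_ref and mulhs_from_mulhu are hypothetical.

```c
#include <stdint.h>

/* Unsigned high word via 64-bit arithmetic (stand-in for muldwu1). */
static uint32_t mulhu_ref(uint32_t x, uint32_t y) {
    return (uint32_t)(((uint64_t)x * y) >> 32);
}

/* Signed high word derived from the unsigned one using the correction
   above. All arithmetic is done modulo 2^32 on unsigned values, so
   there is no signed overflow. */
static uint32_t mulhs_from_mulhu(int32_t x, int32_t y) {
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t hi = mulhu_ref(ux, uy);
    hi -= ((x < 0) ? uy : 0u) + ((y < 0) ? ux : 0u);
    return hi;  /* bit pattern of the signed high word */
}
```

The low word needs no correction, since the signed and unsigned products agree modulo 2^32.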

Here is my attempt at improving muldwu1 using the low word lo = x*y (yes, this function passes the unit tests from Hacker's Delight).

void muldwu1_improved(uint32_t w[], uint32_t x, uint32_t y) {
    uint16_t xl = x; uint16_t xh = x >> 16;
    uint16_t yl = y; uint16_t yh = y >> 16;

    uint32_t lo   = x*y;    //32x32 to 32
    uint32_t t1   = (uint32_t)xl*yh;  //16x16 to 32
    uint32_t t2   = (uint32_t)xh*yl;  //16x16 to 32
    uint32_t t3   = (uint32_t)xh*yh;  //16x16 to 32

    uint32_t t    = t1 + t2;
    uint32_t tl   = 0xFFFF & t;
    uint32_t th   = t >> 16;
    uint32_t loh  = lo >> 16;

    uint32_t cy   = ((t<t1) << 16) + (loh<tl); //carry
             w[1] = lo;
             w[0] = t3 + th + cy;
}

This uses one fewer addition than the original function from Hacker's Delight, but it has to do two comparisons.

 1 mul32x32 to 32
 3 mul16x16 to 32
 4 add32
 5 shift logical (or shuffles)
 1 and
 2 compare32
***********
16 operations
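Since the question notes that the same concept applies to 64-bit words, here is a sketch of how muldwu1_improved might look with 64-bit words and 32-bit halves. It is a direct transcription under that assumption rather than code from the original post; the name muldwu1_improved64 is hypothetical, and the halves are kept in uint64_t variables so that each 32x32 product is computed at full 64-bit width.

```c
#include <stdint.h>

/* 64x64 -> 128 unsigned product: w[0] = high word, w[1] = low word. */
static void muldwu1_improved64(uint64_t w[2], uint64_t x, uint64_t y) {
    uint64_t xl = x & 0xFFFFFFFFu, xh = x >> 32;
    uint64_t yl = y & 0xFFFFFFFFu, yh = y >> 32;

    uint64_t lo  = x * y;       /* 64x64 to low 64 */
    uint64_t t1  = xl * yh;     /* 32x32 to 64 */
    uint64_t t2  = xh * yl;     /* 32x32 to 64 */
    uint64_t t3  = xh * yh;     /* 32x32 to 64 */

    uint64_t t   = t1 + t2;     /* may wrap; the wrap is detected below */
    uint64_t tl  = t & 0xFFFFFFFFu;
    uint64_t th  = t >> 32;
    uint64_t loh = lo >> 32;

    /* carry out of t1 + t2, plus carry out of the low word's upper half */
    uint64_t cy  = ((uint64_t)(t < t1) << 32) + (loh < tl);
    w[1] = lo;
    w[0] = t3 + th + cy;
}
```

For x = y = 0xFFFFFFFFFFFFFFFF this yields a high word of 0xFFFFFFFFFFFFFFFE and a low word of 1, which exercises both carries.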

Edit:

A statement in Hacker's Delight (2nd edition) about the mulhs and mulhu algorithms bothered me.

The algorithm requires 16 basic RISC instructions in either the signed or unsigned version, four of which are multiplications.

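I implemented the unsigned algorithm, but my signed version needed more instructions. I figured out why, and now I can answer my own question.

The reason I failed to find a better version than the one in Hacker's Delight is that their hypothetical RISC processor has an instruction which computes the low word of the product of two words. In other words, their algorithm is already optimized for this case, so it is unlikely that there is a better version than the one they already have.

The reason they list the alternative is that they assume multiplication (and division) may be more expensive than other instructions, so they left the alternative as a case to optimize for.

So the C code is not hiding significant machine logic. It assumes the machine can do word * word to lower word.

Why does this matter? In their algorithm they first do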

u0 = u >> 16;

and later

t = u0*v1 + k;

If u = 0x80000000 then u0 = 0xffff8000. However, if your CPU can only take a half-word product to get a full word, the upper half word of u0 is ignored and you get the wrong signed result.
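The difference can be made concrete with a small sketch. It assumes that >> on a negative int32_t is an arithmetic shift (implementation-defined in C, but typical), and it models the two kinds of multiplier with hypothetical helpers mul_word and mul_halfword that are not part of the book's code.

```c
#include <stdint.h>

/* Full-word multiply: uses all 32 bits of both operands, keeps the low 32. */
static uint32_t mul_word(uint32_t a, uint32_t b) {
    return a * b;
}

/* Hypothetical half-word multiplier: only the low 16 bits of each
   operand are seen, giving an unsigned 16x16 -> 32 product. */
static uint32_t mul_halfword(uint32_t a, uint32_t b) {
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}

/* u0 = u >> 16 as in the snippet above; assumes an arithmetic shift,
   so the sign bit of u is copied into the upper half word. */
static uint32_t high_half_sign_extended(int32_t u) {
    return (uint32_t)(u >> 16);
}
```

With u = 0x80000000 and a second operand of 3, the full-word multiply sees -32768 * 3 and returns 0xFFFE8000, while the half-word multiply sees 0x8000 * 3 and returns 0x00018000, so the sign information is lost.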

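In this case you should compute the unsigned upper word and then correct it using hi -= ((x<0) ? y : 0) + ((y<0) ? x : 0); as I already stated.

The reason I am interested in this is that Intel's SIMD instructions (SSE2 through AVX2) do not have an instruction which does 64x64 to 64; they only have 32x32 to 64. That is why my signed version requires more instructions.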
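To illustrate why the low word is not free on such hardware, here is a scalar sketch of building a 64x64 to low 64 product from 32x32 to 64 multiplies. It is an illustration of the arithmetic only, not SIMD code; the helper mul32x32_64 is a hypothetical stand-in for one lane of a widening multiply, and mullo64 is likewise a made-up name.

```c
#include <stdint.h>

/* Stand-in for one lane of a 32x32 -> 64 unsigned multiply. */
static uint64_t mul32x32_64(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

/* Low 64 bits of a 64x64 product built from three 32x32 -> 64 multiplies. */
static uint64_t mullo64(uint64_t x, uint64_t y) {
    uint32_t xl = (uint32_t)x, xh = (uint32_t)(x >> 32);
    uint32_t yl = (uint32_t)y, yh = (uint32_t)(y >> 32);
    /* Only the low 32 bits of the cross terms survive the shift. */
    uint64_t cross = mul32x32_64(xl, yh) + mul32x32_64(xh, yl);
    /* xh*yh contributes only to the high word, so it is not needed here. */
    return mul32x32_64(xl, yl) + (cross << 32);
}
```

So a lo = x*y step that costs one instruction on the hypothetical RISC machine costs three multiplies plus a shift and two adds in this model.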

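But AVX512 has a 64x64 to 64 instruction. Therefore, with AVX512 the signed version should take the same number of instructions as the unsigned version. However, since the 64x64 to 64 instruction may be much slower than the 32x32 to 64 instruction, it might make more sense to do the unsigned version anyway and then correct.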


c

optimization