Comparing two values in the form (a + sqrt(b)) as fast as possible?

Question

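As part of a program that I'm writing, I need to compare two values in the form a + sqrt(b), where a and b are unsigned integers. As this is part of a tight loop, I'd like this comparison to run as fast as possible. (If it matters, I'm running the code on x86-64 machines, and the unsigned integers are no larger than 10^6. Also, I know for a fact that a1<a2.)

As a stand-alone function, this is what I'm trying to optimize. My numbers are integers small enough that double (or even float) can represent them exactly, but rounding error in the sqrt results must not change the outcome.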

// known pre-condition: a1 < a2  in case that helps
bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    return a1+sqrt(b1) < a2+sqrt(b2);  // computed mathematically exactly
}

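Test case: is_smaller(900000, 1000000, 900001, 998002) should return true, but as shown in the comments by @wim, computing it with sqrtf() would return false. So would using (int)sqrt() to truncate back to integer.

a1+sqrt(b1) = 901000 and a2+sqrt(b2) = 901000.00050050037512481206. The nearest float to the latter is exactly 901000.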
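
To make the failure mode concrete, here is a small sketch that evaluates the comparison three ways: in single precision, with truncated square roots, and in double precision. The function names are made up for this illustration, and the float variant assumes float expressions are really evaluated in single precision (as they are with SSE on x86-64).

```cpp
#include <cmath>

// Single precision: each sum is rounded to float before the comparison.
bool is_smaller_float(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    float lhs = static_cast<float>(a1) + std::sqrt(static_cast<float>(b1));
    float rhs = static_cast<float>(a2) + std::sqrt(static_cast<float>(b2));
    return lhs < rhs;
}

// Truncation: each square root is cut down to an integer before adding.
bool is_smaller_trunc(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    return a1 + static_cast<unsigned>(std::sqrt(static_cast<double>(b1)))
         < a2 + static_cast<unsigned>(std::sqrt(static_cast<double>(b2)));
}

// Double precision: enough mantissa bits to keep the 0.0005 difference.
bool is_smaller_double(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    return a1 + std::sqrt(static_cast<double>(b1))
         < a2 + std::sqrt(static_cast<double>(b2));
}
```

For is_smaller(900000, 1000000, 900001, 998002), the float sum rounds to 901000 on both sides (floats near 901000 are spaced 0.0625 apart) and the truncated sum is 901000 on both sides, so both variants report false; only the double version keeps the difference.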

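Since the sqrt() function is generally quite expensive, even on modern x86-64 when fully inlined as a sqrtsd instruction, I'm trying to avoid calling sqrt() as far as possible.

Removing the sqrt by squaring potentially also avoids any danger of rounding errors, by making all computation exact.

If instead the function were something like this...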

bool is_smaller(unsigned a1, unsigned b1, unsigned x) {
    return a1+sqrt(b1) < x;
}

...then I could just do return x>=a1 && static_cast<uint64_t>(x-a1)*(x-a1)>b1;
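
Spelled out as a function, the single-sqrt case might look like the sketch below. The name is_smaller_than_int is made up here; the comparison against a1 is done before subtracting because the operands are unsigned and x-a1 would otherwise wrap around.

```cpp
#include <cstdint>

// Exact test of a1 + sqrt(b1) < x using only integer arithmetic.
bool is_smaller_than_int(unsigned a1, unsigned b1, unsigned x) {
    if (x <= a1) {
        return false;   // a1 + sqrt(b1) >= a1 >= x
    }
    // d > 0 here, and d*d cannot overflow 64 bits for 32-bit inputs.
    const std::uint64_t d = static_cast<std::uint64_t>(x) - a1;
    return d * d > b1;  // sqrt(b1) < d  <=>  b1 < d*d
}
```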

But now, since there are two sqrt(...) terms, I cannot do the same algebraic manipulation.

I can square the values twice, by using this formula:

      a1 + sqrt(b1) = a2 + sqrt(b2)
<==>  a1 - a2 = sqrt(b2) - sqrt(b1)
<==>  (a1 - a2) * (a1 - a2) = b1 + b2 - 2 * sqrt(b1) * sqrt(b2)
<==>  (a1 - a2) * (a1 - a2) = b1 + b2 - 2 * sqrt(b1 * b2)
<==>  (a1 - a2) * (a1 - a2) - (b1 + b2) = - 2 * sqrt(b1 * b2)
<==>  ((b1 + b2) - (a1 - a2) * (a1 - a2)) / 2 = sqrt(b1 * b2)
<==>  ((b1 + b2) - (a1 - a2) * (a1 - a2)) * ((b1 + b2) - (a1 - a2) * (a1 - a2)) / 4 = b1 * b2

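Unsigned division by 4 is cheap because it is just a bit shift, but since I square the numbers twice I will need to use 128-bit integers, and I will need to introduce a few >=0 checks (because I'm comparing for inequality rather than equality).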
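
For reference, one possible arrangement of those sign checks is sketched below. It is an illustration rather than a tuned solution, and it leans on both stated assumptions: a1 < a2 and all inputs at most 10^6. The name is_smaller_exact is made up. Under those assumptions the early exits leave the squared term no larger than 2*10^6, so in this particular arrangement 64-bit arithmetic happens to be enough; without the bounds, wider integers would be needed.

```cpp
#include <cstdint>

// Exact a1 + sqrt(b1) < a2 + sqrt(b2), assuming a1 < a2 and inputs <= 10^6.
bool is_smaller_exact(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    const std::int64_t d = static_cast<std::int64_t>(a2) - a1;  // > 0 by assumption
    if (b1 <= b2) {
        return true;    // sqrt(b1) - sqrt(b2) <= 0 < d
    }
    // Now 0 < sqrt(b1) - sqrt(b2), so squaring sqrt(b1) - sqrt(b2) < d is safe:
    //   b1 + b2 - 2*sqrt(b1*b2) < d*d   <=>   t < 2*sqrt(b1*b2)
    const std::int64_t t = static_cast<std::int64_t>(b1) + b2 - d * d;
    if (t < 0) {
        return true;    // a negative number is below a non-negative one
    }
    // 0 <= t <= 2*10^6, so both products below fit in 64 bits.
    return t * t < 4 * static_cast<std::int64_t>(b1) * b2;
}
```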

It feels like there might be a way to do this faster by applying better algebra to this problem. Is there a way to do this faster?

Answer 1

I'm tired and probably made a mistake; but I'm sure that if I did, someone will point it out..

bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    long long a_diff = (long long)a1 - a2;   // May be negative
    double temp;

    if(a_diff < 0) {
        if(b1 < b2) {
            return true;
        }
        temp = a_diff+sqrt(b1);
        if(temp < 0) {
            return true;
        }
        return temp*temp < b2;
    } else {
        if(b1 >= b2) {
            return false;
        }
    }
//  return a_diff+sqrt(b1) < sqrt(b2);

    temp = a_diff+sqrt(b1);
    return temp*temp < b2;
}

If you know a1 < a2 then it could become:

bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
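    unsigned long long a_diff = a2-a1;    // Will be positive
    double temp;

    if(b1 <= b2) {
        return true;     // sqrt(b1) <= sqrt(b2) < a_diff+sqrt(b2)
    }
    if(b1 < a_diff*a_diff) {
        return true;     // sqrt(b1) < a_diff <= a_diff+sqrt(b2)
    }
    temp = a_diff+sqrt(b2);
    return b1 < temp*temp;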
}

Answer 2

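Here's a version without sqrt, though I'm not sure whether it is faster than a version which has only one sqrt (it may depend on the distribution of the values).

Here's the math (how to remove both sqrts):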

ad = a2-a1
bd = b2-b1

a1+sqrt(b1) < a2+sqrt(b2)              // subtract a1
   sqrt(b1) < ad+sqrt(b2)              // square it
        b1  < ad^2+2*ad*sqrt(b2)+b2    // arrange
   ad^2+bd  > -2*ad*sqrt(b2)

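Here, the right side is never positive (ad > 0 because a1 < a2). If the left side is positive, then we have to return true.

If the left side is negative or zero, both sides are non-positive, so squaring reverses the inequality: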

ad^4+bd^2+2*bd*ad^2 < 4*ad^2*b2
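
As a sanity check on the algebra, here is the question's test case pushed through the two steps above. This is only an illustrative sketch; the namespace and constant names are made up for the example.

```cpp
#include <cstdint>

namespace worked_example {
// Test case from the question: is_smaller(900000, 1000000, 900001, 998002).
constexpr std::int64_t a1 = 900000, b1 = 1000000, a2 = 900001, b2 = 998002;

constexpr std::int64_t ad = a2 - a1;   // 1
constexpr std::int64_t bd = b2 - b1;   // -1998

// First inequality: ad^2 + bd > -2*ad*sqrt(b2). The left side is -1997,
// so the sign alone does not decide it and we have to square.
constexpr std::int64_t left = ad * ad + bd;
static_assert(left < 0, "left side is negative for this input");

// Squared inequality: ad^4 + bd^2 + 2*bd*ad^2 < 4*ad^2*b2.
constexpr std::int64_t lhs = ad * ad * ad * ad + bd * bd + 2 * bd * ad * ad;  // 3988009
constexpr std::int64_t rhs = 4 * ad * ad * b2;                                // 3992008
static_assert(lhs < rhs, "so is_smaller must return true");
}  // namespace worked_example
```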

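The key thing to notice here is that if a2>a1+1000, then is_smaller always returns true (because the maximum value of sqrt(b1) is 1000). If a2<=a1+1000, then ad is a small number, so ad^4 will always fit into 64 bits (there is no need for 128-bit arithmetic). Here's the code: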

bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    int ad = a2 - a1;
    if (ad>1000) {
        return true;
    }

    int bd = b2 - b1;
    if (ad*ad+bd>0) {
        return true;
    }

    int ad2 = ad*ad;

    return (long long int)ad2*ad2 + (long long int)bd*bd + 2ll*bd*ad2 < 4ll*ad2*b2;
}

Edit: As Peter Cordes noticed, the first if is not necessary, because the second if handles it, so the code becomes smaller and faster:

bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    int ad = a2 - a1;
    int bd = b2 - b1;
    if ((long long int)ad*ad+bd>0) {
        return true;
    }

    int ad2 = ad*ad;
    return (long long int)ad2*ad2 + (long long int)bd*bd + 2ll*bd*ad2 < 4ll*ad2*b2;
}

Answer 3

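There's also Newton's method for calculating integer square roots, as described here. Another approach would be to not calculate the square root at all, but to search for floor(sqrt(n)) via binary search... there are "only" 1000 perfect squares less than 10^6. This probably has poor performance, but it would be an interesting approach. I haven't measured any of these, but here are some examples: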

#include <iostream>
#include <array>
#include <algorithm>        // std::lower_bound
#include <cassert>
#include <cmath>            // sqrt
#include <iterator>         // crbegin, crend


bool is_smaller_sqrt(unsigned a1, unsigned b1, unsigned a2, unsigned b2)
{
    return a1 + sqrt(b1) < a2 + sqrt(b2);
}

static std::array<int, 1001> squares;

template <typename C>
void squares_init(C& c)
{
    for (int i = 0; i < static_cast<int>(c.size()); ++i)
        c[i] = i*i;
}

inline bool greater(const int& l, const int& r)
{
    return r < l;
}

inline bool is_smaller_bsearch(unsigned a1, unsigned b1, unsigned a2, unsigned b2)
{
    // return a1 + sqrt(b1) < a2 + sqrt(b2)

    // find floor(sqrt(b1)) - binary search within 1000 elems
    auto it_b1 = std::lower_bound(crbegin(squares), crend(squares), b1, greater).base();

    // find floor(sqrt(b2)) - binary search within 1000 elems
    auto it_b2 = std::lower_bound(crbegin(squares), crend(squares), b2, greater).base();

    return (a2 - a1) > (it_b1 - it_b2);
}

unsigned int sqrt32(unsigned long n)
{
    unsigned int c = 0x8000;
    unsigned int g = 0x8000;

    for (;;) {
        if (g*g > n) {
            g ^= c;
        }

        c >>= 1;

        if (c == 0) {
            return g;
        }

        g |= c;
    }
}

bool is_smaller_sqrt32(unsigned a1, unsigned b1, unsigned a2, unsigned b2)
{
    return a1 + sqrt32(b1) < a2 + sqrt32(b2);
}

int main()
{
    squares_init(squares);

    // now can use is_smaller
    assert(is_smaller_sqrt(1, 4, 3, 1) == is_smaller_sqrt32(1, 4, 3, 1));
    assert(is_smaller_sqrt(1, 2, 3, 3) == is_smaller_sqrt32(1, 2, 3, 3));
    assert(is_smaller_sqrt(1000, 4, 1001, 1) == is_smaller_sqrt32(1000, 4, 1001, 1));
    assert(is_smaller_sqrt(1, 300, 3, 200) == is_smaller_sqrt32(1, 300, 3, 200));
}
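
The Newton iteration mentioned at the top of this answer is not shown in the code above. A minimal sketch of the classic integer form could look like the following; the name isqrt_newton is made up here, and this is the textbook iteration, not necessarily the variant from the linked description.

```cpp
#include <cstdint>

// Integer square root by Newton's iteration: returns floor(sqrt(n)).
std::uint32_t isqrt_newton(std::uint32_t n) {
    if (n == 0) {
        return 0;                           // avoid dividing by zero below
    }
    std::uint64_t x = n;                    // initial guess, always >= sqrt(n)
    std::uint64_t y = (x + n / x) / 2;      // first Newton step
    while (y < x) {                         // the guesses decrease until they settle
        x = y;
        y = (x + n / x) / 2;
    }
    return static_cast<std::uint32_t>(x);
}
```

Note that any floor-based comparison, a1 + floor(sqrt(b1)) < a2 + floor(sqrt(b2)), discards the fractional parts. For the question's test case it compares 901000 with 901000 and reports false, while the exact answer is true.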

Answer 4

Possibly not better than the other answers, but it uses a different idea (and a mass of pre-analysis).

// Compute approximate integer square root of input in the range [0,10^6].
// Uses a piecewise linear approximation to sqrt() with bounded error in each piece:
//   0 <= x <= 784 : x/28
//   784 < x <= 7056 : 21 + x/112
//   7056 < x <= 28224 : 56 + x/252
//   28224 < x <= 78400 : 105 + x/448
//   78400 < x <= 176400 : 168 + x/700
//   176400 < x <= 345744 : 245 + x/1008
//   345744 < x <= 614656 : 336 + x/1372
//   614656 < x <= 1000000 : (784000+x)/1784
// It is the case that sqrt(x) - 7.9992711366390365897... <= pseudosqrt(x) <= sqrt(x).
unsigned pseudosqrt(unsigned x) {
    return 
        x <= 78400 ? 
            x <= 7056 ?
                x <= 784 ? x/28 : 21 + x/112
              : x <= 28224 ? 56 + x/252 : 105 + x/448
          : x <= 345744 ?
                x <= 176400 ? 168 + x/700 : 245 + x/1008
              : x <= 614656 ? 336 + x/1372 : (x+784000)/1784 ;
}

// known pre-conditions: a1 < a2, 
//                  0 <= b1 <= 1000000
//                  0 <= b2 <= 1000000
bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
// Try three refinements:
// 1: a1 + sqrt(b1) <= a1 + 1000, 
//    so is a1 + 1000 < a2 ?  
//    Convert to a2 - a1 > 1000 .
// 2: a1 + sqrt(b1) <= a1 + pseudosqrt(b1) + 8 and
//    a2 + pseudosqrt(b2) <= a2 + sqrt(b2), 
//    so is  a1 + pseudosqrt(b1) + 8 < a2 + pseudosqrt(b2) ?
//    Convert to a2 - a1 > pseudosqrt(b1) - pseudosqrt(b2) + 8 .
// 3: Actually do the work.
//    Convert to a2 - a1 > sqrt(b1) - sqrt(b2)
// Use short circuit evaluation to stop when resolved.
    unsigned ad = a2 - a1;
    return (ad > 1000)
           || (ad > pseudosqrt(b1) - pseudosqrt(b2) + 8)
           || ((int) ad > (int)(sqrt(b1) - sqrt(b2)));
}

(I don't have a compiler handy, so this probably contains a typo or two.)
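
The error bound that step 2 relies on can be checked by brute force over the whole input range. The sketch below restates the approximation with the breakpoints from the comment table (under the made-up name pseudosqrt_ref) and tests p <= sqrt(x) < p + 8 using integers only.

```cpp
#include <cstdint>

// Piecewise-linear lower bound on sqrt(x) for 0 <= x <= 10^6,
// using the breakpoints listed in the comment table of this answer.
unsigned pseudosqrt_ref(unsigned x) {
    return x <= 78400
        ? (x <= 7056 ? (x <= 784 ? x / 28 : 21 + x / 112)
                     : (x <= 28224 ? 56 + x / 252 : 105 + x / 448))
        : (x <= 345744 ? (x <= 176400 ? 168 + x / 700 : 245 + x / 1008)
                       : (x <= 614656 ? 336 + x / 1372 : (x + 784000) / 1784));
}

// Returns true if p <= sqrt(x) < p + 8 holds for every x in [0, 10^6],
// checked without floating point as p*p <= x < (p+8)*(p+8).
bool pseudosqrt_bound_holds() {
    for (std::uint64_t x = 0; x <= 1000000; ++x) {
        const std::uint64_t p = pseudosqrt_ref(static_cast<unsigned>(x));
        if (p * p > x || (p + 8) * (p + 8) <= x) {
            return false;
        }
    }
    return true;
}
```

Each piece is a chord of the sqrt curve between two perfect squares, so it never rises above sqrt(x); the worst-case gap of a chord is 7, and the integer division can lose just under 1 more.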

Answer 5

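I'm not sure that algebraic manipulations, in combination with integer arithmetic, necessarily lead to the fastest solution. You'll need many scalar multiplies in that case (which isn't very fast), and/or branch prediction may fail, which may degrade performance. Obviously you'll have to benchmark to see which solution is fastest in your particular case.

One method to make the sqrt a bit faster is to add the -fno-math-errno option to gcc or clang. In that case the compiler doesn't have to check for negative inputs. With icc this is the default setting.

Further performance improvement is possible by using the vectorized sqrt instruction sqrtpd instead of the scalar sqrt instruction sqrtsd. Peter Cordes has shown that clang is able to auto-vectorize this code, such that it generates this sqrtpd.

However, the success of auto-vectorization depends quite heavily on the right compiler settings and on the compiler that is used (clang, gcc, icc, etc.). With -march=nehalem, or older, clang doesn't vectorize.

More reliable vectorization results are possible with the following intrinsics code, see below. For portability we only assume SSE2 support, which is the x86-64 baseline.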

/* gcc -m64 -O3 -fno-math-errno smaller.c                      */
/* Adding e.g. -march=nehalem or -march=skylake might further  */
/* improve the generated code                                  */
/* Note that SSE2 is guaranteed to exist with x86-64           */
#include<immintrin.h>
#include<math.h>
#include<stdio.h>
#include<stdint.h>

int is_smaller_v5(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    uint64_t a64    =  (((uint64_t)a2)<<32) | ((uint64_t)a1); /* Avoid too much port 5 pressure by combining 2 32 bit integers in one 64 bit integer */
    uint64_t b64    =  (((uint64_t)b2)<<32) | ((uint64_t)b1); 
    __m128i ax      = _mm_cvtsi64_si128(a64);         /* Move integer from gpr to xmm register                  */
    __m128i bx      = _mm_cvtsi64_si128(b64);         
    __m128d a       = _mm_cvtepi32_pd(ax);            /* Convert 2 integers to double                           */
    __m128d b       = _mm_cvtepi32_pd(bx);            /* We don't need _mm_cvtepu32_pd since a,b < 1e6          */
    __m128d sqrt_b  = _mm_sqrt_pd(b);                 /* Vectorized sqrt: compute 2 sqrt-s with 1 instruction   */
    __m128d sum     = _mm_add_pd(a, sqrt_b);
    __m128d sum_lo  = sum;                            /* a1 + sqrt(b1) in the lower 64 bits                     */
    __m128d sum_hi  =  _mm_unpackhi_pd(sum, sum);     /* a2 + sqrt(b2) in the lower 64 bits                     */
    return _mm_comilt_sd(sum_lo, sum_hi);
}


int is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    return a1+sqrt(b1) < a2+sqrt(b2);
}


int main(){
    unsigned a1; unsigned b1; unsigned a2; unsigned b2;
    a1 = 11; b1 = 10; a2 = 10; b2 = 10;
    printf("smaller?  %i  %i \n",is_smaller(a1,b1,a2,b2), is_smaller_v5(a1,b1,a2,b2));
    a1 = 10; b1 = 11; a2 = 10; b2 = 10;
    printf("smaller?  %i  %i \n",is_smaller(a1,b1,a2,b2), is_smaller_v5(a1,b1,a2,b2));
    a1 = 10; b1 = 10; a2 = 11; b2 = 10;
    printf("smaller?  %i  %i \n",is_smaller(a1,b1,a2,b2), is_smaller_v5(a1,b1,a2,b2));
    a1 = 10; b1 = 10; a2 = 10; b2 = 11;
    printf("smaller?  %i  %i \n",is_smaller(a1,b1,a2,b2), is_smaller_v5(a1,b1,a2,b2));

    return 0;
}

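See this Godbolt link for the generated assembly.

In a simple throughput test on Intel Skylake, with compiler options gcc -m64 -O3 -fno-math-errno -march=nehalem, I found a throughput for is_smaller_v5() which was 2.6 times better than that of the original is_smaller(): 6.8 cpu cycles vs 18 cpu cycles, with loop overhead included. However, in a (too?) simple latency test, where the inputs a1, a2, b1, b2 depended on the result of the previous is_smaller(_v5), I didn't see any improvement (39.7 cycles vs 39 cycles).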
