`__uint128_t` 上最有效的 popcount？

Question

我需要以最有效（最快）的方式 popcnt 大小为 128 位的无符号变量。

OS: Linux/Debian 9
编译器：GCC 8
CPU：英特尔 i7-5775C

尽管如果解决方案更多便携，甚至更好。

首先，GCC中有两种类型，分别是__uint128_t和unsigned __int128。我想他们最终是一样的，并且没有理由写丑陋的 unsigned __int128 东西，所以虽然它应该是新类型，但我更喜欢第一个，它更类似于标准 uint64_t。此外，英特尔有 __uint128_t，这是使用它的另一个原因（便携性）。

我写了下面的代码：

#include <nmmintrin.h>
#include <stdint.h>

static inline   uint_fast8_t    popcnt_u128 (__uint128_t n)
{
    const uint64_t      n_hi    = n >> 64;
    const uint64_t      n_lo    = n;
    const uint_fast8_t  cnt_hi  = _mm_popcnt_u64(n_hi);
    const uint_fast8_t  cnt_lo  = _mm_popcnt_u64(n_lo);
    const uint_fast8_t  cnt     = cnt_hi + cnt_lo;

    return  cnt;
}

这是绝对最快的选择吗？

编辑：

我想到了另一个选项，它可能（或不）更快：

#include <nmmintrin.h>
#include <stdint.h>

union   Uint128 {
    __uint128_t uu128;
    uint64_t    uu64[2];
};

static inline   uint_fast8_t    popcnt_u128 (__uint128_t n)
{
    const union Uint128 n_u     = {.uu128   = n};
    const uint_fast8_t  cnt_a   = _mm_popcnt_u64(n_u.uu64[0]);
    const uint_fast8_t  cnt_b   = _mm_popcnt_u64(n_u.uu64[1]);
    const uint_fast8_t  cnt     = cnt_a + cnt_b;

    return  cnt;
}

这样，虽然我不知道它是否合法（是吗？（编辑：它是：）），但我会避免转变。

Answer 1

使用 GCC 和 clang，如果您删除 static inline，您的两个函数将编译为相同的 asm，并且可能会等效地内联。

我建议使用 unsigned，因为 sizeof(uint_fast8_t) = 1 on x86-64 Linux。 _fast 类型回避了问题 "fast for what purpose"； fast8 适用于数组中的紧凑存储，fast32 是一种 64 位类型，可以避免重做符号或指针数学的零扩展，但会浪费数组中的 space。

clang 知道两个 popcnt 结果的总和适合 8 位整数而不会溢出，因此它可以优化掉零扩展，即使你将结果加到 unsigned 计数器中，但 gcc 不会't。（例如，将 return 类型更改为 unsigned，您将获得额外的 movzx eax, dil 指令。）硬件 popcnt 指令产生的结果正确地零扩展为 64-位，但分配给 uint_fast8_t 又名 uint8_t 是明确要求编译器将结果截断为 8 位。

x86-64 System V ABI 允许 args 和 return 值中的高位垃圾，因此当 return 类型较窄时，函数的独立版本可以允许进位到高位EAX 位。

I would avoid the shift.

移位仅存在于 C 源代码中。在 asm 中，high/low 一半将存储在单独的 64 位寄存器或单独的内存源操作数中。

来自the Godbolt compiler explorer

# gcc8.3 -O3 -march=haswell  for the union and the shift version
popcnt_u128:
    xor     eax, eax    # break popcnt's false dependency on Intel CPUs
    popcnt  rsi, rsi    # _mm_popcnt_u64(n_hi);
    popcnt  rax, rdi    # popcnt(lo)
    add     eax, esi        # clang uses add al,cl  and doesn't avoid false deps except in a loop
    ret                 # return value in AL (low 8 bits of EAX)

GCC 可以通过就地执行两个 popcnts 并使用 lea eax, [rdi + rsi] 来避免异或归零。但是你说了一些关于数组的事情，所以如果数据来自内存，那么 GCC 通常会 mov-load 然后 popcnt 到位以避免错误的依赖。 (Why does breaking the "output dependency" of LZCNT matter?) 或者实际上，它会将目标异或零，然后使用内存源 popcnt，这可能会稍微小一些代码大小。

I don't trust __builtin_popcountll because it uses long long instead of uint64_t. I think it is insane to create a function that deals with bits and uses a type that isn't of fixed width. I don't know what GCC people were thinking about.

实际使用unsigned long long，未签名long long；那会很疯狂。

unsigned long long是至少64位，uint64_t要求正好是64位。（事实上，仅存在于具有恰好 64 位且无填充的类型的 C 实现中；对它的支持是可选的）。我不确定 GNU C 是否支持 unsigned long long 不是 64 位或 uint64_t 不可用的任何目标。甚至 int64_t，也要求是 2 的补码。（如果 GCC 支持任何非 2 的补码目标，则 IDK。）

您可以将输入转换为 uint64_t 以确保没有设置更高的位。从 uint64_t 到 unsigned long long 的隐式转换不会设置任何额外的位，即使在 ULL 比 64 位宽的平台上也是如此。

例如__builtin_popcountll( (uint64_t)n ); 将始终安全地计算 n 的低 64 位，而不管 unsigned long long.

的宽度如何

I'm using a very big static array. Do I have to care about cache, or does GCC handle that for me? I thought that was only a problem with malloc and that stuff. GCC knows the array at compile time, so it can do that better than me.

GCC 将（几乎？）永远不会重新安排您的循环来改变内存访问模式。静态数组与 malloced 内存没有本质区别；他们不会免费在缓存中保持热度。请参阅 What Every Programmer Should Know About Memory? 了解更多信息。

但是，如果您只是按顺序循环遍历内存并对整个数组进行弹出计数，那么是否使用 __uint128_t 并不重要。

clang 将使用 AVX2 vpshufb（作为半字节 LUT）在数组上自动矢量化 __builtin_popcntll 或 _mm_popcnt_u64，这在包括 Broadwell 在内的英特尔 CPU 上非常有用。参见

但不幸的是，对 __uint128_t 的数组使用包装函数会失败。查看 Godbolt link.

中的最后两个函数

`__uint128_t` 上最有效的 popcount？

Most efficient popcount on `__uint128_t`?

c

gcc

x86-64

intel

micro-optimization