优化C中的一个位译码操作

Question

我有一个按以下方式编码的无符号 32 位整数：

前 6 位定义 opcode
接下来的 8 位定义了一个 register
接下来的 18 位是二进制补码有符号整数 value。

我目前正在解码这个数字（uint32_t inst）使用：

const uint32_t opcode = ((inst >> 26) & 0x3F);
const uint32_t r1 = (inst >> 18) & 0xFF;
const int32_t value = ((inst >> 17) & 0x01) ? -(131072 - (inst & 0x1FFFF)) : (inst & 0x1FFFF);

我可以在解码值时测量到显着的开销，我很确定这是由于用于比较符号和执行负运算的三元运算符（本质上是 if 语句）造成的。

有没有办法更快地进行值解码？

Answer 1

const uint32_t opcode = ((inst >> 26) & 0x3F);
const uint32_t r1 = (inst >> 18) & 0xFF;
const uint32_t negative = ((inst >> 17) & 0x01);
const int32_t value =  -(negative * 131072 - (inst & 0x1FFFF));

当 negative 为 1 -(131072 - (inst & 0x1FFFF)) 且为 0 时：-(0 - (inst & 0x1FFFF)) 等于 inst & 0x1FFFF.

Answer 2

你的表达式比它需要的更复杂，尤其是在不必要地涉及三元运算符时。以下表达式在不涉及三元运算符的情况下为所有输入计算相同的结果。^* 它是一个很好的替代候选者，但与任何优化问题一样，测试是必不可少的：

const int32_t value = (int32_t)(inst & 0x1FFFF) - (int32_t)(inst & 0x20000);

或者@doynax 的类似建议的这种变体可能对优化器更友好：

const int32_t value = (int32_t)(inst & 0x3FFFF ^ 0x20000) - (int32_t)0x20000;

在每种情况下，强制转换都会避免实现定义的行为；在许多体系结构上，就机器代码而言，它们将是空操作。在那些架构上，这些表达式在所有情况下都比您的表达式涉及的操作更少，更不用说是无条件的了。

涉及移位的竞争替代方案也可能优化得很好，但所有这些替代方案都必然依赖于实现定义的行为，因为左移的整数溢出，负整数是右移的左侧操作数，和/或将超出范围的值转换为带符号的整数类型。您将不得不自己确定是否构成问题。

^* 由 GCC 4.4.7 为 x86_64 编译。原始表达式为某些输入调用实现定义的行为，因此在其他实现中，两个表达式可能会为这些输入计算不同的值。

Answer 3

标准（即使是不可移植的）做法是先左移，然后算术右移：

const int32_t temp = inst << 14; // "shift out" the 14 unneeded bits
const int32_t value = temp >> 14; // shift the number back; sign-extend

这涉及从 uint32_t 到 int32_t 的转换以及可能为负数 int32_t 的右移；这两个操作都是实现定义的，即不可移植（在 2 的补码系统上工作；几乎可以保证在任何体系结构上工作）。如果您想获得最佳性能并愿意依赖实现定义的行为，则可以使用此代码。

作为单个表达式：

const int32_t value = (int32_t)(inst << 14) >> 14;

注意：以下看起来更清晰，通常也可以工作，但涉及 undefined 行为（有符号整数溢出）：

const int32_t value = (int32_t)inst << 14 >> 14;

不要使用它！（即使您可能不会收到任何关于它的警告或错误）。

Answer 4

您可以考虑使用位域来简化您的代码。

typedef struct inst_type {
#ifdef MY_MACHINE_NEEDS_THIS
    uint32_t opcode :  6;
    uint32_t r1     :  8;
    int32_t  value  : 18;
#else
    int32_t  value  : 18;
    uint32_t r1     :  8;
    uint32_t opcode :  6;
#endif
} inst_type;

const uint32_t opcode = inst.opcode;
const uint32_t r1 = inst.r1;
const int32_t value = inst.value;

直接位操作通常性能更好，但并非总是如此。使用 John Bollinger 的回答作为基线，上述结构导致在 GCC 上提取三个感兴趣的值的指令减少了（但指令减少并不一定意味着更快）。

Answer 5

对于没有实现定义或未定义行为的理想编译器输出，使用@doynax 的 2 的补码解码表达式：

value = (int32_t)((inst & 0x3FFFF) ^ 0x20000) - (int32_t)0x20000;

强制转换确保我们正在做带符号的减法，而不是带环绕的无符号减法，然后将该位模式分配给带符号的整数。

这编译为 ARM 上的最佳 asm，其中 gcc 使用 sbfx r1, r1, #0, #18 (signed bitfield-extract) 将位 [17:0] 符号扩展为完整的 int32_t 寄存器。在 x86 上，它使用 shl 乘以 14 和 sar 乘以 14（算术移位）来做同样的事情。这是 gcc 识别 2 的补码模式并使用目标机器上最优化的任何东西对位域进行符号扩展的明显标志。

没有可移植的方法来确保位域按照您想要的方式排序。 gcc 似乎为小端目标从 LSB 到 MSB 排序位域，但为大端目标排序 MSB 到 LSB。您可以使用 #if 为 ARM with/without -mbig-endian 获得相同的 asm 输出，就像其他方法 but there's no guarantee that other compilers work the same.

如果 gcc/clang 没有看穿 xor 和 sub，那么值得考虑 <<14 / >>14 实现，它可以让编译器以这种方式进行操作.或者考虑使用 #if.

的 signed/unsigned 位域方法

但是既然我们可以从 gcc/clang 中得到理想的 asm 和完全安全且可移植的代码，我们就应该这样做。

请参阅 Godbolt Compiler Explorer 上的代码，了解大多数答案的版本。您可以查看 x86、ARM、ARM64 或 PowerPC 的 asm 输出。

// have to put the results somewhere, so the function doesn't optimize away
struct decode {
  //unsigned char opcode, r1;
  unsigned int opcode, r1;
  int32_t value;
};
// in real code you might return the struct by value, but there's less ABI variation when looking at the ASM this way (some would pack the struct into registers)

void decode_two_comp_doynax(struct decode *result, uint32_t inst) {
  result->opcode = ((inst >> 26) & 0x3F);
  result->r1 = (inst >> 18) & 0xFF;
  result->value = ((inst & 0x3FFFF) ^ 0x20000) - 0x20000;
}

# clang 3.7.1 -O3 -march=haswell   (enables BMI1 bextr)
    mov     eax, esi
    shr     eax, 26                     # grab the top 6 bits with a shift
    mov     dword ptr [rdi], eax
    mov     eax, 2066          # (0x812)# only AMD provides bextr r32, r32, imm.  Intel has to set up the constant separately
    bextr   eax, esi, eax               # extract the middle bitfield
    mov     dword ptr [rdi + 4], eax
    shl     esi, 14                     # <<14
    sar     esi, 14                     # >>14 (arithmetic shift)
    mov     dword ptr [rdi + 8], esi
    ret

优化C中的一个位译码操作

Optimize a bit decoding operation in C

c

performance

bit-manipulation