Constant time for multiplication in Galois Field GF(4)

Question

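Tl;dr: Is there any way to improve the code below by any means (including multithreading), given that the code will run hundreds of billions of times?

The objective is to find a constant-time algorithm (without a for loop) for multiplication in the Galois field GF(4). I am not sure whether this is possible, but it is worth a try.

Some background: multiplication in GF(2), or base 2, is equivalent to ANDing the two values being multiplied. This is because:

a	b	a × b = a ∧ b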
0	0	0
0	1	0
1	0	0
1	1	1

For example:

10101011010100 × 10011000101101 = 

10101011010100 
10011000101101 ∧
--------------
10001000000100
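
As a quick illustration of the GF(2) case above, here is a minimal Python sketch. The variable names are mine, not part of the question; it only reproduces the AND example.

```python
# The two 14-bit vectors from the example above.
a = 0b10101011010100
b = 0b10011000101101

# In GF(2), multiplying digit by digit is a single bitwise AND.
product = a & b

# Print with zero padding on the left so all 14 digits are visible.
print(format(product, "014b"))  # 10001000000100
```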

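For GF(4), there are four different symbols that can be used: 0, 1, 2, and 3. It is not the same as multiplication in base 4, because some digits do not give the expected result when multiplied by other digits. These are the products 2 × 2, 2 × 3, 3 × 2 and 3 × 3 in the table below:

a	b	a × b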
0	0	0
0	1	0
0	2	0
0	3	0
1	0	0
1	1	1
1	2	2
1	3	3
2	0	0
2	1	2
2	2	3
2	3	1
3	0	0
3	1	3
3	2	1
3	3	2

The table above can be summarized more compactly with the following multiplication table:

×	1	2	3
0	0	0	0
1	1	2	3
2	2	3	1
3	3	1	2
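
For readers wondering where this table comes from, the sketch below regenerates it. It assumes the standard construction of GF(4) as polynomials over GF(2) reduced modulo x^2 + x + 1, with digit 2 standing for x and digit 3 for x + 1. The question does not state this construction, and `gf4_mul_digit` is a hypothetical helper name.

```python
def gf4_mul_digit(a, b):
    """Multiply two GF(4) digits (0..3), assuming reduction by x^2 + x + 1."""
    p = 0
    # Carry-less (XOR-based) multiplication of two 2-bit polynomials.
    if b & 1:
        p ^= a
    if b & 2:
        p ^= a << 1
    # If an x^2 term appeared, reduce it using x^2 = x + 1.
    if p & 0b100:
        p ^= 0b111
    return p


# Rebuild the full 4 x 4 table, including the column for 0.
table = [[gf4_mul_digit(a, b) for b in range(4)] for a in range(4)]
for row in table:
    print(*row)
```

The rows printed should match the table above once the all-zero column for b = 0 is added.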

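We can write each of the four digits in binary, since the multiplication will be performed on the binary representation of the values:

Digit	Binary representation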
0	00
1	01
2	10
3	11

In GF(4), multiplication is done digit by digit, without carry. For example:

21302032 × 31012233 = 

21302032
31012233 ×
--------
11003021

We can use the binary representation of the values and perform the multiplication:

21302032 = 1001110010001110 in binary
31012233 = 1101000110101111 in binary
11003021 = 0101000011001001 in binary

1001110010001110
1101000110101111 ×
----------------
0101000011001001
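
The worked example above can be reproduced with a short table-driven sketch. This is only a reference check, not the constant-time algorithm being asked for; `gf4_multiply_digits` and `to_binary` are hypothetical helper names.

```python
# GF(4) multiplication table, indexed as MUL[a][b].
MUL = [
    [0, 0, 0, 0],
    [0, 1, 2, 3],
    [0, 2, 3, 1],
    [0, 3, 1, 2],
]


def gf4_multiply_digits(x, y):
    """Multiply two equal-length base-4 digit strings digit by digit."""
    if len(x) != len(y):
        raise ValueError("both operands must have the same number of digits")
    return "".join(str(MUL[int(a)][int(b)]) for a, b in zip(x, y))


def to_binary(digits):
    """Encode each base-4 digit as two bits."""
    return "".join(format(int(d), "02b") for d in digits)


product = gf4_multiply_digits("21302032", "31012233")
print(product)             # 11003021
print(to_binary(product))  # 0101000011001001
```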

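Finally, here is a working Java implementation that performs the multiplication. However, it uses a for loop, and the goal is to come up with a constant-time algorithm: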

public class Multiplication {
    public static void main(String[] args) {
        final byte[][] MUL_ARRAY = new byte[][]{
                {0, 0, 0, 0},
                {0, 1, 2, 3},
                {0, 2, 3, 1},
                {0, 3, 1, 2}
        };
        long mask;
        byte shift = 2;

        //long is 64 bits which means it can store 32 digits quaternary value.
        int  n      = 8;                 //# of quaternary digits (ALWAYS given)
        long v1     = 0b1001110010001110;//21302032 in base 4
        long v2     = 0b1101000110101111;//31012233 in base 4
        long result = 0;

        for (int i = 0; i < n; i++) {
            //get the i-th quaternary digit (counting from the left) of the two vectors:
            mask = 0b11;
            mask = mask << 2 * (n - i - 1);
            long v1DigitPadded = v1 & mask;//correct digit with zero padding
            long v2DigitPadded = v2 & mask;//correct digit with zero padding
            //shift the digits so that the digit needed is at far-right
            v1DigitPadded = v1DigitPadded >>> 2 * (n - i - 1);
            v2DigitPadded = v2DigitPadded >>> 2 * (n - i - 1);
            //The actual quaternary digit
            byte v1Digit     = (byte) v1DigitPadded;
            byte v2Digit     = (byte) v2DigitPadded;
            byte product     = MUL_ARRAY[v1Digit][v2Digit];
            //cast to long before shifting: an int shift would wrap around for n > 16
            long resultDigit = (long) product << 2 * (n - i - 1);
            result = result | resultDigit;
        }
        //desired output: 0101000011001001
        //prints the value in binary with zeros padding on the left
        String s      = Long.toBinaryString(result);
        String output = String.format("%" + n * 2 + "s", s).replace(" ", "0");
        System.out.println("The output is: " + output);
    }
}

Is there such an algorithm? If not, are there any improvements that could help my logic (perhaps an efficient multithreaded approach)?

Answer 1

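The answer to the question "Can a formula (without a for loop) be used to get the result?" is yes.

Here is some code. It is in Python, but it should translate directly to Java without major modification. An example and an explanation follow the code. It is written assuming 64-bit integers that encode 32 elements of GF(4); if you want 32-bit integers, use a MASK of 0x5555_5555 instead.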

#: Mask to extract the low-order bit of each GF(4) element.
MASK = 0x5555_5555_5555_5555


def gf4_vector_multiply(i, j):
    """
    Vector multiply in GF(4).

    i and j are 64-bit integers, each of which encodes 32 elements
    of GF(4) in pairs of consecutive bits.

    The returned value represents the 32 products of the GF(4) elements,
    encoded as a 64-bit integer in the same way.
    """
    ii = i >> 1
    jj = j >> 1
    r0 = (ii & jj) ^ (i & j)
    r1 = (ii & j) ^ (jj & i) ^ (ii & jj)
    return ((r1 & MASK) << 1) ^ (r0 & MASK)

Example

Applying the function above to the example you gave, multiplying 21302032 by 31012233 gives 11003021:

>>> i = int("21302032", base=4)
>>> j = int("31012233", base=4)
>>> product = gf4_vector_multiply(i, j)
>>> product == int("11003021", base=4)
True
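
Beyond the single example, a randomized comparison against a plain digit-by-digit loop gives some extra confidence. This is a sketch: `gf4_reference_multiply` is a hypothetical helper that mirrors the lookup-table loop from the question, and the seed and trial count are arbitrary.

```python
import random

MASK = 0x5555_5555_5555_5555

# GF(4) multiplication table, indexed as MUL[a][b].
MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]


def gf4_vector_multiply(i, j):
    """The loop-free formula from the answer."""
    ii = i >> 1
    jj = j >> 1
    r0 = (ii & jj) ^ (i & j)
    r1 = (ii & j) ^ (jj & i) ^ (ii & jj)
    return ((r1 & MASK) << 1) ^ (r0 & MASK)


def gf4_reference_multiply(i, j, n=32):
    """Digit-by-digit reference using the lookup table."""
    result = 0
    for k in range(n):
        a = (i >> (2 * k)) & 3
        b = (j >> (2 * k)) & 3
        result |= MUL[a][b] << (2 * k)
    return result


rng = random.Random(0)  # fixed seed so the check is repeatable
mismatches = 0
for _ in range(10_000):
    i = rng.getrandbits(64)
    j = rng.getrandbits(64)
    if gf4_vector_multiply(i, j) != gf4_reference_multiply(i, j):
        mismatches += 1
print("mismatches:", mismatches)
```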

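Explanation

We compute the low-order bit (r0 in the code above) and the high-order bit (r1) of each product separately, and then combine them.

For the low-order bit of the product, we have the following table (copied from the table in the question, but with 2s and 3s replaced by 0s and 1s):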

×	1	2	3
0	0	0	0
1	1	0	1
2	0	1	1
3	1	1	0

The values in this table are the bitwise XOR of the corresponding values in the following two tables:

×	1	2	3
0	0	0	0
1	1	0	1
2	0	0	0
3	1	0	1

and

×	1	2	3
0	0	0	0
1	0	0	0
2	0	1	1
3	0	1	1

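The first of these two tables is just the bitwise AND of the low-order bits of the inputs (i & j in the code), while the second is the bitwise AND of the high-order bits of the inputs (ii & jj in the code). So (ii & jj) ^ (i & j) gives us the low-order bit of the result. The high-order bit r1 is constructed similarly.

Note that the interesting parts of r0 and r1 are in the low-order bit of each pair, that is, r0 & 0x55555555 and r1 & 0x55555555. The high-order bits of each pair (r0 & 0xaaaaaaaa and r1 & 0xaaaaaaaa) are of no use: their values contain results that mix consecutive GF(4) digits, which is not what we want, so we must be careful to mask out those bits before reassembling the result.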
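
To make the bit-level reasoning concrete, the sketch below applies the same two formulas to a single pair of digits and compares the result with the multiplication table for all 16 combinations. `product_from_bits` is a hypothetical name; a1/a0 and b1/b0 denote the high and low bits of each digit.

```python
# GF(4) multiplication table, indexed as MUL[a][b].
MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]


def product_from_bits(a, b):
    """Rebuild a GF(4) product from the per-bit formulas in the answer."""
    a1, a0 = a >> 1, a & 1
    b1, b0 = b >> 1, b & 1
    r0 = (a1 & b1) ^ (a0 & b0)              # low-order bit of the product
    r1 = (a1 & b0) ^ (b1 & a0) ^ (a1 & b1)  # high-order bit of the product
    return (r1 << 1) | r0


# Check every digit pair against the table.
all_match = all(
    product_from_bits(a, b) == MUL[a][b] for a in range(4) for b in range(4)
)
print(all_match)  # True
```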

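Optimization

I have not tried hard to minimize the number of bitwise operations here; there may be some room to reduce the total. The code above uses 14 operations. Here is a less clear variant that uses 12 operations. Note that under the normal operator precedence rules (in both Java and Python), all but one pair of the parentheses are redundant: I find the code more readable with the parentheses, but feel free to drop them if you think it looks better that way.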

def gf4_vector_multiply(i, j):
    """
    Vector multiply in GF(4).

    i and j are 64-bit integers, each of which encodes 32 elements
    of GF(4) in pairs of bits.

    The returned value represents the 32 products of the GF(4) elements,
    encoded in the same way.
    """
    ii = (i >> 1) & MASK
    jj = (j >> 1) & MASK
    return (((ii & j) ^ (jj & i)) << 1) ^ (ii & jj) ^ (i & j)
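
Because Python integers have arbitrary precision, the same 12-operation formula can also be applied to vectors longer than 32 digits by building a wider mask. The sketch below is Python-specific (it does not carry over to a Java long), and `gf4_vector_multiply_n` and `gf4_reference_multiply` are hypothetical names.

```python
import random

# GF(4) multiplication table, indexed as MUL[a][b].
MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]


def gf4_vector_multiply_n(i, j, n):
    """12-operation variant for n GF(4) digits packed into one Python int."""
    mask = (4**n - 1) // 3  # binary 0101...01 with n pairs
    ii = (i >> 1) & mask
    jj = (j >> 1) & mask
    return (((ii & j) ^ (jj & i)) << 1) ^ (ii & jj) ^ (i & j)


def gf4_reference_multiply(i, j, n):
    """Digit-by-digit reference using the lookup table."""
    result = 0
    for k in range(n):
        result |= MUL[(i >> (2 * k)) & 3][(j >> (2 * k)) & 3] << (2 * k)
    return result


n = 100  # arbitrary length, longer than one 64-bit word
rng = random.Random(1)
x = rng.getrandbits(2 * n)
y = rng.getrandbits(2 * n)
same = gf4_vector_multiply_n(x, y, n) == gf4_reference_multiply(x, y, n)
print(same)
```

For n = 32 the computed mask equals the MASK constant used above.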

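Also, as you suggest, if you have an embarrassingly parallel problem then multithreading may well help, but I think that is really a matter for a separate Stack Overflow question. On a modern CPU, the code above is likely simple enough that the bottleneck will be fetching data from RAM rather than the computation.

Special case n = 1

In the comments, the asker inquired about single-digit multiplication. For single-digit multiplication, there are other tricks available. Here is one possibility (again in Python, but again it should translate directly to Java):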

def gf4_scalar_multiply(i, j):
    return (0x813e4 >> 2*i*j) & 3

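Here we simply encode the product values as bit pairs in the magic constant 0x813e4, which is 2001033210 in base 4. We then use a shift, indexed by the ordinary integer product i*j, to look up the corresponding GF(4) product.

Let's check that this gives the expected multiplication table: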

>>> for i in range(4):
...     for j in range(4):
...          print(gf4_scalar_multiply(i, j), end=" ")
...     print()
... 
0 0 0 0 
0 1 2 3 
0 2 3 1 
0 3 1 2
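
The magic constant can be rebuilt from the multiplication table, which shows why the lookup works. This is a sketch with my own variable names: each GF(4) product is stored at base-4 position i*j, and the only index shared by two different pairs (2 × 3 and 3 × 2, both at position 6) holds the same product.

```python
# GF(4) multiplication table, indexed as MUL[a][b].
MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]

magic = 0
for i in range(4):
    for j in range(4):
        # Store the GF(4) product in the bit pair at base-4 position i*j.
        magic |= MUL[i][j] << (2 * i * j)

print(hex(magic))  # 0x813e4


def gf4_scalar_multiply(i, j):
    """Single-digit multiply by shifting the rebuilt constant."""
    return (magic >> 2 * i * j) & 3
```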
