向量化 itertools combination_with_replacement

Question

我的目标是使用我的 GPU 加速组合列表的创建。我怎样才能做到这一点？

例如，以下代码创建了一个包含 260 个文本字符串的列表，范围从“aa”到“jz”。然后我们使用 itertools combinations_with_replacement() 来创建此列表的 R 元素的所有可能组合。 timeit 的使用表明，超过 3 个元素时，提取这些组合的列表会呈指数级减慢。我怀疑这可以用 numba cuda 来完成，但我不知道怎么做。

import timeit
timeit.timeit('''

from itertools import combinations_with_replacement

combo_count = 2

alphabet = 'a'
alpha_list = []
item_list = []

for i in range(0,26):
    alpha_list.append(alphabet)
    alphabet = chr(ord(alphabet) + 1)

for first_letter in alpha_list[0:10]:
    for second_letter in alpha_list:
        item_list.append(first_letter+second_letter)

print("Length of item list:",len(item_list))
    
combos = combinations_with_replacement(item_list,combo_count)
cmb_lst = [bla for bla in combos]
print("Length of list of all {} combinations: {}".format(combo_count,len(cmb_lst)))
              ''', number=1)

Answer 1

如评论中所述，无法直接（使用 Numba CUDA）从 itertools 库“矢量化”combinations_with_replacement() 调用。 Numba CUDA doesn't work that way.

但是，我相信应该可以使用 Numba CUDA 生成等效的结果数据集，在某些情况下，它似乎运行比 itertools 库函数更快。我想可能有多种方法可以实现这一点，我并没有声称我描述的方法在任何方面都是最优的。它当然不是，而且肯定可以运行更快。然而，根据我的测试，即使是这种不是很优化的方法，在 V100 GPU 上运行比 python itertools 大约快 10 倍。

作为背景，我认为 this and this（或等效的 material）是必不可少的阅读材料。

根据以上，具有 k 个选项的 n 项的组合数公式为：

(n-1+k)!
--------   (Formula 1)
(n-1)!k!

在下面的代码中，我将上面的计算封装在count_comb_with_repl（设备）和host_count_comb_with_repl（主机）函数中。事实证明，我们可以使用这个基本计算，为 n 和 k 提供更小的级联选择序列，以驱动整个计算过程来计算仅给定最终结果索引的组合大批。为了形象化我们正在做的事情，有一个简单的例子会有所帮助。让我们以 3 个项目和 3 个选择为例。从零开始索引项目，可能性数组如下所示：

n = 3, k = 3
index    choices  first digit calculation
0        0,0,0    -----------------
1        0,0,1
2        0,0,2
3        0,1,1     equivalent to n = 3, k = 2
4        0,1,2
5        0,2,2    -----------------
6        1,1,1    -----------------
7        1,1,2     equivalent to n = 2, k = 2
8        1,2,2    -----------------
9        2,2,2     equivalent to n = 1, k = 2

上面列表的长度是通过将 n = 3 和 k = 3 的值代入公式 1 得到的。理解我提出的方法的关键观察是计算选择结果的第一个数字仅给出索引，我们可以计算 0 和 1 之间的分界点，例如通过观察考虑第一个选择索引为 0 的结果，该范围的长度通过插入值给出n = 3 和 k = 2 进入公式 1。因此，如果我们给定的索引小于此值 (6)，则我们知道第一个数字为 0。如果大于此值，则我们知道第一个数字是 1 或 2，通过适当的偏移，我们可以重新计算下一个范围（对应于 1 的第一个数字），看看我们的索引是否落在这个范围内。

一旦我们知道了第一个数字，我们就可以重复这个过程（通过适当的列表缩减和偏移）来找到下一个数字，下一个数字等等。

这里是实现上述方法的python代码。正如我提到的，对于 n=260 和 k=4 的测试用例，这个运行s 在我的 V100 上不到 3 秒。

$ cat t2.py
from numba import cuda,jit
import numpy as np

@cuda.jit(device=True)
def get_next_count_comb_with_repl(n,k,prev):
    return int(round((prev*(n))/(n+k)))

@cuda.jit(device=True)
def count_comb_with_repl(n,k):
    mymax = max(n-1,k)
    ans = 1.0
    cnt = 1
    for i in range(mymax+1, n+k):
        ans = ans*i/cnt
        cnt += 1
    return int(round(ans))
#intended to be identical to the previous function
#I just need a version I can call from host code
def host_count_comb_with_repl(n,k):
    mymax = max(n-1,k)
    ans = 1.0
    cnt = 1
    for i in range(mymax+1, n+k):
        ans = ans*i/cnt
        cnt += 1
    return int(round(ans))

@cuda.jit(device=True)
def find_first_digit(n,k,i):
    psum = 0
    count = count_comb_with_repl(n, k-1)
    if (i-psum) < count:
        return 0,psum
    psum += count
    for j in range(1,n):
        count = get_next_count_comb_with_repl(n-j,k-1,count)
        if (i-psum) < count:
            return j,psum
        psum += count
    return -1,0 # error

@cuda.jit
def kernel_count_comb_with_repl(n, k, l, r):
    for i in range(cuda.grid(1), l, cuda.gridsize(1)):
        new_ll = n
        new_cc = k
        new_i = i
        new_digit = 0
        for j in range(k):
            digit,psum = find_first_digit(new_ll, new_cc, new_i)
            new_digit += digit
            new_ll -= digit
            new_cc -= 1
            new_i  -= psum
            r[i+j*l] = new_digit

combo_count = 4
ll = 260
cl = host_count_comb_with_repl(ll, combo_count)
print(cl)
# bug if cl > 2G
if cl < 2**31:
    my_dtype = np.uint8
    if ll > 255:
        my_dtype = np.uint16
    r   = np.empty(cl*combo_count, dtype=my_dtype)
    d_r = cuda.device_array_like(r)
    block = 256
    grid = (cl//block)+1
    #grid = 640
    kernel_count_comb_with_repl[grid,block](ll, combo_count, cl, d_r)
    r = d_r.copy_to_host()
    print(r.reshape(combo_count,cl))
$ time python t2.py
194831715
[[  0   0   0 ... 258 258 259]
 [  0   0   0 ... 258 259 259]
 [  0   0   0 ... 259 259 259]
 [  0   1   2 ... 259 259 259]]

real    0m2.212s
user    0m1.110s
sys     0m1.077s
$

（上面的测试用例：n=260，k = 4，使用 OP 的代码在我的系统上花费大约 30 秒。）

这应该被认为是一个想法的草图。我没有声称它没有缺陷。这种类型的问题会很快耗尽 GPU 上的内存（对于 n and/or k 足够大的选择），而您唯一的指示可能是粗略的内存不足错误来自 numba.

是的，上面的代码不会产生 aa 到 jz 的串联，但这只是使用结果的索引练习。您将使用结果索引来索引您的项目数组，根据需要将 0,0,0,1 之类的结果转换为 aa、aa、aa 之类的结果ab

这不是全面的性能胜利。他们 python 方法对于较小的测试用例仍然更快，而较大的测试用例（例如 n = 260，k = 5）将超过 GPU 上的可用内存。

向量化 itertools combination_with_replacement

Vectorize itertools combination_with_replacement

python

vectorization

itertools

numba