Optimizing my CUDA kernel to sum varying index ranges inside a torch tensor
I want to write a CUDA kernel that sums given (contiguous) index ranges within an array. For example, for the input array arr=[1]*10 and the 3 sums sum(arr[0:2]), sum(arr[2:3]), sum(arr[3:10]), the output should be [2, 1, 7].
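For reference, a plain-Python sketch of the result I'm after, using the same example values (the boundary list idxs here matches the one in my kernel code below):
arr = [1] * 10
idxs = [0, 2, 3, 10]   # boundaries of the ranges [0:2], [2:3], [3:10]
print([sum(arr[idxs[i]:idxs[i+1]]) for i in range(len(idxs) - 1)])   # [2, 1, 7]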
My arrays are large 2D arrays (so I want these per-range sums for every row, using the same indices), with dimensions typically around 1,000 x 100,000, and the lengths of the index sub-ranges to be summed vary a lot (between 1 and >1,000). The arrays are already on the GPU as PyTorch tensors, so moving them to/from the CPU just for this purpose is expensive.
I wrote the following Numba kernel (a minimal working example is given here). Basically, each thread is responsible for one source column: it finds the relevant target column (with respect to the index ranges) and adds that column to the target.
from numba import cuda
import numpy as np
from math import ceil

@cuda.jit
def sum_idxs(arr, idxs, sum_arr):
    pos = cuda.grid(1)
    if pos >= arr.shape[1]: return
    # find which target column (index range) this source column belongs to
    for i in range(len(idxs)):
        if idxs[i] <= pos < idxs[i+1]:
            thread_idx = i
            break
    # add this column's value to the target column, for every row
    for i in range(arr.shape[0]):
        cuda.atomic.add(sum_arr, (i, thread_idx), arr[i, pos])

arr = np.ones(shape=(3, 10))
idxs = np.array([0, 2, 3, 10])
sum_arr = np.zeros(shape=(arr.shape[0], len(idxs)-1))

threads_per_block = 32
blocks_per_grid = ceil(arr.shape[1] / threads_per_block)
sum_idxs[threads_per_block, blocks_per_grid](arr, idxs, sum_arr)
print(sum_arr)
which gives the correct result:
[[2. 1. 7.]
[2. 1. 7.]
[2. 1. 7.]]
and lets me keep my tensors on the GPU, as required.
(For simplicity I've used numpy arrays here; in my actual code I call cuda.as_cuda_array(tensor) on my PyTorch tensors.)
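For completeness, this is roughly how I pass the tensors in: a minimal sketch, where the tensor names t, idxs_t and out_t are placeholders created here only so the snippet is self-contained.

import math
import torch
from numba import cuda

# placeholder tensors, already resident on the GPU
t = torch.ones((3, 10), device='cuda')
idxs_t = torch.tensor([0, 2, 3, 10], dtype=torch.int64, device='cuda')
out_t = torch.zeros((t.shape[0], idxs_t.shape[0] - 1), device='cuda')

threads_per_block = 32
blocks_per_grid = math.ceil(t.shape[1] / threads_per_block)

# wrap the tensors as Numba device arrays (no copy) and launch the kernel above
sum_idxs[blocks_per_grid, threads_per_block](
    cuda.as_cuda_array(t), cuda.as_cuda_array(idxs_t), cuda.as_cuda_array(out_t))
print(out_t)   # the result stays on the GPU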
However, this is still a major performance bottleneck in my code. Is there a way to optimize it further?
Here is one possible approach. A segmented reduction can usually be implemented fairly efficiently by using one block per segment (or, in this case, one block per row). If the number of segments/rows is large enough, this will saturate the GPU.
My suggested code design uses one block per row, and each block processes that row's 3 segments in sequence. To process a segment, the block uses a canonical CUDA reduction, with a block-stride loop to perform the initial data gathering.
Here is an example, which also fixes some of the issues mentioned in the comments on your code (correct grid dimensioning, conversion to float32):
$ cat t73.py
from numba import cuda,float32,int32
import numpy as np
import math

#TPB = threads per block, max of 1024
#TPB must be the power-of-2 expressed in TPBP2, i.e. TPB = 2**TPBP2
TPB = 1024
H = TPB//2
TPBP2 = 10

@cuda.jit
def sum_idxs(arr, idxs, sum_arr):
    pos = cuda.grid(1)
    if pos>=arr.shape[1]: return
    for i in range(len(idxs)):
        if idxs[i]<=pos<idxs[i+1]:
            thread_idx = i
            break
    for i in range(arr.shape[0]):
        cuda.atomic.add(sum_arr, (i, thread_idx), arr[i, pos])

@cuda.jit
def sum_idxs_i(arr, idxs, sum_arr):
    s = cuda.shared.array(shape=(TPB), dtype=float32)
    tx = cuda.threadIdx.x
    row = cuda.blockIdx.x
    #process each of the 3 segments in a row
    for j in range(3):
        lower = idxs[j]
        upper = idxs[j+1]
        val = float32(0)
        #block-stride loop to gather data from the segment
        for i in range(tx+lower, upper, TPB):
            val += arr[row, i]
        #canonical shared-memory parallel reduction
        s[tx] = val
        mid = H
        for i in range(TPBP2):
            cuda.syncthreads()
            if tx < mid:
                s[tx] += s[tx+mid]
            mid >>= 1
        if tx == 0:
            sum_arr[row, j] = s[0]

rows = 1000
cols = 100000
arr = np.ones(shape=(rows, cols),dtype=np.float32)
idxs = np.array([0, 2, 3, cols],dtype=np.int32)
sum_arr = np.zeros(shape=(arr.shape[0], len(idxs)-1),dtype=np.float32)
blocks_per_grid = math.ceil(arr.shape[1] / TPB)
sum_idxs[blocks_per_grid, TPB](arr, idxs, sum_arr)
print(sum_arr)
sum_arr = np.zeros(shape=(arr.shape[0], len(idxs)-1),dtype=np.float32)
blocks_per_grid = (arr.shape[0])
sum_idxs_i[blocks_per_grid, TPB](arr, idxs, sum_arr)
print(sum_arr)
$ nvprof python t73.py
==4383== NVPROF is profiling process 4383, command: python t73.py
[[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
...
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]]
[[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
...
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]]
==4383== Profiling application: python t73.py
==4383== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 45.92% 287.93ms 6 47.988ms 1.1520us 144.09ms [CUDA memcpy HtoD]
44.88% 281.42ms 6 46.903ms 1.4720us 140.74ms [CUDA memcpy DtoH]
8.46% 53.052ms 1 53.052ms 53.052ms 53.052ms cudapy::__main__::sum_idxs1(Array<float, int=2, C, mutable, aligned>, Array<int, int=1, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>)
0.75% 4.6729ms 1 4.6729ms 4.6729ms 4.6729ms cudapy::__main__::sum_idxs_i2(Array<float, int=2, C, mutable, aligned>, Array<int, int=1, C, mutable, aligned>, Array<double, int=2, C, mutable, aligned>)
API calls: 43.26% 339.61ms 6 56.602ms 20.831us 193.89ms cuMemcpyDtoH
36.75% 288.52ms 6 48.087ms 15.434us 144.35ms cuMemcpyHtoD
18.66% 146.51ms 1 146.51ms 146.51ms 146.51ms cuDevicePrimaryCtxRetain
0.93% 7.3083ms 5 1.4617ms 4.8120us 6.7314ms cuMemFree
0.23% 1.8049ms 6 300.81us 9.4520us 778.85us cuMemAlloc
0.04% 327.52us 2 163.76us 156.34us 171.19us cuLinkAddData
0.04% 299.72us 2 149.86us 148.92us 150.80us cuModuleLoadDataEx
0.04% 276.32us 2 138.16us 131.16us 145.16us cuLinkComplete
0.02% 123.96us 2 61.978us 61.252us 62.704us cuLinkCreate
0.01% 64.406us 2 32.203us 29.439us 34.967us cuLaunchKernel
0.01% 63.184us 2 31.592us 30.251us 32.933us cuDeviceGetName
0.00% 29.454us 1 29.454us 29.454us 29.454us cuMemGetInfo
0.00% 20.732us 26 797ns 477ns 2.0320us cuCtxGetCurrent
0.00% 12.852us 25 514ns 363ns 1.0920us cuCtxGetDevice
0.00% 12.429us 2 6.2140us 1.7830us 10.646us cuDeviceGetPCIBusId
0.00% 5.0950us 10 509ns 302ns 1.0770us cuFuncGetAttribute
0.00% 3.9600us 2 1.9800us 1.8000us 2.1600us cuModuleGetFunction
0.00% 3.5630us 2 1.7810us 1.7510us 1.8120us cuLinkDestroy
0.00% 1.8970us 1 1.8970us 1.8970us 1.8970us cuCtxPushCurrent
0.00% 1.8370us 4 459ns 226ns 697ns cuDeviceGet
0.00% 1.6080us 6 268ns 181ns 481ns cuDeviceGetAttribute
0.00% 1.5060us 3 502ns 230ns 795ns cuDeviceGetCount
0.00% 1.2390us 2 619ns 428ns 811ns cuDeviceComputeCapability
$
This code was run on a GTX960, which happens to report ~84GB/s of device memory bandwidth via the bandwidthTest CUDA sample code. In the example above we see that the improved kernel runs in ~4.7 ms (roughly 10x faster than the original atomic kernel), which translates to (1000*100000*4) bytes / 4.7 ms ~= 85GB/s, so we can conclude that for this particular test case this kernel is approximately "optimal".
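For reference, the back-of-the-envelope arithmetic behind that bandwidth figure, using the numbers from the test case and nvprof output above:

bytes_read = 1000 * 100000 * 4     # rows * cols * sizeof(float32)
kernel_time_s = 4.6729e-3          # sum_idxs_i kernel time reported by nvprof
print(bytes_read / kernel_time_s / 1e9)   # ~85.6 GB/s, in line with the ~84 GB/s from bandwidthTest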