Optimizing my CUDA kernel to sum varying index ranges inside a torch tensor
I want to write a CUDA kernel that sums given (contiguous) index ranges within an array. For example, for the input array arr=[1]*10 and the 3 sums sum(arr[0:2]), sum(arr[2:3]), sum(arr[3:10]), the output should be [2, 1, 7].
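For reference, a plain-Python sketch of the result I'm after, using the same example values (the boundary list idxs here matches the one in my kernel code below):
arr = [1] * 10
idxs = [0, 2, 3, 10]   # boundaries of the ranges [0:2], [2:3], [3:10]
print([sum(arr[idxs[i]:idxs[i+1]]) for i in range(len(idxs) - 1)])   # [2, 1, 7]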
My arrays are large 2D arrays (so I want these per-range sums for every row, using the same indices), with dimensions typically around 1,000 x 100,000, and the lengths of the index sub-ranges to be summed vary a lot (between 1 and >1,000). The arrays are already on the GPU as PyTorch tensors, so moving them to/from the CPU just for this purpose is expensive.
I wrote the following Numba kernel (a minimal working example is given here). Basically, each thread is responsible for one source column: it finds the relevant target column (with respect to the index ranges) and adds that column to the target.
from numba import cuda
import numpy as np
from math import ceil

@cuda.jit
def sum_idxs(arr, idxs, sum_arr):
    pos = cuda.grid(1)
    if pos >= arr.shape[1]: return
    # find which target column (index range) this source column belongs to
    for i in range(len(idxs)):
        if idxs[i] <= pos < idxs[i+1]:
            thread_idx = i
            break
    # add this column's value to the target column, for every row
    for i in range(arr.shape[0]):
        cuda.atomic.add(sum_arr, (i, thread_idx), arr[i, pos])

arr = np.ones(shape=(3, 10))
idxs = np.array([0, 2, 3, 10])
sum_arr = np.zeros(shape=(arr.shape[0], len(idxs)-1))

threads_per_block = 32
blocks_per_grid = ceil(arr.shape[1] / threads_per_block)
sum_idxs[threads_per_block, blocks_per_grid](arr, idxs, sum_arr)
print(sum_arr)
which gives the correct result:
[[2. 1. 7.]
[2. 1. 7.]
[2. 1. 7.]]
and lets me keep my tensors on the GPU, as required.
(For simplicity I've used numpy arrays here; in my actual code I call cuda.as_cuda_array(tensor) on my PyTorch tensors.)
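For completeness, this is roughly how I pass the tensors in: a minimal sketch, where the tensor names t, idxs_t and out_t are placeholders created here only so the snippet is self-contained.

import math
import torch
from numba import cuda

# placeholder tensors, already resident on the GPU
t = torch.ones((3, 10), device='cuda')
idxs_t = torch.tensor([0, 2, 3, 10], dtype=torch.int64, device='cuda')
out_t = torch.zeros((t.shape[0], idxs_t.shape[0] - 1), device='cuda')

threads_per_block = 32
blocks_per_grid = math.ceil(t.shape[1] / threads_per_block)

# wrap the tensors as Numba device arrays (no copy) and launch the kernel above
sum_idxs[blocks_per_grid, threads_per_block](
    cuda.as_cuda_array(t), cuda.as_cuda_array(idxs_t), cuda.as_cuda_array(out_t))
print(out_t)   # the result stays on the GPU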
However, this is still a major performance bottleneck in my code. Is there a way to optimize it further?
Here is one possible approach. A segmented reduction can usually be implemented fairly efficiently by using one block per segment (or, in this case, one block per row). If the number of segments/rows is large enough, this will saturate the GPU.
My suggested code design uses one block per row, and each block processes that row's 3 segments in sequence. To process a segment, the block uses a canonical CUDA reduction, with a block-stride loop to perform the initial data gathering.
Here is an example, which also fixes some of the issues mentioned in the comments on your code (correct grid dimensioning, conversion to float32):
$ cat t73.py
from numba import cuda,float32,int32
import numpy as np
import math

#TPB = threads per block, max of 1024
#TPB must be the power-of-2 expressed in TPBP2, i.e. TPB = 2**TPBP2
TPB = 1024
H = TPB//2
TPBP2 = 10

@cuda.jit
def sum_idxs(arr, idxs, sum_arr):
    pos = cuda.grid(1)
    if pos>=arr.shape[1]: return
    for i in range(len(idxs)):
        if idxs[i]<=pos<idxs[i+1]:
            thread_idx = i
            break
    for i in range(arr.shape[0]):
        cuda.atomic.add(sum_arr, (i, thread_idx), arr[i, pos])

@cuda.jit
def sum_idxs_i(arr, idxs, sum_arr):
    s = cuda.shared.array(shape=(TPB), dtype=float32)
    tx = cuda.threadIdx.x
    row = cuda.blockIdx.x
    #process each of the 3 segments in a row
    for j in range(3):
        lower = idxs[j]
        upper = idxs[j+1]
        val = float32(0)
        #block-stride loop to gather data from the segment
        for i in range(tx+lower, upper, TPB):
            val += arr[row, i]
        #canonical shared-memory parallel reduction
        s[tx] = val
        mid = H
        for i in range(TPBP2):
            cuda.syncthreads()
            if tx < mid:
                s[tx] += s[tx+mid]
            mid >>= 1
        if tx == 0:
            sum_arr[row, j] = s[0]

rows = 1000
cols = 100000
arr = np.ones(shape=(rows, cols),dtype=np.float32)
idxs = np.array([0, 2, 3, cols],dtype=np.int32)
sum_arr = np.zeros(shape=(arr.shape[0], len(idxs)-1),dtype=np.float32)
blocks_per_grid = math.ceil(arr.shape[1] / TPB)
sum_idxs[blocks_per_grid, TPB](arr, idxs, sum_arr)
print(sum_arr)
sum_arr = np.zeros(shape=(arr.shape[0], len(idxs)-1),dtype=np.float32)
blocks_per_grid = (arr.shape[0])
sum_idxs_i[blocks_per_grid, TPB](arr, idxs, sum_arr)
print(sum_arr)
$ nvprof python t73.py
==4383== NVPROF is profiling process 4383, command: python t73.py
[[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
...
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]]
[[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
...
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]]
==4383== Profiling application: python t73.py
==4383== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 45.92% 287.93ms 6 47.988ms 1.1520us 144.09ms [CUDA memcpy HtoD]
44.88% 281.42ms 6 46.903ms 1.4720us 140.74ms [CUDA memcpy DtoH]
8.46% 53.052ms 1 53.052ms 53.052ms 53.052ms cudapy::__main__::sum_idxs1(Array<float, int=2, C, mutable, aligned>, Array<int, int=1, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>)
0.75% 4.6729ms 1 4.6729ms 4.6729ms 4.6729ms cudapy::__main__::sum_idxs_i2(Array<float, int=2, C, mutable, aligned>, Array<int, int=1, C, mutable, aligned>, Array<double, int=2, C, mutable, aligned>)
API calls: 43.26% 339.61ms 6 56.602ms 20.831us 193.89ms cuMemcpyDtoH
36.75% 288.52ms 6 48.087ms 15.434us 144.35ms cuMemcpyHtoD
18.66% 146.51ms 1 146.51ms 146.51ms 146.51ms cuDevicePrimaryCtxRetain
0.93% 7.3083ms 5 1.4617ms 4.8120us 6.7314ms cuMemFree
0.23% 1.8049ms 6 300.81us 9.4520us 778.85us cuMemAlloc
0.04% 327.52us 2 163.76us 156.34us 171.19us cuLinkAddData
0.04% 299.72us 2 149.86us 148.92us 150.80us cuModuleLoadDataEx
0.04% 276.32us 2 138.16us 131.16us 145.16us cuLinkComplete
0.02% 123.96us 2 61.978us 61.252us 62.704us cuLinkCreate
0.01% 64.406us 2 32.203us 29.439us 34.967us cuLaunchKernel
0.01% 63.184us 2 31.592us 30.251us 32.933us cuDeviceGetName
0.00% 29.454us 1 29.454us 29.454us 29.454us cuMemGetInfo
0.00% 20.732us 26 797ns 477ns 2.0320us cuCtxGetCurrent
0.00% 12.852us 25 514ns 363ns 1.0920us cuCtxGetDevice
0.00% 12.429us 2 6.2140us 1.7830us 10.646us cuDeviceGetPCIBusId
0.00% 5.0950us 10 509ns 302ns 1.0770us cuFuncGetAttribute
0.00% 3.9600us 2 1.9800us 1.8000us 2.1600us cuModuleGetFunction
0.00% 3.5630us 2 1.7810us 1.7510us 1.8120us cuLinkDestroy
0.00% 1.8970us 1 1.8970us 1.8970us 1.8970us cuCtxPushCurrent
0.00% 1.8370us 4 459ns 226ns 697ns cuDeviceGet
0.00% 1.6080us 6 268ns 181ns 481ns cuDeviceGetAttribute
0.00% 1.5060us 3 502ns 230ns 795ns cuDeviceGetCount
0.00% 1.2390us 2 619ns 428ns 811ns cuDeviceComputeCapability
$
This code was run on a GTX960, which happens to report ~84GB/s of device memory bandwidth via the bandwidthTest CUDA sample code. In the example above we see that the improved kernel runs in ~4.7 ms (roughly 10x faster than the original atomic kernel), which translates to (1000*100000*4) bytes / 4.7 ms ~= 85GB/s, so we can conclude that for this particular test case this kernel is approximately "optimal".
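For reference, the back-of-the-envelope arithmetic behind that bandwidth figure, using the numbers from the test case and nvprof output above:

bytes_read = 1000 * 100000 * 4     # rows * cols * sizeof(float32)
kernel_time_s = 4.6729e-3          # sum_idxs_i kernel time reported by nvprof
print(bytes_read / kernel_time_s / 1e9)   # ~85.6 GB/s, in line with the ~84 GB/s from bandwidthTest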