Created Shared Memory Code with Python CUDA
I'm trying to get some code running to explore shared memory features for fast matrix multiplication. But every time I try it, I seem to run into errors that I cannot understand.
import numpy as np
from numba import cuda, types
m = 128
n = 32
a = np.arange(m*n).reshape(m,n).astype(np.int32)
b = np.arange(m*n).reshape(n,m).astype(np.int32)
c = np.zeros((m, n)).astype(np.int32)
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)
block_size = (m,n)
grid_size = (int(m/n),int(m/n))
@cuda.jit
def mm(a, b, c):
    column, row = cuda.grid(2)
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    a_cache[cuda.threadIdx.y, cuda.threadIdx.x] = a[row, column]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[column, row]
    cuda.syncthreads()
    for i in range(a.shape[1]):
        sum += a_cache[row][i] * b_cache[i][column]
    c[row][column] = sum
and the test
mm[grid_size, block_size](d_a, d_b, d_c)
solution = a@b
output = d_c.copy_to_host()
keeps producing the following error:
CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
After chatting with the provider of one answer I updated the function, but I still cannot make this work. So, to compute the sum for each element of the output c, we need to loop over the columns of A and the rows of B, using i as the index, so we have n*n products. I think I have the sum right, but I cannot seem to get the right row and column indices for a and b in the sum expression.
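For reference, here is the element-wise definition I am trying to reproduce (a plain Python/NumPy loop I wrote as a sanity check; mm_reference is my own name, not part of the assignment):

import numpy as np

def mm_reference(a, b):
    # Each output element is the dot product of a row of `a` with a
    # column of `b`; `i` walks the shared dimension (a.shape[1] == b.shape[0]).
    m, p = a.shape[0], b.shape[1]
    c = np.zeros((m, p), dtype=np.int64)
    for row in range(m):
        for column in range(p):
            total = 0
            for i in range(a.shape[1]):
                total += a[row, i] * b[i, column]
            c[row, column] = total
    return c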
import numpy as np
from numba import cuda, types

@cuda.jit
def mm_shared(a, b, c):
    column, row = cuda.grid(2)
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, column]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[row, column]
    cuda.syncthreads()
    for i in range(a.shape[1]):
        sum += a_cache[cuda.threadIdx.x, i] * b_cache[i, cuda.threadIdx.y]
    c[row][column] = sum
Your block size is invalid. CUDA devices have a limit of 1024 threads per block. When I run your code I see this:
/opt/miniconda3/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py in _check_error(self, fname, retcode)
327 _logger.critical(msg, _getpid(), self.pid)
328 raise CudaDriverError("CUDA initialized before forking")
--> 329 raise CudaAPIError(retcode, msg)
330
331 def get_device(self, devnum=0):
CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
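For reference, block_size = (m, n) = (128, 32) requests 4096 threads per block. A launch configuration that stays under the 1024-thread limit might look like this (my own sketch, assuming a 32 x 32 tile and an m x m output):

m = 128
TPB = 32                                                   # 32 * 32 = 1024 threads, the per-block maximum
block_size = (TPB, TPB)
grid_size = ((m + TPB - 1) // TPB, (m + TPB - 1) // TPB)   # enough blocks to cover an m x m output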
When I fix that, I see this:
$ cuda-memcheck python somethingsometing.py
========= CUDA-MEMCHECK
========= Invalid __shared__ read of size 4
========= at 0x000008b0 in cudapy::__main__::mm1(Array<int, int=2, A, mutable, aligned>, Array<int, int=2, A, mutable, aligned>, Array<int, int=2, A, mutable, aligned>)
========= by thread (15,11,0) in block (3,2,0)
========= Address 0x00000ec0 is out of bounds
The cause is obvious:
for i in range(a.shape[1]):
    sum += a_cache[row][i] * b_cache[i][column]
row and column are dimensions in the execution grid, not in the local shared memory tile, and similarly i is bounded by the shape of a, not the shape of a_cache (note also that you seem to lapse into C-style 2D array indexing syntax about halfway through the code, which is a potential bug if you don't understand the difference between the two in Python).
To fix it, you will have to change the indexing and then implement the rest of the code for the multiplication (i.e. you must iteratively load whole row and column slices through the local shared tiles to compute the full dot product for each row/column pair that the block will process). A sketch of that pattern is below.
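A minimal sketch of that pattern, assuming square TPB x TPB blocks, an m x m output, and dimensions divisible by TPB (my own illustration, not the asker's assignment template):

import numpy as np
from numba import cuda, types

TPB = 32  # tile edge; my choice, must divide the matrix dimensions here

@cuda.jit
def mm_tiled(a, b, c):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    row, column = cuda.grid(2)  # global output coordinates
    a_cache = cuda.shared.array((TPB, TPB), types.int32)
    b_cache = cuda.shared.array((TPB, TPB), types.int32)
    acc = 0
    # Walk tile-by-tile along the shared dimension; the shared tiles are
    # indexed with threadIdx, never with the global row/column.
    for t in range(a.shape[1] // TPB):
        a_cache[tx, ty] = a[row, t * TPB + ty]
        b_cache[tx, ty] = b[t * TPB + tx, column]
        cuda.syncthreads()                 # tile fully loaded before reading
        for i in range(TPB):               # i is bounded by the tile, not by a.shape[1]
            acc += a_cache[tx, i] * b_cache[i, ty]
        cuda.syncthreads()                 # done reading before the next load
    c[row, column] = acc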
Also note that
- the dimensions you have selected for c are wrong (it should be m x m)
- the grid size you run the kernel on is also wrong, because the dimensions of C are wrong, and so your code could never compute the whole matrix
- even after fixing all of that, it is likely that the results of the multiplication will be incorrect at anything other than trivial sizes because of integer overflow (see the sketch after this list).
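That overflow point is easy to check on the host (my own sketch; the arange-filled inputs from the question grow with m*n):

import numpy as np

# Compute the exact products in int64 and see when the peak value no
# longer fits in int32. (128, 32) still fits; the larger sizes do not.
for m, n in [(128, 32), (512, 128), (2048, 512)]:
    a = np.arange(m * n, dtype=np.int64).reshape(m, n)
    b = np.arange(m * n, dtype=np.int64).reshape(n, m)
    peak = (a @ b).max()
    print(m, n, peak, peak > np.iinfo(np.int32).max)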
@disruptive: Hi, did you find a solution to your problem?
I had the same problem as you, but I solved it by restarting the kernel of my Jupyter notebook.
My code is slightly different from yours:
@cuda.jit
def mm_shared(a, b, c):
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    col, row = cuda.grid(2)
    row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row][col]
    b_cache[cuda.threadIdx.y, cuda.threadIdx.x] = b[col][row]
    for i in range(a.shape[1]):
        a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, cuda.threadIdx.y + i * N]
        b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[cuda.threadIdx.x + i * N, col]
        cuda.syncthreads()
        for j in range(N):
            sum += a_cache[cuda.threadIdx.x, j] * b_cache[j, cuda.threadIdx.y]
        # Wait until all threads finish computing
        cuda.syncthreads()
    c[row][col] = sum
Please let me know if you have any updates.
This is the correct solution:
import numpy as np
from numba import cuda, types

@cuda.jit
def mm_shared(a, b, c):
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    # TODO: use each thread to populate one element each a_cache and b_cache
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x
    TPB = int(N)
    for i in range(a.shape[1] // TPB):  # integer division so range() gets an int
        a_cache[tx, ty] = a[x, ty + i * TPB]
        b_cache[tx, ty] = b[tx + i * TPB, y]
        cuda.syncthreads()
        for j in range(TPB):
            # TODO: calculate the `sum` value correctly using values from the cache
            sum += a_cache[tx][j] * b_cache[j][ty]
        cuda.syncthreads()
    c[x][y] = sum
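For completeness, this is roughly how I launch and verify it (my own harness; N must divide the matrix dimensions, and the output is m x m as noted above):

N = 32                                   # threads per block in each dimension
m = 128
block_size = (N, N)
grid_size = (m // N, m // N)             # blocks covering the m x m output

a = np.arange(m * N).reshape(m, N).astype(np.int32)
b = np.arange(m * N).reshape(N, m).astype(np.int32)
c = np.zeros((m, m), dtype=np.int32)     # output is m x m, not m x n

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)

mm_shared[grid_size, block_size](d_a, d_b, d_c)
print(np.array_equal(d_c.copy_to_host(), a @ b))   # expect True at this size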