CUDA - dynamic shared memory triggers thrust::system::system_error
I have just started learning CUDA programming through Udacity. Merely trying to use dynamic shared memory gives me the following error:
CUDA error at: main.cpp:55
invalid argument cudaGetLastError()
terminate called after throwing an instance of 'thrust::system::system_error'
what(): unload of CUDA runtime failed
We are unable to execute your code. Did you set the grid and/or block size correctly?
I have searched a lot but still cannot figure out what is wrong. Interestingly, the error goes away if I change the last two lines to:
compact_kernel<<<numBlocks, numThreadsPerBlock, sizeof(int)*1000>>>(d_inputVals, d_inputPos, d_outputVals, d_outputPos, numElems, 0);
compact_kernel<<<numBlocks, numThreadsPerBlock, sizeof(int)*1000>>>(d_inputVals, d_inputPos, &d_outputVals[numElems/2], &d_outputPos[numElems/2], numElems, 1);
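With that change, the code runs without throwing an error. However, that makes no sense to me, because the space for dynamic memory allocation should not be restricted to a constant. Maybe the problem is not my code but the setup on Udacity? The code I wrote is below. Any help would be appreciated.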
__global__ void compact_kernel(unsigned int* const d_inputVals,
                               unsigned int* const d_inputPos,
                               unsigned int* const d_outputVals,
                               unsigned int* const d_outputPos,
                               const size_t numElems,
                               const size_t refBit)
{
    const size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    // predicate
    const bool predicate = (d_inputVals[tid] & 1) == refBit;
    extern __shared__ int s[];
}

void your_sort(unsigned int* const d_inputVals,
               unsigned int* const d_inputPos,
               unsigned int* const d_outputVals,
               unsigned int* const d_outputPos,
               const size_t numElems)
{
    const size_t numBlocks = numElems/512;
    const size_t numThreadsPerBlock = 256;
    compact_kernel<<<numBlocks, numThreadsPerBlock, sizeof(int)*numElems>>>(d_inputVals, d_inputPos, d_outputVals, d_outputPos, numElems, 0);
    compact_kernel<<<numBlocks, numThreadsPerBlock, sizeof(int)*numElems>>>(d_inputVals, d_inputPos, &d_outputVals[numElems/2], &d_outputPos[numElems/2], numElems, 1);
}
Edit: The value of numElems is 220480. Is this number too large for dynamic memory allocation?
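Yes. On all current CUDA devices, shared memory is limited to 48 KB per thread block. The third launch parameter is the number of bytes of dynamic shared memory for each block, not for the whole grid, so sizeof(int)*numElems asks for 4 * 220480 = 881,920 bytes per block, far above that limit, and the launch fails with "invalid argument". sizeof(int)*1000 is only 4,000 bytes, which is why the constant version runs.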
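One way to stay under the limit is to size the dynamic shared memory by the block size rather than by numElems, and to compare the request against the limit the device reports before launching. The following is only a minimal sketch, assuming one int of scratch space per thread is enough for the algorithm. `stage_kernel` and `sharedBytesPerBlock` are hypothetical names made up for illustration; they are not part of the question's code or the Udacity assignment.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: bytes of dynamic shared memory one block needs,
// assuming one int of scratch space per thread.
inline size_t sharedBytesPerBlock(size_t threadsPerBlock) {
    return threadsPerBlock * sizeof(int);
}

// Illustrative kernel (not the question's compact_kernel): each thread
// stages its element in the block's dynamic shared memory and copies it out.
__global__ void stage_kernel(const unsigned int* in, unsigned int* out, size_t n) {
    extern __shared__ int s[];  // size comes from the third launch parameter
    const size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (tid < n) {              // guard against reading past the array
        s[threadIdx.x] = (int)in[tid];
        out[tid] = (unsigned int)s[threadIdx.x];
    }
}

int main() {
    const size_t numElems = 220480;  // value from the question
    const unsigned int threads = 256;
    const unsigned int blocks = (unsigned int)((numElems + threads - 1) / threads);

    // Ask the device for its per-block shared memory limit.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no usable CUDA device\n");
        return 1;
    }

    const size_t bytes = sharedBytesPerBlock(threads);
    printf("requesting %zu bytes per block, device limit is %zu bytes\n",
           bytes, prop.sharedMemPerBlock);
    if (bytes > prop.sharedMemPerBlock) {
        printf("request exceeds the per-block shared memory limit\n");
        return 1;
    }

    unsigned int* d_in = nullptr;
    unsigned int* d_out = nullptr;
    cudaMalloc(&d_in, numElems * sizeof(unsigned int));
    cudaMalloc(&d_out, numElems * sizeof(unsigned int));
    cudaMemset(d_in, 0, numElems * sizeof(unsigned int));

    stage_kernel<<<blocks, threads, bytes>>>(d_in, d_out, numElems);

    // A bad launch configuration is reported by cudaGetLastError().
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    printf("launch status: %s\n", cudaGetErrorString(err));

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

With 256 threads per block the request is 1,024 bytes per block, regardless of how large numElems is. If the algorithm needs more scratch space per block than the device allows, the data has to be processed in block-sized tiles or kept in global memory instead.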