CUDA - dynamic shared memory triggers thrust::system::system_error

I've just started learning CUDA programming through Udacity. As soon as I try to use dynamic shared memory, I get the following error.

    CUDA error at: main.cpp:55
    invalid argument cudaGetLastError()
    terminate called after throwing an instance of 'thrust::system::system_error'
    what():  unload of CUDA runtime failed

    We are unable to execute your code. Did you set the grid and/or block size correctly?

I've searched a lot but still can't figure out what is wrong. Interestingly, if I change the last two lines to

    compact_kernel<<<numBlocks, numThreadsPerBlock, sizeof(int)*1000>>>(d_inputVals, d_inputPos, d_outputVals, d_outputPos, numElems, 0);   
    compact_kernel<<<numBlocks, numThreadsPerBlock, sizeof(int)*1000>>>(d_inputVals, d_inputPos, &d_outputVals[numElems/2], &d_outputPos[numElems/2], numElems, 1); 

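the code runs without throwing an error. That doesn't make sense to me, though, because the size of a dynamic shared memory allocation shouldn't have to be a constant. Maybe the problem isn't my code but the setup on Udacity? My code is below. Any help would be appreciated.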

    __global__ void compact_kernel(unsigned int* const d_inputVals,
        unsigned int* const d_inputPos,
        unsigned int* const d_outputVals,
        unsigned int* const d_outputPos,
        const size_t numElems,
        const size_t refBit)
    {
        const size_t tid = blockIdx.x * blockDim.x + threadIdx.x;

        // predicate
        const bool predicate = (d_inputVals[tid] & 1) == refBit;
        extern __shared__ int s[];
    }

    void your_sort(unsigned int* const d_inputVals,
        unsigned int* const d_inputPos,
        unsigned int* const d_outputVals,
        unsigned int* const d_outputPos,
        const size_t numElems)
    {
        const size_t numBlocks = numElems/512;
        const size_t numThreadsPerBlock = 256;
        compact_kernel<<<numBlocks, numThreadsPerBlock, sizeof(int)*numElems>>>(d_inputVals, d_inputPos, d_outputVals, d_outputPos, numElems, 0);
        compact_kernel<<<numBlocks, numThreadsPerBlock, sizeof(int)*numElems>>>(d_inputVals, d_inputPos, &d_outputVals[numElems/2], &d_outputPos[numElems/2], numElems, 1);

    }

Edit: the value of numElems is 220480. Is that too large for a dynamic shared memory allocation?

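According to the programming guide, shared memory is limited to 48 KB per thread block on all current CUDA devices. The third launch parameter is the amount of shared memory per block, not for the whole grid. With numElems = 220480, sizeof(int)*numElems asks for about 861 KB per block (assuming a 4-byte int), so the launch is rejected with "invalid argument". sizeof(int)*1000 is only about 4 KB, which is why the constant version runs.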
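To make the size comparison concrete, here is a minimal host-side sketch. It is plain C++ that should also compile in a `.cu` file. The names `kSharedMemLimitBytes`, `sharedBytesFor`, `fitsInSharedMemory` and `checkedSharedBytes` are hypothetical helpers made up for illustration, and the hard-coded 48 KB is an assumption taken from the limit quoted above; on a real device you would read the `sharedMemPerBlock` field of `cudaDeviceProp` instead.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: the 48 KB default per-block limit quoted above. On a real
// device, read cudaDeviceProp::sharedMemPerBlock instead of hard-coding it.
constexpr std::size_t kSharedMemLimitBytes = 48 * 1024;

// Bytes requested when the third launch parameter is sizeof(int) * count.
constexpr std::size_t sharedBytesFor(std::size_t count) {
    return sizeof(int) * count;
}

// True if a dynamic shared memory request fits within the per-block limit.
constexpr bool fitsInSharedMemory(std::size_t bytes) {
    return bytes <= kSharedMemLimitBytes;
}

// Hypothetical helper: returns the byte count to pass as the third launch
// parameter; in debug builds it aborts if the request exceeds the limit.
inline std::size_t checkedSharedBytes(std::size_t count) {
    const std::size_t bytes = sharedBytesFor(count);
    assert(fitsInSharedMemory(bytes) && "request exceeds per-block shared memory");
    return bytes;
}

// Values from the question.
constexpr std::size_t numElems = 220480;
constexpr std::size_t numThreadsPerBlock = 256;

// One int per input element: far above the limit, so the launch is rejected.
static_assert(!fitsInSharedMemory(sharedBytesFor(numElems)),
              "expected to exceed the limit");
// The hard-coded 1000 ints that happened to work.
static_assert(fitsInSharedMemory(sharedBytesFor(1000)),
              "expected to fit");
// One int per thread in the block also fits comfortably.
static_assert(fitsInSharedMemory(sharedBytesFor(numThreadsPerBlock)),
              "expected to fit");
```

If the kernel only needs one int of scratch space per thread, the launch could pass `checkedSharedBytes(numThreadsPerBlock)` as the third parameter instead of `sizeof(int)*numElems`. Whether one int per thread is enough depends on how the compaction is eventually implemented, which the question does not show.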