CUDA shared memory stops working with more than 1 block

Question

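I am writing a program for school that uses shared memory. I got the function working, but it only works with 1 block per grid. With more than 1 block per grid, the function no longer works.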

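For example, if I send in an array of size 10 with 1 block and 10 threads, the function works correctly. If I send in the same array with 2 blocks of 5 threads each, it no longer works. When I say no longer works, I am getting the same array I sent in, back out, like nothing is happening.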

__global__ void rankSortShared(int* a, int n)
{
    int threadId = threadIdx.x + blockDim.x * blockIdx.x;
    int x = 0;

    // Make dynamic sized shared array
    // size will be determined from the 3rd parameter in the
    // kernel call 'rankSortShared<<<blocksPerGrid, threadsPerBlock, *(size of shared mem)*>>>()'
    extern __shared__ int b[];

    // copy data from incoming array to shared array
    // then sync the threads so all threads have put their 
    // values into the shared array
    b[threadId] = a[threadId];
    __syncthreads();

    // now use shared array for faster lookups
    for (int j = 0; j < n; j++)
    {
        // handle duplicate values
        if (b[j] < b[threadId] || (b[threadId] == b[j] && j < threadId))
        {
            x++;
        }   
    }
    // put data back into array to be transferred back to CPU
    a[x] = b[threadId];
}

For the life of me, I cannot figure out why. Is there a special way to use shared memory across multiple blocks?

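To try to debug this, I replaced all of the code with a[threadId] = threadId and the array came back correct (printing the numbers 0 to 9). So I don't understand why this fails when threadId appears to be correct and the kernel works with 1 block.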

Answer 1

There is a problem here:

int threadId = threadIdx.x + blockDim.x * blockIdx.x;
b[threadId] = ...;

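Your threadId variable is a globally unique thread index. This means that the more threads you have in the grid (for example, via more blocks), the higher this index gets.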

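But shared memory is allocated per block, and its indexing starts at zero in each block. So as you add more blocks, your threadId variable will eventually exceed the amount of shared memory available to the block.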
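To make the difference between the two indices concrete, here is a minimal sketch that records both for the 2-block, 5-thread launch from the question. The kernel name recordIndices and the buffer names are made up for this illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel (hypothetical name): each thread stores its global
// index and its block-local index so the two can be compared on the host.
__global__ void recordIndices(int* globalIdx, int* localIdx)
{
    int threadId = threadIdx.x + blockDim.x * blockIdx.x;
    globalIdx[threadId] = threadId;     // unique across the whole grid
    localIdx[threadId]  = threadIdx.x;  // restarts at 0 in every block
}

int main()
{
    const int n = 10, blocks = 2, threads = 5;
    int *dGlobal, *dLocal;
    cudaMalloc(&dGlobal, n * sizeof(int));
    cudaMalloc(&dLocal, n * sizeof(int));

    recordIndices<<<blocks, threads>>>(dGlobal, dLocal);

    int hGlobal[n], hLocal[n];
    cudaMemcpy(hGlobal, dGlobal, sizeof(hGlobal), cudaMemcpyDeviceToHost);
    cudaMemcpy(hLocal, dLocal, sizeof(hLocal), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; i++)
        printf("global %d -> local %d\n", hGlobal[i], hLocal[i]);

    cudaFree(dGlobal);
    cudaFree(dLocal);
    return 0;
}
```

In the second block the global indices run from 5 to 9 while the local ones run from 0 to 4. If the shared array holds only one int per thread in the block (an assumption, since the launch code was not shown), b[5] through b[9] lie outside it.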

The usual solution is to do this:

b[threadIdx.x] = ...;

because the threadIdx.x variable starts at zero in each block. (You would apply this to every occurrence of b[threadId], replacing each with b[threadIdx.x], not just the one shown.)
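Switching the shared-memory index to threadIdx.x fixes the out-of-range access, but each block then only sees its own slice of the data, so a rank sort across several blocks needs a bit more. One possible approach, sketched below, is to stage the whole input through shared memory one tile at a time. This is not the asker's code: the kernel name rankSortTiled, the separate out array, and the sample data are assumptions made for the illustration. A separate output buffer is used because writing ranks back into the input while other blocks may still be reading it would be a race.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch (hypothetical kernel): each thread ranks one element of `in`
// against all n elements, loading the input into shared memory one tile
// of blockDim.x values at a time.
__global__ void rankSortTiled(const int* in, int* out, int n)
{
    extern __shared__ int tile[];                 // blockDim.x ints per block
    int gid  = threadIdx.x + blockDim.x * blockIdx.x;
    int mine = (gid < n) ? in[gid] : 0;
    int rank = 0;

    for (int base = 0; base < n; base += blockDim.x)
    {
        int src = base + threadIdx.x;
        if (src < n)
            tile[threadIdx.x] = in[src];          // block-local index
        __syncthreads();                          // tile is fully loaded

        int count = min((int)blockDim.x, n - base);
        if (gid < n)
        {
            for (int j = 0; j < count; j++)
            {
                int other = base + j;             // global position of tile[j]
                // ties are broken by global position, as in the question
                if (tile[j] < mine || (tile[j] == mine && other < gid))
                    rank++;
            }
        }
        __syncthreads();                          // finish reading before the next load
    }

    if (gid < n)
        out[rank] = mine;
}

int main()
{
    const int n = 10, threads = 5;
    const int blocks = (n + threads - 1) / threads;
    int h[n] = {7, 3, 9, 3, 1, 8, 2, 7, 0, 5};    // sample data, made up

    int *dIn, *dOut;
    cudaMalloc(&dIn, sizeof(h));
    cudaMalloc(&dOut, sizeof(h));
    cudaMemcpy(dIn, h, sizeof(h), cudaMemcpyHostToDevice);

    // third launch parameter: one int of shared memory per thread in a block
    rankSortTiled<<<blocks, threads, threads * sizeof(int)>>>(dIn, dOut, n);

    cudaMemcpy(h, dOut, sizeof(h), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        printf("%d ", h[i]);
    printf("\n");

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

Every thread in a block runs the same number of loop iterations, so both __syncthreads() calls are reached by the whole block. Error checking is omitted here for brevity.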

There may be other problems in your code. Since you haven't shown the complete code, it isn't possible to diagnose them.

And based on this statement:

When I say no longer works, I am getting the same array I sent in, back out, like nothing is happening.

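My guess is that you are not doing proper CUDA error checking. You really should do that before asking others for help. Even if you don't understand the error output, it will be useful to the people trying to help you.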
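For reference, a common way to do this is to wrap every runtime call in a small checking macro and to query the error state after each kernel launch. The sketch below shows the general pattern. The macro name CUDA_CHECK and the fill kernel are made up for the example; only the cuda* functions are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (CUDA_CHECK is not part of CUDA itself):
// prints the error string with file and line, then exits.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Trivial kernel used only to have something to launch.
__global__ void fill(int* a, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        a[i] = i;
}

int main()
{
    const int n = 10;
    int* d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(int)));

    fill<<<2, 5>>>(d, n);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    int h[n];
    CUDA_CHECK(cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d));

    printf("h[9] = %d\n", h[9]);
    return 0;
}
```

An out-of-range shared memory access is not guaranteed to surface as a runtime error, so it can also help to run the program under compute-sanitizer, which reports such accesses explicitly.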
