CUDA: illegal memory access was encountered with atomicAdd

The atomic operation in my program works fine as long as I don't increase the grid size or call the kernel again. How can that be? Maybe shared memory isn't freed automatically?

    __global__ void DevTest() {
        __shared__ int* k1;
        k1 = new int(0);
        atomicAdd(k1, 1);
    }

    int main()
    {
        for (int i = 0; i < 100; i++) DevTest<<<50, 50>>>();
    }

This:

 __shared__  int* k1;

creates storage in shared memory for a pointer. The pointer is uninitialized there; it doesn't point to anything.

This:

k1 = new int(0);

sets that pointer to point to a location on the device heap, not in shared memory. The device heap is limited by default to 8MB. Furthermore, there is an allocation granularity, so a single int allocation may consume more than 4 bytes of device heap space (in practice, it will).
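If you want to verify the limit on your device, you can query it from the host with cudaDeviceGetLimit; a minimal sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        size_t heapSize = 0;
        // Query the current device heap limit (8MB by default).
        cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
        printf("Device heap limit: %zu bytes\n", heapSize);
        return 0;
    }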

It's generally good practice in C++ to have a corresponding delete for every new operation. Your kernel code does not have this, so as you increase the grid size you use up more and more of the device heap memory, and you will eventually run into the 8MB limit.

So there are at least 2 options to fix this:

  1. delete the allocation created with new at the end of the thread code
  2. increase the limit on the device heap, e.g. with cudaDeviceSetLimit (see the sketch after this list)
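
For option 2, a minimal host-side sketch; the 64MB value is just an illustrative choice, and the call must come before any kernel that uses the device heap is launched:

    #include <cuda_runtime.h>

    int main()
    {
        // Raise the device heap limit from the default 8MB to 64MB
        // (illustrative size). Must precede the first kernel launch
        // that allocates from the device heap.
        cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
        // ... launch kernels as before ...
        return 0;
    }

Note that with your leaking kernel, enlarging the heap only postpones the failure; the allocations still accumulate.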

As an aside, shared memory is shared by all the threads in the threadblock. So for your kernel launch of <<<50,50>>> you have 50 threads in each threadblock. Each of those 50 threads will see the same k1 pointer, and each will try to set it to a separate location/allocation as it executes the new operation. This doesn't make any sense, and it will prevent item 1 above from working correctly: 49 of the 50 allocated pointer values will be lost, so they can never be deleted.

So your code doesn't really make sense; what you are showing here is not a sensible thing to do, and there is no simple way to fix it as written. You could do something like this:

    __global__ void DevTest() {
        __shared__ int* k1;
        if (threadIdx.x == 0) k1 = new int(0);
        __syncthreads();
        atomicAdd(k1, 1);
        __syncthreads();
        if (threadIdx.x == 0) delete k1;
    }

Even such an arrangement could theoretically (perhaps, on some future large GPU) eventually run into the 8MB limit if you launched enough blocks (i.e. if there were enough resident blocks, and taking into account an undefined allocation granularity). So the correct approach is probably to do both 1 and 2.
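
Putting the two fixes together, a sketch of a complete program (again, the 64MB heap size is an arbitrary illustrative value):

    #include <cuda_runtime.h>

    __global__ void DevTest() {
        __shared__ int* k1;
        // Option 1: one thread per block allocates...
        if (threadIdx.x == 0) k1 = new int(0);
        __syncthreads();
        atomicAdd(k1, 1);
        __syncthreads();
        // ...and the same thread deletes, so nothing leaks.
        if (threadIdx.x == 0) delete k1;
    }

    int main()
    {
        // Option 2: enlarge the device heap before any launch.
        cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
        for (int i = 0; i < 100; i++) DevTest<<<50, 50>>>();
        cudaDeviceSynchronize();
        return 0;
    }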