cuda group by 和原子计数器

cuda group by and atomic counters

我有一个无序的标记元素数组:

[0,1,0,2,0,1,2] // labels only

我要排序的:

[0,0,0,1,1,2,2]

我已经计算出每个标签标记了多少个元素并减少为偏移量数组:

[0,3,5]

意思是我知道我需要存储所有从位置 0 开始的 0 标签元素,从位置 3 开始的 1 标签元素等等。

template<typename T>
__global__ void GroupBy(T* output, T* input, int count, int* offsets) {
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    T elem = input[index];
    output[offsets[elem.label]] = elem; // problem here
    atomicAdd(offsets[label], 1);
}

但是,atomicAdd之前的读写操作不是原子的,所以我有内存竞争条件。 我不会把它用于单个计数器,因为

int count = 0;
atomicAdd(&count, 1);
output[count] = elem;

确实会给我每个线程一个唯一的计数器。

我怎样才能解决这个问题并拥有动态的原子计数器数组?

感谢 talonmies 关于 atomicAdd 的 return 值的善意提醒,我已经能够修复我的内核:

template<typename T>
__global__ void GroupBy(T* output, T* input, int count, int* offsets) {
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    T elem = input[index];
    int oldOffset = atomicAdd(&offsets[elem.label], 1);
    output[oldOffset] = elem; 
}

事实上,atomicAdd 自动递增存储在第一个参数中的内容,return旧值:

[atomicAdd(address, val)] reads the 16-bit, 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd