CUDA group by and atomic counters
I have an unsorted array of labeled elements:
[0,1,0,2,0,1,2] // labels only
which I want grouped like this:
[0,0,0,1,1,2,2]
I have already counted how many elements carry each label and reduced the counts to an offsets array:
[0,3,5]
meaning I know that all 0-labeled elements must be stored starting at position 0, 1-labeled elements starting at position 3, and so on.
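For context, here is a minimal host-side sketch (my addition, not part of the original post) of how such an offsets array can be built: a per-label histogram followed by an exclusive prefix sum. The function name and the assumption that labels are dense integers in [0, numLabels) are illustrative only.

#include <vector>

// Count how many elements carry each label, then turn the counts into
// exclusive starting offsets: {0,1,0,2,0,1,2} -> counts {3,2,2} -> offsets {0,3,5}.
std::vector<int> ComputeOffsets(const std::vector<int>& labels, int numLabels) {
    std::vector<int> counts(numLabels, 0);
    for (int label : labels)
        ++counts[label];
    std::vector<int> offsets(numLabels, 0);
    for (int i = 1; i < numLabels; ++i)
        offsets[i] = offsets[i - 1] + counts[i - 1];
    return offsets;
}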
template<typename T>
__global__ void GroupBy(T* output, T* input, int count, int* offsets) {
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index >= count) return;
    T elem = input[index];
    output[offsets[elem.label]] = elem; // problem here: the offset read is not atomic
    atomicAdd(&offsets[elem.label], 1);
}
However, the read and write before the atomicAdd are not atomic themselves, so I have a memory race condition.
I wouldn't have this problem with a single counter, since
int count = 0;
atomicAdd(&count, 1);
output[count] = elem;
does indeed give me a unique counter for every thread.
How can I fix this and have a dynamic array of atomic counters?
Thanks to talonmies' kind reminder about atomicAdd's return value, I was able to fix my kernel:
template<typename T>
__global__ void GroupBy(T* output, T* input, int count, int* offsets) {
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index >= count) return;
    T elem = input[index];
    // atomicAdd returns the counter's previous value, a slot no other thread can claim
    int oldOffset = atomicAdd(&offsets[elem.label], 1);
    output[oldOffset] = elem;
}
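For completeness, here is a minimal launch sketch (my own addition, not from the original answer), assuming a hypothetical element type Item with an int label member. Note that the kernel advances the offsets array in place, so keep a separate copy if the original starting offsets are still needed afterwards.

// Hypothetical element type; any struct with an int `label` member will do.
struct Item { int label; float value; };

void RunGroupBy(Item* d_output, Item* d_input, int count, int* d_offsets) {
    int threads = 256;
    int blocks = (count + threads - 1) / threads;  // enough threads to cover all elements
    GroupBy<Item><<<blocks, threads>>>(d_output, d_input, count, d_offsets);
    cudaDeviceSynchronize();
}

One thing to keep in mind: within each label group the relative order of elements is not deterministic, because it depends on the order in which threads win the atomicAdd.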
Indeed, atomicAdd atomically increments whatever is stored at its first argument and returns the old value:
[atomicAdd(address, val)] reads the 16-bit, 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd
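As a quick sanity check (again a hypothetical helper of my own, reusing the Item struct from the sketch above), the result can be verified on the host by checking that labels in the output never decrease:

#include <vector>

// Grouped output means the label sequence is non-decreasing.
bool IsGrouped(const std::vector<Item>& output) {
    for (size_t i = 1; i < output.size(); ++i)
        if (output[i].label < output[i - 1].label)
            return false;
    return true;
}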