CUDA 中 warp 展开期间的线程同步

Question

我正在努力了解 Mark Harris reduction in CUDA 中的减少技术 #5。

Reduction #5 通过应用最后一个 warp 展开改进了先前的 reduction #4。

幻灯片 21 提到：“我们不需要 __syncthreads()”，这是我不理解的部分。

主要逻辑代码如下：

__device__ void warpReduce(volatile int* sdata, int tid) {
  sdata[tid] += sdata[tid + 32]; // line A
  sdata[tid] += sdata[tid + 16]; // line B
  sdata[tid] += sdata[tid + 8];
  sdata[tid] += sdata[tid + 4];
  sdata[tid] += sdata[tid + 2];
  sdata[tid] += sdata[tid + 1];
}

// later...
for (unsigned int s=blockDim.x/2; s>32; s>>=1) {
  if (tid < s)
    sdata[tid] += sdata[tid + s];
  __syncthreads();
}
if (tid < 32) warpReduce(sdata, tid);

我不明白为什么 A 行和 B 行之间（以及下一个之间）没有 __syncthreads()行也是如此）。

我的问题：是否有可能在同一个 warp 中，一个线程在另一个线程执行行 B 之前执行行 A? （好像不行，有没有人确认一下，详细说说）

Answer 1

is it possible that within the same warp one thread executes line B prior to the other thread executes line A?

在撰写本文时（大约 10 年前），不可能发生这种情况，因为可以保证 warp 以锁步方式执行。请注意，需要声明有问题的内存 volatile 以防止编译器优化缓存 Fermi 和较新 GPU 的缩减步骤之间的结果。在不需要的原始 Tesla 架构上。

但是，state-of-the-art 执行 warp 级操作的方式已经改变，这种设计模式在一些最新的架构上可能不安全。相反，您应该更喜欢 warp level primitives 来减少而不是隐式 warp 同步。有关详细信息，请参阅 this blog post。

CUDA 中 warp 展开期间的线程同步

thread synchronization during warp unrolling in CUDA

synchronization

cuda

nvidia