Misaligned address in CUDA

Question

Can anyone tell me what is wrong with the following code in a CUDA kernel:

__constant__ unsigned char MT[256] = {
    0xde, 0x6f, 0x6f, 0xb1, 0xde, 0x6f, 0x6f, 0xb1, 0x91, 0xc5, 0xc5, 0x54, 0x91, 0xc5, 0xc5, 0x54,....};

typedef unsigned int U32;

__global__ void Kernel (unsigned int  *PT, unsigned int  *CT, unsigned int  *rk)
{

    long int i;
    __shared__ unsigned char sh_MT[256];    

    for (i = 0; i < 64; i += 4)
        ((U32*)sh_MT)[threadIdx.x + i] = ((U32*)MT)[threadIdx.x + i];

    __shared__ unsigned int sh_rkey[4];
    __shared__ unsigned int sh_state_pl[4];
    __shared__ unsigned int sh_state_ct[4];

    sh_state_pl[threadIdx.x] = PT[threadIdx.x];
    sh_rkey[threadIdx.x] = rk[threadIdx.x];
    __syncthreads();


    sh_state_ct[threadIdx.x] = ((U32*)sh_MT)[sh_state_pl[threadIdx.x]]^\
    ((U32*)(sh_MT+3))[((sh_state_pl[(1 + threadIdx.x) % 4] >> 8) & 0xff)] ^ \
    ((U32*)(sh_MT+2))[((sh_state_pl[(2 + threadIdx.x) % 4] >> 16) & 0xff)] ^\
    ((U32*)(sh_MT+1))[((sh_state_pl[(3 + threadIdx.x) % 4] >> 24) & 0xff )];


    CT[threadIdx.x] = sh_state_ct[threadIdx.x];
}

On this line of code,

((U32*)(sh_MT+3))......

the CUDA debugger gives me the error message: misaligned address.

How can I fix this error?

I am using CUDA 7 in MSVC, and I launch the kernel with 1 block and 4 threads, as follows:

__device__ unsigned int *state;
__device__ unsigned int *key;
__device__ unsigned int *ct;
.
.
int main()
{
    cudaMalloc((void**)&state, 16);
    cudaMalloc((void**)&ct, 16);
    cudaMalloc((void**)&key, 16);
    // cudaMemcpy(copy some values to => state, ct, key);
    Kernel<<<1, 4>>>(state, ct, key);
}

Please keep in mind that I cannot change the type of the "MT Table". Thanks in advance for any advice or answer.

Answer 1

The error message means that the pointer is not aligned to the boundary the processor requires.

From the CUDA Programming Guide, section 5.3.2:

Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).

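That is what the debugger is trying to tell you: basically, you should not dereference a pointer to a 32-bit value at an address that is not aligned to a 32-bit boundary. The quoted passage talks about global memory, but the same natural-alignment requirement applies to the shared-memory access in your kernel.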

You can do (U32*)(sh_MT) and (U32*)(sh_MT+4) just fine, but not (U32*)(sh_MT+3) or anything like it.
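The natural-alignment rule can be checked directly: an address is valid for a 4-byte access only if it is a multiple of 4. The following is a minimal sketch, not code from the original post. `is_aligned` and the `HOST_DEVICE` macro are hypothetical names introduced here for illustration, and the function is written so that it compiles both with nvcc (as a host/device function) and with a plain C++ compiler.

```cpp
#include <cstddef>
#include <cstdint>

// Under nvcc, make the helper callable from both host and device code.
// Under a plain C++ compiler the qualifier expands to nothing.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper (not from the original post): returns true when the
// address held in p is a multiple of `size` bytes, i.e. naturally aligned
// for an access of that size. `size` must be non-zero.
HOST_DEVICE inline bool is_aligned(const void *p, std::size_t size)
{
    return reinterpret_cast<std::uintptr_t>(p) % size == 0;
}
```

Assuming the table itself starts on a 4-byte boundary, `is_aligned(sh_MT + 4, 4)` would hold, while `is_aligned(sh_MT + 3, 4)` would not, which is the case the debugger reports.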

You will probably have to read the bytes individually and join them together.
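One way to read the bytes individually and join them might look like the sketch below. It is an illustration under stated assumptions, not the asker's or the answerer's code: `load_u32_unaligned` and `HOST_DEVICE` are hypothetical names, and the bytes are combined lowest byte first (little-endian), which is the byte order NVIDIA GPUs use for an ordinary aligned 32-bit load.

```cpp
#include <cstdint>

// Under nvcc, make the helper callable from both host and device code.
// Under a plain C++ compiler the qualifier expands to nothing.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper (not from the original post): builds a 32-bit value
// from the four bytes p[0..3], lowest byte first (little-endian). It issues
// only 1-byte loads, so p may be any address, aligned or not.
HOST_DEVICE inline std::uint32_t load_u32_unaligned(const unsigned char *p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

With such a helper, an expression of the form `((U32*)(sh_MT+3))[idx]` would become `load_u32_unaligned(sh_MT + 3 + 4 * idx)`, because indexing a `U32` pointer advances 4 bytes per element. The table keeps its `unsigned char` type, and the caller still has to make sure that all four bytes being read lie inside the table.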
