How do I properly implement classes whose members are called both from host and device code in Cuda/C++?

Question

I have a host class TestClass that has, as a member, a pointer to a class TestTable whose data is stored on the GPU. TestClass calls a kernel that accesses the data in TestTable, as well as the method GetValue() of TestClass.

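After reading a lot and trying several options for which type specifiers to use on which methods and classes, and how (and where) to initialize TestTable, I get the feeling that all my options eventually boil down to the same memory access error. So my understanding of how Cuda/C++ works is probably not sufficient to implement this correctly. How should my code be set up properly?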

This is the content of a minimal version of my main.cu:

#include <cstdlib>   // exit
#include <iostream>
#include <string>    // std::string used by cuda_check
#include <cuda_runtime.h>

#define CUDA_CHECK cuda_check(__FILE__,__LINE__)
inline void cuda_check(std::string file, int line)
{
    cudaError_t e = cudaGetLastError();
    if (e != cudaSuccess) {
        std::cout << std::endl
                  << file << ", line " << line << ": "
                  << cudaGetErrorString(e) << " (" << e << ")" << std::endl;
        exit(1);
    }
}

class TestTable {

    float* vector_;
    int num_cells_;

public:

    void Init() {
        num_cells_ = 1e4;
        cudaMallocManaged(&vector_, num_cells_*sizeof(float));
        CUDA_CHECK;
    }

    void Free() {
        cudaFree(vector_);
    }

    __device__
    bool UpdateValue(int global_index, float val) {
        int index = global_index % num_cells_;
        vector_[index] = val;
        return false;
    }

};

class TestClass {

private:

    float value_;
    TestTable* test_table_;

public:

    TestClass() : value_(1.) {
        // test_table_ = new TestTable;
        cudaMallocManaged(&test_table_, sizeof(TestTable));
        test_table_->Init();
        CUDA_CHECK;
    }

    ~TestClass() {
        test_table_->Free();
        cudaFree(test_table_);
        CUDA_CHECK;
    }

    __host__ __device__
    float GetValue() {
        return value_;
    }

    __host__
    void RunKernel();

};

__global__
void test_kernel(TestClass* test_class, TestTable* test_table) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < 1e6; i += stride) {
        const float val = test_class->GetValue();
        test_table->UpdateValue(i, val);
    }
}

__host__
void TestClass::RunKernel() {
    test_kernel<<<1,1>>>(this, test_table_);
    cudaDeviceSynchronize(); CUDA_CHECK;
}

int main(int argc, char *argv[]) {

    TestClass* test_class = new TestClass();
    std::cout << "TestClass successfully constructed" << std::endl;

    test_class->RunKernel();
    std::cout << "Kernel successfully run" << std::endl;

    delete test_class;
    std::cout << "TestClass successfully destroyed" << std::endl;

    return 0;
}

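The error I get is line 88: an illegal memory access was encountered (700).

I assume the error lies in one of the following problems:

- TestTable is not properly created with new, which is probably bad. However, uncommenting test_table_ = new TestTable; in TestClass() does not solve the problem.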

test_kernel

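- GetValue() does not return a valid float variable. If I replace the call with an arbitrary float such as 1.f, the program runs without error. However, in the real (not minimal) version of my code, GetValue() performs a lot of computation at different points in the code base, so hard-coding is not an option.
- I never copy TestClass to the GPU, yet I call one of its member functions from within a kernel. I know this must cause trouble, but I do not find it intuitive where and how to copy it. If I only call GetValue() in the kernel without reusing its result, no error arises, so it seems that my program can call GetValue() without the class having been copied to the GPU.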

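Possibly related questions that I was unable to apply to my specific problem:

- This one seems very similar, but somehow I cannot translate it to my use case.
- Here, I am not sure how the fact that I have two classes "interacting" with each other affects the solution.
- CUDA and Classes - This question seems more general to me.

Any help is very much appreciated!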

Answer 1

The problem here has to do with how you allocate TestClass:

TestClass* test_class = new TestClass();

test_class is now an ordinary pointer to host memory. If you intend to use that pointer in device code:

void TestClass::RunKernel() {
    test_kernel<<<1,1>>>(this, test_table_);
                         ^^^^

and:

void test_kernel(TestClass* test_class, TestTable* test_table) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < 1e6; i += stride) {
        const float val = test_class->GetValue();
                          ^^^^^^^^^^

That won't work. In CUDA, dereferencing a host pointer in device code is generally a fundamental problem.
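One way to see which side of that line a pointer falls on is to ask the runtime. The following is only a diagnostic sketch, assuming CUDA 10 or newer (where cudaPointerAttributes has a type field); the helper name is_device_or_managed is made up for illustration and is not part of the question's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: reports whether the runtime classifies `ptr` as a
// device or managed allocation. Plain host pointers (new, malloc, stack)
// are reported as unregistered, or as an error on older toolkits.
bool is_device_or_managed(const void* ptr) {
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) {
        cudaGetLastError();  // clear the error state left by the failed query
        return false;
    }
    return attr.type == cudaMemoryTypeDevice ||
           attr.type == cudaMemoryTypeManaged;
}

int main() {
    float* host_ptr = new float(1.f);   // ordinary host allocation
    float* managed_ptr = nullptr;
    if (cudaMallocManaged(&managed_ptr, sizeof(float)) != cudaSuccess) {
        std::printf("cudaMallocManaged failed\n");
        delete host_ptr;
        return 1;
    }

    std::printf("new:               %s\n",
                is_device_or_managed(host_ptr) ? "device/managed" : "host only");
    std::printf("cudaMallocManaged: %s\n",
                is_device_or_managed(managed_ptr) ? "device/managed" : "host only");

    cudaFree(managed_ptr);
    delete host_ptr;
    return 0;
}
```

A check like this can be dropped in just before a kernel launch while debugging; it says nothing about pinned host memory, which is a separate case.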

We can fix this by using placement new with a managed allocator, for the top-level class:

//TestClass* test_class = new TestClass();
TestClass* test_class;
cudaMallocManaged(&test_class, sizeof(TestClass));
new(test_class) TestClass();

When we do that, the deallocation also has to change. As indicated in the comments, you should also make sure the destructor is called before de-allocation:

// delete test_class;
test_class->~TestClass();
cudaFree(test_class);

When I make those changes, your code runs without any runtime error for me.
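Put together, the allocation and deallocation changes can be sketched as a small standalone program. This is a reduced illustration, not the question's full code: TestClass here is a stand-in without the TestTable member, and read_value and read_value_via_managed are hypothetical names introduced only for this example.

```cuda
#include <cstdio>
#include <new>            // placement new
#include <cuda_runtime.h>

// Reduced stand-in for the question's TestClass; the TestTable member is left out.
class TestClass {
    float value_;
public:
    TestClass() : value_(1.f) {}
    __host__ __device__ float GetValue() const { return value_; }
};

__global__ void read_value(const TestClass* obj, float* out) {
    *out = obj->GetValue();  // obj points into managed memory, so this is valid
}

// Hypothetical driver: builds the object in managed memory, reads it on the
// device, and returns the value the kernel saw (or -1.f on any CUDA error).
float read_value_via_managed() {
    TestClass* obj = nullptr;
    float* out = nullptr;
    if (cudaMallocManaged(&obj, sizeof(TestClass)) != cudaSuccess) return -1.f;
    if (cudaMallocManaged(&out, sizeof(float)) != cudaSuccess) {
        cudaFree(obj);
        return -1.f;
    }

    new (obj) TestClass();   // run the constructor inside the managed allocation

    read_value<<<1, 1>>>(obj, out);
    float result = -1.f;
    if (cudaGetLastError() == cudaSuccess &&
        cudaDeviceSynchronize() == cudaSuccess) {
        result = *out;       // safe to read on the host after synchronization
    }

    obj->~TestClass();       // destructor first, then release the memory
    cudaFree(obj);
    cudaFree(out);
    return result;
}

int main() {
    std::printf("value seen by the kernel: %f\n", read_value_via_managed());
    return 0;
}
```

The order at the end matters: the destructor has to run while the managed allocation is still alive, which is why cudaFree comes last.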
