How to interpret these results for mean filter for both GPU and CPU serial versions?

I have implemented an image mean filter in two versions: a serial one for the CPU and a parallel one for an NVIDIA GPU. I measured the running times (see the results of the test cases and the specs of the device). Why does case 2 have the highest speedup and case 3 the lowest?

GPU execution configuration

        int block_size = 32;
        int grid_size = width/block_size; //width of the image in pixels
        dim3 dimBlock(block_size, block_size, 1);
        dim3 dimGrid(grid_size, grid_size, 1);

Time measurement of the GPU code

        clock_t start_d=clock();
        meanFilter_d <<< dimGrid, dimBlock >>> (image_data_d, result_image_data_d, width, height, half_window);
        cudaDeviceSynchronize();   // cudaThreadSynchronize() is deprecated
        clock_t end_d=clock();

Time measurement of the CPU code (single thread)

        clock_t start_h = clock();
        meanFilter_h(data, result_image_data_h1, width, height, window_size);
        clock_t end_h = clock();

Host code

void meanFilter_h(unsigned char* raw_image_matrix,unsigned char* filtered_image_data,int image_width, int image_height, int window_size)
{
    // int size = 3 * image_width * image_height;
    int half_window = (window_size-window_size % 2)/2;
    for(int i = 0; i < image_height; i += 1){
        for(int j = 0; j < image_width; j += 1){
            int k = 3*(i*image_width+j); // byte offset of pixel (i, j): 3 bytes per pixel, row stride = image_width
            int top, bottom, left, right; 
            if(i-half_window >= 0){top = i-half_window;}else{top = 0;}// top limit
            if(i+half_window <= image_height-1){bottom = i+half_window;}else{bottom = image_height-1;}// bottom limit
            if(j-half_window >= 0){left = j-half_window;}else{left = 0;}// left limit
            if(j+half_window <= image_width-1){right = j+half_window;}else{right = image_width-1;}// right limit
            double first_byte = 0; 
            double second_byte = 0; 
            double third_byte = 0; 
            // move inside the window
            for(int x = top; x <= bottom; x++){
                for(int y = left; y <= right; y++){
                    int pos = 3*(x*image_width + y); // three bytes (RGB) per pixel
                    first_byte += raw_image_matrix[pos];
                    second_byte += raw_image_matrix[pos+1];
                    third_byte += raw_image_matrix[pos+2];
                }
            }
            int effective_window_size = (bottom-top+1)*(right-left+1);
            filtered_image_data[k] = first_byte/effective_window_size;
            filtered_image_data[k+1] = second_byte/effective_window_size;
            filtered_image_data[k+2] = third_byte/effective_window_size;
        }
    }
}

Device code

__global__ void meanFilter_d(unsigned char* raw_image_matrix, unsigned char* filtered_image_data, int image_width, int image_height, int half_window)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;

    if (i < image_height && j < image_width){
        int k = 3*(i*image_width+j); // byte offset of pixel (i, j): 3 bytes per pixel, row stride = image_width
        int top, bottom, left, right; 
        if(i-half_window >= 0){top = i-half_window;}else{top = 0;}// top limit
        if(i+half_window <= image_height-1){bottom = i+half_window;}else{bottom = image_height-1;}// bottom limit
        if(j-half_window >= 0){left = j-half_window;}else{left = 0;}// left limit
        if(j+half_window <= image_width-1){right = j+half_window;}else{right = image_width-1;}// right limit
        double first_byte = 0; 
        double second_byte = 0; 
        double third_byte = 0; 
        // move inside the window
        for(int x = top; x <= bottom; x++){
            for(int y = left; y <= right; y++){
                int pos = 3*(x*image_width + y); // three bytes (RGB) per pixel
                first_byte += raw_image_matrix[pos];
                second_byte += raw_image_matrix[pos+1];
                third_byte += raw_image_matrix[pos+2];
            }
        }
        int effective_window_size = (bottom-top+1)*(right-left+1);
        filtered_image_data[k] = first_byte/effective_window_size;
        filtered_image_data[k+1] = second_byte/effective_window_size;
        filtered_image_data[k+2] = third_byte/effective_window_size;
    }
}

It can be seen that, for both image sizes, the 3×3 kernel is slower than the 5×5 kernel. Case 1 has more parallelism than case 3 due to the larger image size; therefore the device utilization of case 1 is higher than that of case 3. But I have no idea how to explain this any further. Please give me some insight.

The first thing to point out is: what are you measuring and, most importantly, how? From your question it is impossible to tell, especially the how.

In any case, I strongly suggest you take a look at this very simple and useful article by Mark Harris, which explains some good practices for timing device-side code (i.e. CUDA kernels, memory transfers, and so on).

As an aside, trying to quantify a CPU/GPU speedup is a rather tricky topic, because of the fundamentally different nature of the two architectures. Even if your CPU and GPU codes apparently do the same thing, there are many factors you may want to take into account (e.g. the number of CPU cores, the number of GPU streaming multiprocessors, and the number of cores per SM). Here Robert Crovella gives an excellent answer to a similar question, in which he says:

If you make any claims about "the GPU is faster than the CPU by XX", then IMO you are well-advised to compare only codes that do the same work and efficiently and effectively use the underlying architectures (for both CPU and GPU). For example in the CPU case you should certainly be using a multi-threaded code, so as to take advantage of the multiple CPU cores that most modern CPUs offer. These sorts of claims are likely to be viewed with skepticism anyway, so probably best to avoid them unless it is the crux of your intent.

I suggest you also take a look at this discussion.
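
To illustrate the multi-threaded CPU point from the quote above, here is a minimal sketch of how the posted serial filter could be parallelized over image rows with OpenMP. The function name meanFilter_omp is hypothetical and this is only one of several possible approaches:

    // Hypothetical OpenMP variant of the posted meanFilter_h (not the author's code).
    // Each output pixel is independent, so rows can be processed by different threads
    // without any synchronization. Compile with -fopenmp (gcc/clang) or /openmp (MSVC).
    void meanFilter_omp(unsigned char* raw_image_matrix, unsigned char* filtered_image_data,
                        int image_width, int image_height, int window_size)
    {
        int half_window = window_size / 2;
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < image_height; i++) {
            for (int j = 0; j < image_width; j++) {
                // clamp the window to the image borders
                int top    = (i - half_window >= 0) ? i - half_window : 0;
                int bottom = (i + half_window <= image_height - 1) ? i + half_window : image_height - 1;
                int left   = (j - half_window >= 0) ? j - half_window : 0;
                int right  = (j + half_window <= image_width - 1) ? j + half_window : image_width - 1;
                double sum_r = 0, sum_g = 0, sum_b = 0;
                for (int x = top; x <= bottom; x++) {
                    for (int y = left; y <= right; y++) {
                        int pos = 3 * (x * image_width + y);   // three bytes (RGB) per pixel
                        sum_r += raw_image_matrix[pos];
                        sum_g += raw_image_matrix[pos + 1];
                        sum_b += raw_image_matrix[pos + 2];
                    }
                }
                int effective_window_size = (bottom - top + 1) * (right - left + 1);
                int k = 3 * (i * image_width + j);
                filtered_image_data[k]     = sum_r / effective_window_size;
                filtered_image_data[k + 1] = sum_g / effective_window_size;
                filtered_image_data[k + 2] = sum_b / effective_window_size;
            }
        }
    }

Timing a version like this instead of the single-threaded one is generally considered a fairer baseline for the kind of speedup claim discussed in the quote.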

After these premises, I don't think you can consider those speedups reliable (in fact, they look a bit odd to me).
Trying to interpret what you wanted to say:

It can be seen that both image sizes with 3×3 kernel is slower

Perhaps you meant to say that with the 3x3 kernel you get a smaller speedup than with the 5x5 window size. Try to be more precise.

Why case 2 has the highest speedup and case 3 has the lowest speedup?

Well, it is hard to deduce much from the little information you provided.

Please add: some code, so that we can see what you are doing and how you implemented the problem in both the device and host cases, and a description of how and what you measure.


EDIT:

Well, I think you should take your measurements in a more accurate way.

  • First, I suggest you use a more accurate alternative to clock(). Looking at the answer here and at the C++ reference, I suggest you consider using std::chrono::system_clock::now() or std::chrono::high_resolution_clock::now().
  • Then, I repeat my suggestion that you read Mark Harris' article (linked above). There he says:

    A problem with using host-device synchronization points, such as cudaDeviceSynchronize(), is that they stall the GPU pipeline. For this reason, CUDA offers a relatively light-weight alternative to CPU timers via the CUDA event API. The CUDA event API includes calls to create and destroy events, record events, and compute the elapsed time in milliseconds between two recorded events.

This means that measurements taken with the CUDA event API are likely to be less "distorted" than the ones you provided using cudaDeviceSynchronize(). Moreover, if you use a plain cudaMemcpy, there is no need for an explicit synchronization mechanism, since it is a synchronous call. (A minimal sketch combining std::chrono and CUDA events follows this list.)

  • Also consider including the H2D/D2H transfers in your timings: in my opinion it is important to account for this overhead in a CPU/GPU comparison (but that choice is up to you);
  • Regarding the measurements you show in the picture, are they the result of a single execution, or the average over several runs (possibly discarding outliers)?
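
Putting the suggestions above together, a minimal timing sketch could look like the following. It assumes the buffers, kernel and launch configuration from your post (data, result_image_data_h1, image_data_d, result_image_data_d, dimGrid, dimBlock, meanFilter_h, meanFilter_d) are already set up, and it only shows the measurement part:

    // Minimal timing sketch (assumes buffers and launch configuration as in the post).
    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    // Host timing with std::chrono instead of clock()
    auto start_h = std::chrono::high_resolution_clock::now();
    meanFilter_h(data, result_image_data_h1, width, height, window_size);
    auto end_h = std::chrono::high_resolution_clock::now();
    double ms_h = std::chrono::duration<double, std::milli>(end_h - start_h).count();

    // Device timing with CUDA events instead of clock() + cudaDeviceSynchronize()
    cudaEvent_t ev_start, ev_stop;
    cudaEventCreate(&ev_start);
    cudaEventCreate(&ev_stop);

    // To include the H2D/D2H overhead, record ev_start before the cudaMemcpy to the
    // device and ev_stop after the copy back; here only the kernel is timed.
    cudaEventRecord(ev_start);
    meanFilter_d<<<dimGrid, dimBlock>>>(image_data_d, result_image_data_d,
                                        width, height, half_window);
    cudaEventRecord(ev_stop);
    cudaEventSynchronize(ev_stop);                    // wait until the kernel has finished

    float ms_d = 0.0f;
    cudaEventElapsedTime(&ms_d, ev_start, ev_stop);   // elapsed time in milliseconds

    printf("CPU: %.3f ms  GPU: %.3f ms  speedup: %.2fx\n", ms_h, ms_d, ms_h / ms_d);

    cudaEventDestroy(ev_start);
    cudaEventDestroy(ev_stop);

Averaging several runs of this (and discarding the first "warm-up" launch) gives much more stable figures than a single clock() sample.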

I think you should take new measurements following the suggestions above and then reason about the figures you obtain.

By the way,

Case 1 has more parallelism than the case 3 due to the larger image size. Therefore, Utilization of device of case 1 is higher than that of case 3.

I disagree, because of your int grid_size = width/block_size;, which gives:

Case 1: grid_size = 640/32 = 20

Case 2: grid_size = 1280/32 = 40

So in case 2 you actually have more parallelism. But since you only have 2 SMs, this may be why the execution time is higher than you expected: in other words, you have more blocks (40*40) waiting for the two SMs to process them.
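
A related note on the launch configuration (my own suggestion, not something from the question): int grid_size = width/block_size uses truncating integer division, so any width or height that is not a multiple of 32 would leave border pixels unprocessed. The usual rounding-up idiom is:

    int block_size = 32;
    dim3 dimBlock(block_size, block_size, 1);
    // Round up so that partial tiles at the right/bottom edges still get a block;
    // the existing (i < image_height && j < image_width) check in the kernel
    // discards the out-of-range threads.
    dim3 dimGrid((width  + block_size - 1) / block_size,
                 (height + block_size - 1) / block_size,
                 1);

With your square grid this amounts to 20*20 = 400 blocks in case 1 and 40*40 = 1600 blocks in case 2, all of which have to be scheduled onto only 2 SMs.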