Using GPU for pixel format conversions?

Question

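I'm trying to convert packed 24bpp RGB images to packed 32bpp RGBA as efficiently as possible. I've tried vImageConvert_RGB888toRGBA8888 from Accelerate.framework, but I was wondering whether a compute kernel in Metal could be faster. I've tried several different approaches in Metal, but the result is always considerably slower than Accelerate.framework, even for large images with more than 1 million pixels.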

Here is what my compute kernel looks like:

kernel void rgb24_to_rgba32(texture2d<half, access::read> inTexture [[texture(0)]],
                     texture2d<half, access::write> outTexture [[texture(1)]],
                     uint2 id [[ thread_position_in_grid ]])
{
    // The input is a single-channel texture, so each output pixel needs three
    // consecutive texels from the same row: R, G, and B.
    uint2 srcAddr1 = uint2(id.x * 3, id.y);
    uint2 srcAddr2 = uint2(id.x * 3 + 1, id.y);
    uint2 srcAddr3 = uint2(id.x * 3 + 2, id.y);

    // Three texture reads per write; alpha is fixed at 1 (fully opaque).
    outTexture.write(half4(inTexture.read(srcAddr1).r, inTexture.read(srcAddr2).r, inTexture.read(srcAddr3).r, 1), id);
}

I'm defining inTexture as r8Unorm and outTexture as bgra8Unorm. Both textures are loaded with .storageModeShared, so no memory copies should be happening.

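The code works and the conversion is performed correctly, but performance is underwhelming. I've tried different threadgroupsPerGrid and threadsPerThreadgroup settings, but none of them gets performance comparable to Accelerate.framework.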

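For example, on an A7 (1st-generation iPad Air), a 1024x1024 image takes about 32 ms, versus 6 ms with Accelerate.framework. Interestingly, the gap is much smaller on a faster device such as the A9-based iPhone 6s (1.5 ms on the GPU versus 1.1 ms with Accelerate), but the Metal implementation is always slower.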
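To put these timings in perspective, here is a rough back-of-the-envelope sketch in C++. It assumes the measured times cover only the conversion itself, that every pixel costs 3 bytes read plus 4 bytes written, and that 1 GB means 1e9 bytes. The function names are hypothetical and are not part of any framework.

```cpp
// Bytes moved by an RGB888 -> RGBA8888 conversion:
// 3 bytes read plus 4 bytes written for every pixel.
constexpr double bytes_moved(double width, double height) {
    return width * height * (3.0 + 4.0);
}

// Effective memory throughput in GB/s (1 GB = 1e9 bytes, an assumption)
// for a conversion that moved `bytes` in `milliseconds` of wall time.
constexpr double throughput_gb_per_s(double bytes, double milliseconds) {
    return bytes / (milliseconds * 1e-3) / 1e9;
}
```

With the figures quoted above, a 1024x1024 image moves 7,340,032 bytes, which works out to roughly 0.23 GB/s for Metal versus 1.22 GB/s for Accelerate on the A7, and roughly 4.9 GB/s versus 6.7 GB/s on the A9. The conversion does almost no arithmetic per byte, so these numbers mostly reflect how well each implementation uses memory bandwidth.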

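Is this just not a GPU-friendly operation (possibly because of the countless unaligned memory accesses)? Am I missing something fundamental about maximizing the performance of my compute kernel?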

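UPDATE: I was finally able to get better performance than described above using the following implementation.

This approach uses 96-bit reads through packed_uint3 and 128-bit writes through packed_uint4, which improves performance significantly.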

// Each macro builds one RGBA word from the little-endian 32-bit words of the
// packed RGB stream; OR-ing with 0xff000000 sets alpha to 255.
#define RGB24_TO_RGBA32_PIXEL1(myUint) ((myUint) | 0xff000000)

#define RGB24_TO_RGBA32_PIXEL2(myUint1, myUint2) (((myUint1) >> 24) | \
                                                ((myUint2) << 8) | 0xff000000)

#define RGB24_TO_RGBA32_PIXEL3(myUint2, myUint3) (((myUint2) >> 16) | \
                                                ((myUint3) << 16) | 0xff000000)

#define RGB24_TO_RGBA32_PIXEL4(myUint3) (((myUint3) >> 8) | 0xff000000)

inline packed_uint4 packed_rgb24_to_packed_rgba32(packed_uint3 src) {
    return uint4(RGB24_TO_RGBA32_PIXEL1(src[0]),
                 RGB24_TO_RGBA32_PIXEL2(src[0], src[1]),
                 RGB24_TO_RGBA32_PIXEL3(src[1], src[2]),
                 RGB24_TO_RGBA32_PIXEL4(src[2]));
}

kernel void rgb24_to_rgba32_textures(
                         constant packed_uint3 *src [[ buffer(0) ]],
                         device packed_uint4 *dest [[ buffer(1) ]],
                         uint2 id [[ thread_position_in_grid ]])
{
    // Process 8 pixels per thread (two packed_uint3s, each containing 4 pixels):
    uint index = id.x * 2;
    dest[index] = packed_rgb24_to_packed_rgba32(src[index]);
    dest[index + 1] = packed_rgb24_to_packed_rgba32(src[index + 1]);
    return;
}
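
The word shuffling done by the macros above can be checked on the CPU. The following is a minimal C++ sketch, not part of the original post: it assumes a little-endian host and a pixel count that is a multiple of 4, and the function names are hypothetical. It applies the same shifts to three 32-bit words and provides a plain byte-by-byte conversion to compare against.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Convert 4 packed RGB pixels (three little-endian 32-bit words) into
// 4 RGBA words, using the same shifts as the Metal macros.
inline void rgb24x4_to_rgba32x4(const std::uint32_t src[3], std::uint32_t dst[4]) {
    dst[0] = src[0] | 0xff000000u;                           // R0 G0 B0, alpha replaces R1
    dst[1] = (src[0] >> 24) | (src[1] << 8) | 0xff000000u;   // R1 from word 0, G1 B1 from word 1
    dst[2] = (src[1] >> 16) | (src[2] << 16) | 0xff000000u;  // R2 G2 from word 1, B2 from word 2
    dst[3] = (src[2] >> 8) | 0xff000000u;                    // R3 G3 B3 from word 2
}

// Reference conversion, one byte at a time.
inline std::vector<std::uint8_t> convert_bytewise(const std::vector<std::uint8_t>& rgb) {
    std::vector<std::uint8_t> rgba;
    rgba.reserve(rgb.size() / 3 * 4);
    for (std::size_t i = 0; i + 2 < rgb.size(); i += 3) {
        rgba.push_back(rgb[i]);
        rgba.push_back(rgb[i + 1]);
        rgba.push_back(rgb[i + 2]);
        rgba.push_back(0xff);
    }
    return rgba;
}

// Word-wise conversion, 12 input bytes (4 pixels) at a time.
// Assumes rgb.size() is a multiple of 12; any remainder is ignored.
inline std::vector<std::uint8_t> convert_wordwise(const std::vector<std::uint8_t>& rgb) {
    const std::size_t groups = rgb.size() / 12;
    std::vector<std::uint8_t> rgba(groups * 16);
    for (std::size_t g = 0; g < groups; ++g) {
        std::uint32_t src[3];
        std::uint32_t dst[4];
        std::memcpy(src, rgb.data() + g * 12, sizeof src);   // memcpy avoids unaligned loads
        rgb24x4_to_rgba32x4(src, dst);
        std::memcpy(rgba.data() + g * 16, dst, sizeof dst);
    }
    return rgba;
}
```

For an input whose length is a multiple of 12 bytes, both functions should return identical buffers on a little-endian machine, which is a cheap way to validate the shifts before moving them into a kernel.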

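With this approach, the performance gap on older devices is much smaller (Accelerate is about 2x faster than the GPU), and on more modern (A9) devices Metal actually ends up about 40-50% faster.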

I tried processing one, two, or more packed_uint3 vectors per thread, and concluded that two vectors per thread is the performance sweet spot.

Answer 1

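There are a couple of avenues to explore here. I can't guarantee that you'll get Metal to beat Accelerate on your target devices, but there may be room for a bit of speedup.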

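Consider using buffers instead of textures. Your input buffer could be of type packed_char3 and your output buffer of type packed_char4. Then, rather than doing three texture reads per write, you can index into the source buffer once per pixel. As you note, most of these reads will be unaligned, but this approach might save you some format conversion and bandwidth.
Consider doing more work per kernel invocation. If your image dimensions are a multiple of 4 or 8 (for example), you could use a loop (which the compiler should unroll) to process that many pixels in each kernel invocation, reducing the number of threadgroups that need to be dispatched.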
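
The two suggestions above can be sketched on the CPU. The following is a minimal C++ stand-in, not Metal code: the 3-byte and 4-byte structs play the role of the packed buffer element types, convert_chunk plays the role of one thread's kernel body, and the plain loop in convert_all replaces the GPU dispatch. All names and the choice of 4 pixels per thread are assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins for the 3-byte and 4-byte packed buffer elements.
struct Rgb  { std::uint8_t r, g, b; };
struct Rgba { std::uint8_t r, g, b, a; };
static_assert(sizeof(Rgb) == 3, "Rgb must be tightly packed");
static_assert(sizeof(Rgba) == 4, "Rgba must be tightly packed");

// Pixels handled by each hypothetical thread (an assumed tuning value).
constexpr std::size_t kPixelsPerThread = 4;

// Body of one "thread": a short fixed-length loop, one source index per pixel.
inline void convert_chunk(const Rgb* src, Rgba* dst,
                          std::size_t pixel_count, std::size_t thread_index) {
    const std::size_t first = thread_index * kPixelsPerThread;
    for (std::size_t i = 0; i < kPixelsPerThread; ++i) {
        const std::size_t p = first + i;
        if (p >= pixel_count) return;  // guard for a partial last chunk
        dst[p] = Rgba{src[p].r, src[p].g, src[p].b, 0xff};
    }
}

// CPU stand-in for the dispatch: one call per thread, with the thread
// count rounded up so a partial last chunk is still covered.
inline std::vector<Rgba> convert_all(const std::vector<Rgb>& src) {
    std::vector<Rgba> dst(src.size());
    const std::size_t threads = (src.size() + kPixelsPerThread - 1) / kPixelsPerThread;
    for (std::size_t t = 0; t < threads; ++t) {
        convert_chunk(src.data(), dst.data(), src.size(), t);
    }
    return dst;
}
```

Raising kPixelsPerThread cuts the number of threads that have to be dispatched by the same factor. Whether that helps on a given GPU is something to measure rather than assume.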

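Accelerate is a great fit for your use case, so you may want to stick with it unless you're really short on CPU time or can tolerate the latency of dispatching the work to the GPU and waiting for the result.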

Answer 2

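For closure, here is the response from Apple's Developer Relations to this question. The bottom line is that the GPU offers no real advantage in this case, because the conversion is not a compute-heavy operation.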

After discussions with engineering, and evaluating more sample implementations, the verdict is out on Metal v.s. Accelerate performance for converting packed 24bpp RGB images to packed 32bpp RGBA images: on newer devices you can get close to the same performance using Metal but Accelerate will be faster for this operation. “vImage is an extremely well-tuned implementation and since this conversion operation is not compute heavy the best we can do is to be at parity.”

The proposed reasoning behind this is data locality and efficiently operating on multiple pixels at a time (something you’ve mentioned). The fastest Metal implementation tested processed two pixels per thread and still lagged behind vImageConvert_RGB888toRGBA8888.

There was an “optimized” implementation using Metal buffers rather than textures (something else that you’d mentioned exploring) and surprisingly this approach was slightly less performant.

Lastly, adjustment of thread groups came into discussion as well as tuning by adding code to the kernel to handle the case where the thread position in grid is outside the destination image. Again, despite these considerations Accelerate remained as the fastest implementation.
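
The threadgroup sizing and out-of-bounds handling mentioned in the response can be illustrated with a small C++ sketch. This is a CPU stand-in under stated assumptions, not the code Apple tested: it assumes one thread per output pixel and a 2D grid made of whole threadgroups, and every name in it is hypothetical.

```cpp
#include <cstdint>

struct Size2D { std::uint32_t width, height; };

// Integer division rounded up.
constexpr std::uint32_t ceil_div(std::uint32_t a, std::uint32_t b) {
    return (a + b - 1) / b;
}

// Threadgroups needed to cover the image when the image size is not
// necessarily a multiple of the threadgroup size.
constexpr Size2D threadgroups_per_grid(Size2D image, Size2D threads_per_group) {
    return Size2D{ceil_div(image.width, threads_per_group.width),
                  ceil_div(image.height, threads_per_group.height)};
}

// The guard a kernel would apply to its grid position before writing.
constexpr bool in_bounds(std::uint32_t x, std::uint32_t y, Size2D image) {
    return x < image.width && y < image.height;
}

// Walk every launched thread position and count those that pass the guard.
inline std::uint64_t count_guarded_threads(Size2D image, Size2D threads_per_group) {
    const Size2D groups = threadgroups_per_grid(image, threads_per_group);
    const std::uint32_t grid_w = groups.width * threads_per_group.width;
    const std::uint32_t grid_h = groups.height * threads_per_group.height;
    std::uint64_t count = 0;
    for (std::uint32_t y = 0; y < grid_h; ++y) {
        for (std::uint32_t x = 0; x < grid_w; ++x) {
            if (in_bounds(x, y, image)) ++count;
        }
    }
    return count;
}
```

For a 1000x700 image with 16x16 threadgroups, this yields 63 x 44 threadgroups, or 1008 x 704 launched threads, of which exactly 700,000 pass the guard. The remaining threads do no useful work, which is the overhead the guard trades for correctness at the image edges.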

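I should add that one real advantage of using Metal is CPU usage: although it isn't faster, it does significantly reduce the CPU's workload. For applications where the CPU is heavily loaded, the Metal approach may actually make sense.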
