Converting from AoS to SoA - Write performance and optimal packing

Question

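I am currently looking into converting the data storage used in my CUDA kernels from an array of structures (AoS) to a structure of arrays (SoA).

I have a struct Element: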

struct Element
{
    float3 origin;
    float3 direction;
    uint8_t count1;
    uint8_t count2;
    unsigned int index;
    float distance;
    uint16_t instanceId;
    uint64_t hash;
};

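These structs are written as a whole by kernel1 into an array residing in global memory, and subsets of the entries are then used by several subsequent kernels.

I could now convert this into the following structure: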

struct ElementSoA
{
    float3 origin[N];
    float3 direction[N];
    uint8_t count1[N];
    uint8_t count2[N];
    unsigned int index[N];
    float distance[N];
    uint16_t instanceId[N];
    uint64_t hash[N];
};

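Questions:

1) Is the write performance affected if I have 8 separate, "small" writes in kernel1 instead of 1 "big" write?

2) Would it make sense to "pack" parts of the entries within ElementSoA, e.g. by combining count1 and count2 into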

struct uint8_2
{
    uint8_t count1;
    uint8_t count2;
};

3) If "packing" is useful, is there a way to calculate the optimal structure of ElementSoA? Assume I have a list of the per-kernel read accesses like this:

kernel2: origin, direction, hash
kernel3: count1, count2, distance, hash
...

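The reason I am asking for a way to calculate the optimal solution is that I have several structs with even more entries than Element, so I would otherwise need to implement and test a large number of combinations.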

Answer 1

I assume that thread i accesses element i and only element i. That seems to be the case from the context.

Is the write performance affected if I have 8 separate, "small" writes in kernel1 instead of 1 "big" write?

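Yes. It should be faster. The compiler splits the "big" write into multiple small writes, each of which is strided: consecutive threads write to addresses that are sizeof(Element) bytes apart. The memory subsystem works much better when the access pattern is unstrided, i.e. when consecutive threads write to consecutive addresses.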
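To make the stride concrete, here is a minimal host-side C++ sketch (not CUDA device code) that compares how far apart the addresses written by two neighbouring threads are when both store the distance field. Float3 is a stand-in for CUDA's float3, and ElementAoS, aos_stride_bytes, soa_stride_bytes and warp_span_bytes are names made up for this illustration; the exact sizeof depends on the ABI.

```cpp
#include <cstddef>
#include <cstdint>

// Host-side stand-in for CUDA's float3 (three floats, 4-byte aligned).
struct Float3 { float x, y, z; };

// Same fields, in the same order, as Element in the question.
struct ElementAoS {
    Float3 origin;
    Float3 direction;
    std::uint8_t count1;
    std::uint8_t count2;
    unsigned int index;
    float distance;
    std::uint16_t instanceId;
    std::uint64_t hash;
};

// Byte distance between the `distance` fields written by thread i and
// thread i + 1, assuming thread i handles element i.
constexpr std::size_t aos_stride_bytes() { return sizeof(ElementAoS); }
constexpr std::size_t soa_stride_bytes() { return sizeof(float); }

// Address range covered by the 32 `distance` stores of one warp:
// 31 strides between the first and last store, plus the last float itself.
constexpr std::size_t warp_span_bytes(std::size_t stride) {
    return 31 * stride + sizeof(float);
}
```

With the SoA layout the 32 stores of a warp cover one contiguous 128-byte range. With the AoS layout, where sizeof(ElementAoS) is 48 on a typical 64-bit ABI, the same stores are spread over roughly 1.5 KB.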

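It is worth pointing out that the float3 type you are using behaves the same way: each access is split into three 32-bit transactions. There is no reason you couldn't convert those from AoS to SoA as well, though.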
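A minimal sketch of what splitting float3 could look like, written as host-side C++ for brevity. Float3 again stands in for CUDA's float3, and Float3SoA with its store and load helpers is a hypothetical name, not something from the question; in a kernel the three arrays would be plain device pointers rather than std::vector.

```cpp
#include <cstddef>
#include <vector>

// Host-side stand-in for CUDA's float3.
struct Float3 { float x, y, z; };

// Hypothetical fully split layout: one array per component, so that
// element i and element i + 1 are adjacent in each array.
struct Float3SoA {
    std::vector<float> x, y, z;

    explicit Float3SoA(std::size_t n) : x(n), y(n), z(n) {}

    // Scatter one vector into the three component arrays.
    void store(std::size_t i, const Float3& v) {
        x[i] = v.x;
        y[i] = v.y;
        z[i] = v.z;
    }

    // Gather the three components of element i back into one vector.
    Float3 load(std::size_t i) const {
        return Float3{x[i], y[i], z[i]};
    }
};
```

Each of the three stores then has a 4-byte stride between neighbouring threads instead of the 12-byte stride of an array of float3.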

Would it make sense to "pack" parts of the entries within ElementSoA?

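Yes. Larger, power-of-two-sized and aligned types (up to 128 bits on current hardware) allow the hardware to load and store more efficiently. The difference is not large, but if it is easy to do, it is usually worth it.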
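As a sketch of what such packing could look like, the following uses standard C++ alignas (CUDA also offers built-in vector types such as uchar2). Counts mirrors the uint8_2 idea from the question; Kernel3Group is a purely hypothetical 128-bit grouping of the fields that kernel3 reads, shown only to illustrate size and alignment.

```cpp
#include <cstdint>

// Two 8-bit counters stored as one naturally aligned 16-bit unit.
struct alignas(2) Counts {
    std::uint8_t count1;
    std::uint8_t count2;
};
static_assert(sizeof(Counts) == 2, "Counts should occupy exactly 16 bits");

// Hypothetical group of the fields kernel3 reads, aligned to 128 bits.
// The members take 14 bytes; 2 bytes of tail padding bring it to 16.
struct alignas(16) Kernel3Group {
    std::uint64_t hash;
    float distance;
    std::uint8_t count1;
    std::uint8_t count2;
};
static_assert(sizeof(Kernel3Group) == 16, "group should occupy exactly 128 bits");
```

The alignment only makes a single wide access possible; whether the compiler actually emits one is best checked in the generated code. Note the trade-off in this particular grouping: kernel2 also reads hash, so it would fetch the 8 bytes of distance, counters and padding without using them.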

If "packing" is useful, is there a way to calculate the optimal structure of ElementSoA?

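There is no way to calculate this. One problem is that kernels have different characteristics. One may be heavily bandwidth-bound, so more efficient loads will help. Another may be compute-bound, so more efficient loads will not help much.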
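Even without a computable optimum, candidate layouts can at least be compared on one axis before implementing them. The sketch below counts the bytes fetched per element under a deliberately crude model: touching any field of a group fetches the whole group, and padding is ignored. Field, Group, bytes_fetched and example_layout are hypothetical names for this illustration, and the result says nothing about compute-bound kernels, so it can only narrow down what to benchmark.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One field of the original struct: its name and size in bytes.
struct Field {
    std::string name;
    std::size_t bytes;
};

// Fields stored together in one array form a group; a layout is a list of groups.
using Group = std::vector<Field>;

// Bytes fetched per element by a kernel that reads the fields in `wanted`,
// assuming that touching any field of a group fetches the whole group.
std::size_t bytes_fetched(const std::vector<Group>& layout,
                          const std::vector<std::string>& wanted) {
    std::size_t total = 0;
    for (const Group& group : layout) {
        bool touched = false;
        std::size_t group_bytes = 0;
        for (const Field& field : group) {
            group_bytes += field.bytes;
            if (std::find(wanted.begin(), wanted.end(), field.name) != wanted.end()) {
                touched = true;
            }
        }
        if (touched) {
            total += group_bytes;
        }
    }
    return total;
}

// One candidate layout for some of the fields of Element.
std::vector<Group> example_layout() {
    return std::vector<Group>{
        Group{Field{"origin", 12}, Field{"direction", 12}},
        Group{Field{"count1", 1}, Field{"count2", 1}},
        Group{Field{"distance", 4}},
        Group{Field{"hash", 8}},
    };
}
```

For this candidate, kernel3 (count1, count2, distance, hash) fetches 14 bytes per element and kernel2 (origin, direction, hash) fetches 32, with no unused bytes in either case under this model.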

