Does CUDA cache data from global memory in the unified cache before storing it in shared memory?

As far as I know, on previous NVIDIA GPU architectures data is stored into shared memory along the path global memory → L2 → L1 → register → shared memory.

But Maxwell GPUs (e.g., the GTX 980) physically separate the unified cache from shared memory. Does this architecture store data into shared memory along the same path, or does it support direct communication between global memory and shared memory?
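For concreteness, this is a minimal sketch of the staging pattern I am asking about, where each thread loads a value from global memory into a register and then writes it to shared memory (the kernel and names are illustrative, not from any particular codebase):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: stage one tile of global memory into shared memory.
// The load from g[] travels through the cache hierarchy into a register;
// writing tile[] is a separate register-to-shared-memory store.
__global__ void stageToShared(const float* g, float* out, int n)
{
    __shared__ float tile[256];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Consecutive threads read consecutive addresses, so the warp's
        // loads coalesce into as few memory transactions as possible.
        tile[threadIdx.x] = g[idx];   // global -> register -> shared
    }
    __syncthreads();

    // ... operate on tile[] in shared memory (placeholder work) ...
    if (idx < n) {
        out[idx] = tile[threadIdx.x] * 2.0f;
    }
}

int main()
{
    const int n = 1024;
    float *g, *out;
    cudaMalloc(&g, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    stageToShared<<<n / 256, 256>>>(g, out, n);
    cudaDeviceSynchronize();
    cudaFree(g);
    cudaFree(out);
    return 0;
}
```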

This should answer most of your questions about the memory types and load paths in the Maxwell architecture:

As with Kepler, global loads in Maxwell are cached in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler.

In a manner similar to Kepler GK110B, GM204 retains this behavior by default but also allows applications to opt-in to caching of global loads in its unified L1/Texture cache. The opt-in mechanism is the same as with GK110B: pass the -Xptxas -dlcm=ca flag to nvcc at compile time.

Local loads also are cached in L2 only, which could increase the cost of register spilling if L1 local load hit rates were high with Kepler. The balance of occupancy versus spilling should therefore be reevaluated to ensure best performance. Especially given the improvements to arithmetic latencies, code built for Maxwell may benefit from somewhat lower occupancy (due to increased registers per thread) in exchange for lower spilling.

The unified L1/texture cache acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. This function previously was served by the separate L1 cache in Fermi and Kepler.

From subsection "1.4.2.1. Unified L1/Texture Cache" under section "1.4.2. Memory Throughput" of the NVIDIA Maxwell tuning guide.
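To make the two mechanisms quoted above concrete, here is a hedged sketch (the kernel names and the file name `app.cu` are illustrative): the opt-in to L1 caching of global loads is purely a compile-time flag, while the LDG read-only data cache path is requested from code, either implicitly through pointer qualifiers or explicitly with `__ldg()`:

```cuda
// Opt in to caching global loads in the unified L1/Texture cache
// (same mechanism as on GK110B), at compile time:
//
//   nvcc -Xptxas -dlcm=ca app.cu -o app

// Read-only data cache (LDG) path, available since compute capability 3.5.
// Option 1: qualify the pointers so the compiler can prove the data is
// read-only for the lifetime of the kernel and may emit LDG loads itself.
__global__ void scale(const float* __restrict__ in,
                      float* __restrict__ out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[i];           // compiler may emit an LDG load here
}

// Option 2: force the read-only path explicitly with __ldg().
__global__ void scaleLdg(const float* in, float* out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * __ldg(&in[i]);   // explicit read-only data cache load
}
```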

The sections and subsections that follow these also cover other clearly useful details, such as shared memory sizes/bandwidth and caching. Give them a read!