Writing a full cache line at an uncached address before reading it again on x64

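On x64, if you first write, within a short period of time, the contents of a full cache line to a previously uncached address, and then soon afterwards read from that address again, can the CPU avoid having to read the old contents of that address from memory?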

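In effect, it shouldn't matter what the memory contained previously, because the full cache line's worth of data has been completely overwritten. I can understand that a partial cache line write to an uncached address, followed by a read, would incur the overhead of having to synchronise with main memory and so on.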

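Looking at the documentation on write allocate, write combining and snooping has left me a little confused about this. Currently I think that an x64 CPU cannot do this?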

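In general, the subsequent read should be fast, as long as store-to-load forwarding is able to work. In fact, it has nothing to do with writing an entire cache line at all: it should also work (with the same caveat) even for smaller writes!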

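Basically, what happens on normally mapped memory (i.e., WB memory regions) is that the store(s) add several entries to the CPU's store buffer. Since the associated memory isn't currently cached, these entries will linger for some time, because an RFO request has to pull the line into the cache before it can be written.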

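In the meantime, you issue some loads that target the same memory you just wrote, and these will usually be satisfied by store-to-load forwarding, which pretty much just notices that a store to the same address is already in the store buffer and uses it as the result of the load, without needing to go to memory.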
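The access pattern under discussion can be sketched in C as follows. This is only an illustration of the store-then-load sequence, not a benchmark: `fill_line_then_read` and `CACHE_LINE` are hypothetical names, a 64-byte line is assumed, and nothing in the code can show whether the hardware actually forwarded the load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size in bytes */
#define WORDS_PER_LINE (CACHE_LINE / sizeof(uint64_t))

/* Hypothetical helper: overwrite one whole cache line with eight 8-byte
 * stores, then immediately reload one of the words just written.
 * `volatile` keeps the compiler from folding the stores and the load away. */
uint64_t fill_line_then_read(volatile uint64_t *line, uint64_t seed, size_t idx)
{
    assert(idx < WORDS_PER_LINE);
    for (size_t i = 0; i < WORDS_PER_LINE; i++) {
        line[i] = seed + i;   /* each store gets a store-buffer entry */
    }
    /* Same address and same size as an earlier store: the load is a
     * candidate for store-to-load forwarding while the RFO is in flight. */
    return line[idx];
}
```

The same reasoning applies if the loop wrote only a single word: the reload depends on the matching store being in the store buffer, not on the whole line having been written.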

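Now, store forwarding doesn't always work. In particular, it never works on any Intel (or, probably, AMD) CPU when the load only partially overlaps the most recent store involved. That is, if you write 4 bytes to address 10 and then read 4 bytes from address 9, only 3 bytes come from that write, and the byte at address 9 has to come from somewhere else. In that case, all Intel CPUs simply wait for all the stores involved to be written and then resolve the load.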
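The 4-byte example can be written out in C to make the byte ranges concrete. This is a sketch with a hypothetical function name; an optimizing compiler is free to merge or reorder these accesses, so it shows which bytes come from where rather than guaranteeing a forwarding stall.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store 4 bytes at offset 10, then load 4 bytes
 * starting at offset 9. `buf` must be at least 14 bytes long. */
uint32_t store_then_overlapping_load(unsigned char *buf, uint32_t value)
{
    assert(buf != NULL);
    memcpy(buf + 10, &value, sizeof value);  /* store covers bytes 10..13 */

    uint32_t out;
    memcpy(&out, buf + 9, sizeof out);       /* load covers bytes 9..12 */
    /* Bytes 10..12 of the load come from the store above; byte 9 does not,
     * so the store alone cannot supply the whole load. */
    return out;
}
```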

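In the past there were many other cases that would also fail. For example, a smaller read that was fully contained in an earlier store would often fail: given a 4-byte write to address 10, a 2-byte read from address 12 is fully contained in the earlier write, but often would not be forwarded because the hardware was not sophisticated enough to detect that case.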
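The distinction between a load that is fully contained in a store and one that only partially overlaps it comes down to a range check. Here is a minimal sketch; the enum and function names are made up for illustration, and the function only classifies address ranges, it does not predict what a particular CPU will do.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { NO_OVERLAP, LOAD_CONTAINED, PARTIAL_OVERLAP } overlap_kind;

/* Classify how a load's byte range relates to an earlier store's range. */
overlap_kind classify_overlap(uint64_t st_addr, uint64_t st_size,
                              uint64_t ld_addr, uint64_t ld_size)
{
    assert(st_size > 0 && ld_size > 0);
    uint64_t st_end = st_addr + st_size;  /* one past the last stored byte */
    uint64_t ld_end = ld_addr + ld_size;  /* one past the last loaded byte */

    if (ld_end <= st_addr || st_end <= ld_addr) {
        return NO_OVERLAP;                /* the store is irrelevant */
    }
    if (ld_addr >= st_addr && ld_end <= st_end) {
        return LOAD_CONTAINED;            /* every loaded byte was stored */
    }
    return PARTIAL_OVERLAP;               /* some loaded bytes come from elsewhere */
}
```

With the numbers used in the text, a 4-byte store at 10 followed by a 4-byte load at 9 classifies as `PARTIAL_OVERLAP`, while a 2-byte load at 12 classifies as `LOAD_CONTAINED`.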

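The recent trend, however, is that every case other than the "not fully contained read" case described above forwards successfully on modern CPUs. The gory details are well covered, with pretty pictures, on stuffedcow, and Agner also covers it well in his microarchitecture guide.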

From the document linked above, here is what Agner says about store forwarding on Skylake:

The Skylake processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding is one clock cycle faster than on previous processors. A memory write followed by a read from the same address takes 4 clock cycles in the best case for operands of 32 or 64 bits, and 5 clock cycles for other operand sizes.

Store forwarding has a penalty of up to 3 clock cycles extra when an operand of 128 or 256 bits is misaligned.

A store forwarding usually takes 4 - 5 clock cycles extra when an operand of any size crosses a cache line boundary, i.e. an address divisible by 64 bytes.

A write followed by a smaller read from the same address has little or no penalty.

A write of 64 bits or less followed by a smaller read has a penalty of 1 - 3 clocks when the read is offset but fully contained in the address range covered by the write.

An aligned write of 128 or 256 bits followed by a read of one or both of the two halves or the four quarters, etc., has little or no penalty. A partial read that does not fit into the halves or quarters can take 11 clock cycles extra.

A read that is bigger than the write, or a read that covers both written and unwritten bytes, takes approximately 11 clock cycles extra.

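The last case, where the read is bigger than the write, is definitely one where store forwarding stalls. The quoted 11 cycles probably applies when all of the bytes involved are in L1; if some of the bytes aren't cached at all (your scenario), it could of course take on the order of a DRAM miss, which can be hundreds of cycles.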

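Finally, note that none of the above has anything to do with writing an entire cache line: it works just as well if you write 1 byte and then read that same byte, leaving the other 63 bytes of the cache line untouched.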

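There is an effect similar to the one you describe for full cache lines, but it applies to write-combining writes, which are available either by marking the memory as write-combining (rather than the usual write-back) or by using the non-temporal store instructions. The NT instructions are mostly intended for writing memory that won't be read again soon; they skip the RFO overhead, and probably don't forward to subsequent loads.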
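For reference, a full-line write with non-temporal stores might look like the sketch below. It assumes SSE2 (always present on x64) and uses the `_mm_stream_si128` intrinsic; `write_line_nontemporal` is a hypothetical name, and the `memcpy` branch is only a fallback so the code still builds on non-x86 targets, where it uses ordinary stores instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

#define CACHE_LINE 64   /* assumed line size in bytes */

/* Hypothetical helper: overwrite one 64-byte-aligned line at `dst` with the
 * 64 bytes at `src`, using non-temporal stores where available. */
void write_line_nontemporal(void *dst, const void *src)
{
    assert(((uintptr_t)dst % CACHE_LINE) == 0);
#if defined(__SSE2__)
    for (int i = 0; i < CACHE_LINE / 16; i++) {
        __m128i v = _mm_loadu_si128((const __m128i *)src + i);
        _mm_stream_si128((__m128i *)dst + i, v);  /* NT store, no RFO needed */
    }
    _mm_sfence();  /* order the NT stores before later stores */
#else
    memcpy(dst, src, CACHE_LINE);  /* fallback: ordinary write-back stores */
#endif
}
```

As the text notes, this suits data that will not be read back soon; reading the line immediately afterwards is the opposite of the intended use.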

cpu

caching

intel