Are write-combining buffers used for normal writes to WB memory regions on Intel?

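Write-combining buffers have been a feature of Intel CPUs going back to at least the Pentium 4, and probably earlier. The basic idea is that these cache-line-sized buffers collect writes to the same cache line so that they can be handled as a unit. As an example of their implications for software performance, if you don't write the full cache line, you may experience reduced performance.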

For example, in the Intel 64 and IA-32 Architectures Optimization Reference Manual, section "3.6.10 Write Combining" starts with the following description (emphasis added):

Write combining (WC) improves performance in two ways:

• On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before that cache line is read for ownership (RFO) from further out in the cache/memory hierarchy. Then the rest of line is read, and the bytes that have not been written are combined with the unmodified bytes in the returned line.

• Write combining allows multiple writes to be assembled and written further out in the cache hierarchy as a unit. This saves port and bus traffic. Saving traffic is particularly important for avoiding partial writes to uncached memory.

There are six write-combining buffers (on Pentium 4 and Intel Xeon processors with a CPUID signature of family encoding 15, model encoding 3; there are 8 write-combining buffers). Two of these buffers may be written out to higher cache levels and freed up for use on other write misses. Only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining applies to memory type WC; it does not apply to memory type UC.

There are six write-combining buffers in each processor core in Intel Core Duo and Intel Core Solo processors. Processors based on Intel Core microarchitecture have eight write-combining buffers in each core. Starting with Intel microarchitecture code name Nehalem, there are 10 buffers available for write-combining.

Write combining buffers are used for stores of all memory types. They are particularly important for writes to uncached memory ...

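My question is whether write combining applies to WB memory regions (i.e., the "normal" memory you use 99.99% of the time in user programs) when using normal stores (i.e., anything other than non-temporal stores, which are the stores you use 99.99% of the time).

The text above is hard to interpret exactly, and it has not been updated since the Core era. There is the part that says write combining "applies to WC memory but not UC", but of course that leaves out all the other types, such as WB. Later it says that "[WC is] particularly important for writes to uncached memory", seemingly contradicting the "doesn't apply to UC" part.

So are write-combining buffers used on modern Intel chips for normal stores to WB memory?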

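Yes, the write combining and coalescing properties of the LFBs (line fill buffers) apply to all memory types except UC. You can observe their impact experimentally using the following program. It takes two parameters as input:

  • STORE_COUNT: the number of 8-byte stores to perform sequentially.
  • INCREMENT: the stride, in bytes, between consecutive stores.

Four values of INCREMENT are particularly interesting:

  • 64: Every store goes to a different cache line. Neither write combining nor coalescing takes effect.
  • 0: All stores go to the same cache line and to the same location within that line. Write coalescing takes effect in this case.
  • 8: Every 8 consecutive stores go to the same cache line, but to different locations within that line. Write combining takes effect in this case.
  • 4: The target locations of consecutive stores overlap within the same cache line. Some stores may cross two cache lines (depending on STORE_COUNT). Both write combining and coalescing take effect.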
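To make the four cases concrete, here is a small Python sketch (my own helper, not part of the experiment) that counts how many distinct 64-byte cache lines the stores touch for a given STORE_COUNT and INCREMENT. It assumes 8-byte stores that start at a 64-byte-aligned buffer, as in the program below; the function name lines_touched is made up for illustration.

```python
CACHE_LINE = 64   # bytes per cache line
STORE_SIZE = 8    # each store in the experiment writes 8 bytes


def lines_touched(store_count, increment):
    """Count distinct cache lines written by store_count 8-byte stores
    placed increment bytes apart, starting at a line-aligned address."""
    lines = set()
    for i in range(store_count):
        first_byte = i * increment
        last_byte = first_byte + STORE_SIZE - 1
        # A store may straddle a line boundary, so record both ends.
        lines.add(first_byte // CACHE_LINE)
        lines.add(last_byte // CACHE_LINE)
    return len(lines)


if __name__ == "__main__":
    for inc in (64, 0, 8, 4):
        print(inc, lines_touched(16, inc))
```

With INCREMENT = 4 and STORE_COUNT = 16, the last store starts at byte 60 and spills into the second line, which is the line-crossing case mentioned in the list above.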

There is one more parameter, ITERATIONS, which repeats the same experiment many times so that the measurements are reliable. You can leave it at 1000.

; STORE_COUNT and INCREMENT must be defined at assembly time (for example with NASM's -D option).
%define ITERATIONS 1000

BITS 64
DEFAULT REL

section .bss
align 64
bufsrc:     resb STORE_COUNT*64

section .text
global _start
_start:  
    mov ecx, ITERATIONS

.loop:
; Flush all the cache lines to make sure that it takes a substantial amount of time to fetch them.
    lea rsi, [bufsrc]
    mov edx, STORE_COUNT
.flush:
    clflush [rsi]
    sfence
    lfence
    add rsi, 64
    sub edx, 1
    jnz .flush

; This is the main loop where the stores are issued sequentially.
    lea rsi, [bufsrc]
    mov edx, STORE_COUNT
.inner:
    mov [rsi], rdx
    sfence ; Prevents potential combining in the store buffer.
    add rsi, INCREMENT
    sub edx, 1
    jnz .inner

; Spend some time doing nothing so that all the LFBs become free for the next iteration.
    mov edx, 100000
.wait:
    lfence
    sub edx, 1
    jnz .wait

    sub ecx, 1
    jnz .loop

; Exit.    
    xor edi,edi
    mov eax,231
    syscall

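I recommend the following setup:

  • Disable all hardware prefetchers using sudo wrmsr -a 0x1A4 0xf. This ensures that they do not interfere (or interfere only minimally) with the experiment.
  • Set the CPU frequency to the maximum. This increases the probability that the main loop executes completely before the first cache line reaches the L1 and causes an LFB to be freed.
  • Disable hyperthreading, because the LFBs are shared (at least since Sandy Bridge, but not on all microarchitectures).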
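For reference, the value 0xf sets four disable bits in MSR 0x1A4. The sketch below decodes them using the bit layout Intel has published for this MSR on many Core-family processors (bits 0 through 3). That layout is an assumption here, not something the experiment verifies, so check it against the documentation for your specific CPU. The helper name disabled_prefetchers is hypothetical.

```python
# Assumed bit layout of MSR 0x1A4 (a set bit disables that prefetcher).
PREFETCHER_DISABLE_BITS = {
    0: "L2 hardware prefetcher",
    1: "L2 adjacent cache line prefetcher",
    2: "DCU prefetcher",
    3: "DCU IP prefetcher",
}


def disabled_prefetchers(msr_value):
    """Return the names of the prefetchers that msr_value would disable."""
    return [
        name
        for bit, name in sorted(PREFETCHER_DISABLE_BITS.items())
        if (msr_value >> bit) & 1
    ]


if __name__ == "__main__":
    for name in disabled_prefetchers(0xF):
        print("disabled:", name)
```

Writing 0 to the same MSR would, under the same assumption, re-enable all four prefetchers after the experiment.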

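The L1D_PEND_MISS.FB_FULL performance counter lets us capture the effect of write combining by showing how it affects the availability of LFBs. It is supported on Intel Core and later. Its description is as follows: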

Number of times a request needed a FB (Fill Buffer) entry but there was no entry available for it. A request includes cacheable/uncacheable demands that are load, store or SW prefetch instructions.

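First, run the code without the inner loop and make sure that L1D_PEND_MISS.FB_FULL is zero, which means that the flush loop has no impact on the event count.

The following figure plots STORE_COUNT against the total L1D_PEND_MISS.FB_FULL divided by ITERATIONS.

We can observe the following:

  • It is clear that there are exactly 10 LFBs.
  • When write combining or coalescing is possible, L1D_PEND_MISS.FB_FULL is zero for any number of stores.
  • When the stride is 64 bytes, L1D_PEND_MISS.FB_FULL is greater than zero whenever the number of stores is greater than 10.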

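Later you have that "[WC is] particularly important for writes to uncached memory", seemingly contradicting the "doesn't apply to UC" part.

Both WC and UC are classified as uncacheable. So you can put the two statements together and deduce that write combining is particularly important for writes to WC memory.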

See also: