Are write-combining buffers used for normal writes to WB memory regions on Intel?

Question

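Write-combining buffers have been a feature of Intel CPUs going back to at least the Pentium 4, and probably earlier. The basic idea is that these cache-line-sized buffers collect writes to the same cache line so that they can be handled as a unit. As an example of their implications for software performance, if you don't write the full cache line, you may see reduced performance.

For example, in the Intel 64 and IA-32 Architectures Optimization Reference Manual, section "3.6.10 Write Combining" begins with the following description: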

Write combining (WC) improves performance in two ways:

• On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before that cache line is read for ownership (RFO) from further out in the cache/memory hierarchy. Then the rest of line is read, and the bytes that have not been written are combined with the unmodified bytes in the returned line.

• Write combining allows multiple writes to be assembled and written further out in the cache hierarchy as a unit. This saves port and bus traffic. Saving traffic is particularly important for avoiding partial writes to uncached memory.

There are six write-combining buffers (on Pentium 4 and Intel Xeon processors with a CPUID signature of family encoding 15, model encoding 3; there are 8 write-combining buffers). Two of these buffers may be written out to higher cache levels and freed up for use on other write misses. Only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining applies to memory type WC; it does not apply to memory type UC.

There are six write-combining buffers in each processor core in Intel Core Duo and Intel Core Solo processors. Processors based on Intel Core microarchitecture have eight write-combining buffers in each core. Starting with Intel microarchitecture code name Nehalem, there are 10 buffers available for write-combining.

Write combining buffers are used for stores of all memory types. They are particularly important for writes to uncached memory ...

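My question is whether write combining applies to WB memory regions (the "normal" memory you use 99.99% of the time in user programs) when using normal stores (anything other than non-temporal stores, i.e., the stores you use 99.99% of the time).

The text above is hard to interpret exactly, and it hasn't been updated since the Core era. One part says that write combining "applies to WC memory but not UC", but that leaves out all the other memory types, such as WB. Later you have that "[WC is] particularly important for writes to uncached memory", seemingly contradicting the "doesn't apply to UC" part.

So, are write-combining buffers used on modern Intel chips for normal stores to WB memory?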

Answer 1

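Yes, the write combining and coalescing properties of the LFBs (line fill buffers) support all memory types except UC. You can observe their impact experimentally using the following program. It takes two parameters as input:

STORE_COUNT: the number of 8-byte stores to perform sequentially.
INCREMENT: the stride between consecutive stores.

There are 4 values of INCREMENT that are particularly interesting:

64: All stores are performed on unique cache lines. Write combining and coalescing don't take effect.
0: All stores are to the same cache line and the same location within that line. Write coalescing takes effect in this case.
8: Every 8 consecutive stores are to the same cache line, but to different locations within that line. Write combining takes effect in this case.
4: The target locations of consecutive stores overlap within the same cache line. Some stores might span two cache lines (depending on STORE_COUNT). Both write combining and coalescing take effect.

There is one more parameter, ITERATIONS, which is used to repeat the same experiment many times to make reliable measurements. You can keep it at 1000.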
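To make the four INCREMENT cases concrete, here is a small Python sketch (not part of the original answer) that counts how many distinct cache lines a run of stores touches and how many stores straddle a line boundary. It assumes 8-byte stores, 64-byte cache lines, and a 64-byte-aligned buffer, matching the program below; the function names are illustrative only.

```python
def lines_touched(store_count, increment, store_size=8, line_size=64):
    """Number of distinct cache lines written by store_count sequential stores."""
    lines = set()
    for i in range(store_count):
        first_byte = i * increment
        last_byte = first_byte + store_size - 1
        lines.add(first_byte // line_size)
        lines.add(last_byte // line_size)  # differs from the first only for a split store
    return len(lines)


def split_stores(store_count, increment, store_size=8, line_size=64):
    """Number of stores whose bytes span two cache lines."""
    count = 0
    for i in range(store_count):
        first_byte = i * increment
        last_byte = first_byte + store_size - 1
        if first_byte // line_size != last_byte // line_size:
            count += 1
    return count


for inc in (64, 0, 8, 4):
    print(inc, lines_touched(16, inc), split_stores(16, inc))
```

With 16 stores, a stride of 64 touches 16 lines, a stride of 0 touches one line, and a stride of 8 touches two lines. A stride of 4 also touches two lines, because the store at byte offset 60 spills into the second line.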

; STORE_COUNT and INCREMENT must be defined at assembly time (e.g. with nasm -D).
%define ITERATIONS 1000

BITS 64
DEFAULT REL

section .bss
align 64
bufsrc:     resb STORE_COUNT*64

section .text
global _start
_start:  
    mov ecx, ITERATIONS

.loop:
; Flush all the cache lines to make sure that it takes a substantial amount of time to fetch them.
    lea rsi, [bufsrc]
    mov edx, STORE_COUNT
.flush:
    clflush [rsi]
    sfence
    lfence
    add rsi, 64
    sub edx, 1
    jnz .flush

; This is the main loop where the stores are issued sequentially.
    lea rsi, [bufsrc]
    mov edx, STORE_COUNT
.inner:
    mov [rsi], rdx
    sfence ; Prevents potential combining in the store buffer.
    add rsi, INCREMENT
    sub edx, 1
    jnz .inner

; Spend some time doing nothing so that all the LFBs become free for the next iteration.
    mov edx, 100000
.wait:
    lfence
    sub edx, 1
    jnz .wait

    sub ecx, 1
    jnz .loop

; Exit.    
    xor edi,edi
    mov eax,231
    syscall
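One possible way to sweep the two parameters is sketched below in Python. It is not from the original answer and rests on several assumptions: the listing above is saved as wc_test.asm (a hypothetical file name), nasm, ld, and Linux perf are installed, and perf on the target CPU exposes the event under the name l1d_pend_miss.fb_full. Check the event name with perf list before relying on it.

```python
import subprocess

SRC = "wc_test.asm"              # hypothetical file name for the listing above
EVENT = "l1d_pend_miss.fb_full"  # assumed perf event name; verify with `perf list`


def build_commands(store_count, increment, src=SRC, exe="wc_test"):
    """Return the argv lists that assemble, link, and measure one configuration."""
    obj = exe + ".o"
    return [
        ["nasm", "-f", "elf64",
         f"-DSTORE_COUNT={store_count}", f"-DINCREMENT={increment}",
         "-o", obj, src],
        ["ld", "-o", exe, obj],
        # -x , asks perf stat for CSV output, which is easier to post-process.
        ["perf", "stat", "-x", ",", "-e", EVENT, "./" + exe],
    ]


def run_one(store_count, increment):
    """Assemble, link, and run one configuration; return perf's raw CSV text."""
    *build, measure = build_commands(store_count, increment)
    for cmd in build:
        subprocess.run(cmd, check=True)
    result = subprocess.run(measure, check=True, capture_output=True, text=True)
    return result.stderr  # perf stat reports its counts on stderr by default


def sweep(increments=(64, 0, 8, 4), max_stores=20):
    """Print the raw perf output for each (INCREMENT, STORE_COUNT) pair."""
    for increment in increments:
        for store_count in range(1, max_stores + 1):
            print(increment, store_count, run_one(store_count, increment).strip())
```

Calling sweep() runs every configuration once. The reported count is the total over all ITERATIONS, so divide it by 1000 to get a per-iteration value.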

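I recommend the following setup:

Disable all hardware prefetchers using sudo wrmsr -a 0x1A4 0xf. This ensures that they don't interfere (or interfere only minimally) with the experiment.
Set the CPU frequency to the maximum. This increases the probability that the main loop executes completely before the first cache line reaches the L1 and causes an LFB to be freed.
Disable hyperthreading because the LFBs are shared (at least since Sandy Bridge, but not on all microarchitectures).

The L1D_PEND_MISS.FB_FULL performance counter lets us capture the effect of write combining by showing how it impacts the availability of LFBs. It is supported on Intel Core and later. It is described as follows: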

Number of times a request needed a FB (Fill Buffer) entry but there was no entry available for it. A request includes cacheable/uncacheable demands that are load, store or SW prefetch instructions.

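First run the code without the inner loop and make sure that L1D_PEND_MISS.FB_FULL is zero, which means that the flush loop has no impact on the event count.

The following figure plots STORE_COUNT against the total L1D_PEND_MISS.FB_FULL divided by ITERATIONS.

We can observe the following:

It's clear that there are exactly 10 LFBs.
When write combining or coalescing is possible, L1D_PEND_MISS.FB_FULL is zero for any number of stores.
When the stride is 64 bytes, L1D_PEND_MISS.FB_FULL is larger than zero when the number of stores is larger than 10.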

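Later you have that "[WC is] particularly important for writes to uncached memory", seemingly contradicting the "doesn't apply to UC" part.

Both WC and UC are classified as uncacheable. So you can put the two statements together to deduce that write combining is particularly important for writes to WC memory.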


performance

x86

intel

cpu-architecture