Are write-combining buffers used for normal writes to WB memory regions on Intel?
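Write-combining buffers have been a feature of Intel CPUs going back to at least the Pentium 4, and probably earlier. The basic idea is that these cache-line-sized buffers collect writes to the same cache line so that they can be handled as a unit. As an example of their implications for software performance, if you don't write the full cache line, you may experience reduced performance.
For example, in the Intel 64 and IA-32 Architectures Optimization Reference Manual, section "3.6.10 Write Combining" starts with the following description (emphasis added):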
Write combining (WC) improves performance in two ways:
• On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before that cache line is read for ownership (RFO) from further out in the cache/memory hierarchy. Then the rest of line is read, and the bytes that have not been written are combined with the unmodified bytes in the returned line.
• Write combining allows multiple writes to be assembled and written further out in the cache hierarchy as a unit. This saves port and bus traffic. Saving traffic is particularly important for avoiding partial writes to uncached memory.
There are six write-combining buffers (on Pentium 4 and Intel Xeon processors with a CPUID signature of family encoding 15, model encoding 3; there are 8 write-combining buffers). Two of these buffers may be written out to higher cache levels and freed up for use on other write misses. Only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining applies to memory type WC; it does not apply to memory type UC.
There are six write-combining buffers in each processor core in Intel Core Duo and Intel Core Solo processors. Processors based on Intel Core microarchitecture have eight write-combining buffers in each core. Starting with Intel microarchitecture code name Nehalem, there are 10 buffers available for write-combining.
Write combining buffers are used for stores of all memory types. They are particularly important for writes to uncached memory ...
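My question is whether write combining applies to WB memory regions (the "normal" memory you are using 99.99% of the time in user programs) when using normal stores (anything other than non-temporal stores, i.e. the stores you are using 99.99% of the time).
The text above is hard to interpret exactly, and it seems not to have been updated since the Core era. One part says that write combining "applies to WC memory but not UC", but of course that leaves out all the other types, such as WB. Later it says that "[WC is] particularly important for writes to uncached memory", seemingly contradicting the "doesn't apply to UC" part.
So are write-combining buffers used on modern Intel chips for normal stores to WB memory?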
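Yes, the write combining and coalescing properties of the LFBs (line fill buffers) support all memory types except the UC type. You can observe their impact experimentally using the following program. It takes two parameters as input:
STORE_COUNT: the number of 8-byte stores to perform sequentially.
INCREMENT: the stride, in bytes, between consecutive stores.
Four values of INCREMENT are particularly interesting:
- 64: Every store goes to a different cache line. Neither write combining nor coalescing takes effect.
- 0: All stores go to the same cache line and the same location within that line. Write coalescing takes effect in this case.
- 8: Every 8 consecutive stores go to the same cache line, but to different locations within that line. Write combining takes effect in this case.
- 4: The target locations of consecutive stores overlap within the same cache line. Some stores may cross two cache lines (depending on STORE_COUNT). Both write combining and coalescing take effect.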
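To make the four stride cases concrete, here is a minimal Python sketch that only does the address arithmetic: for each 8-byte store it works out which 64-byte lines are written. It assumes a 64-byte-aligned buffer (as `bufsrc` is) and 64-byte cache lines; the names `lines_touched` and `summarize` are illustrative and not part of the original program. It says nothing about what the hardware does with those stores.

```python
LINE_SIZE = 64   # assumed cache line size in bytes
STORE_SIZE = 8   # each `mov [rsi], rdx` writes 8 bytes


def lines_touched(store_index, increment):
    """Return the set of cache-line indices written by one store."""
    start = store_index * increment          # byte offset from the buffer start
    end = start + STORE_SIZE - 1             # last byte written by this store
    return set(range(start // LINE_SIZE, end // LINE_SIZE + 1))


def summarize(store_count, increment):
    """Return (distinct lines written, number of line-crossing stores)."""
    distinct = set()
    crossing = 0
    for i in range(store_count):
        touched = lines_touched(i, increment)
        distinct |= touched
        if len(touched) > 1:
            crossing += 1
    return len(distinct), crossing


if __name__ == "__main__":
    for inc in (64, 0, 8, 4):
        distinct, crossing = summarize(32, inc)
        print(f"INCREMENT={inc}: {distinct} distinct lines, "
              f"{crossing} line-crossing stores")
```

With 32 stores, the strides 64, 0, 8 and 4 touch 32, 1, 4 and 3 distinct lines respectively, and only the stride of 4 produces line-crossing stores (two of them, at zero-based store indices 15 and 31).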
There is one more parameter, ITERATIONS, which repeats the same experiment many times to make the measurements reliable. You can keep it at 1000.
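; STORE_COUNT and INCREMENT are not defined in this file; pass them to the assembler, e.g. nasm -f elf64 -DSTORE_COUNT=16 -DINCREMENT=64.
%define ITERATIONS 1000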
BITS 64
DEFAULT REL
section .bss
align 64
bufsrc: resb STORE_COUNT*64
section .text
global _start
_start:
mov ecx, ITERATIONS
.loop:
; Flush all the cache lines to make sure that it takes a substantial amount of time to fetch them.
lea rsi, [bufsrc]
mov edx, STORE_COUNT
.flush:
clflush [rsi]
sfence
lfence
add rsi, 64
sub edx, 1
jnz .flush
; This is the main loop where the stores are issued sequentially.
lea rsi, [bufsrc]
mov edx, STORE_COUNT
.inner:
mov [rsi], rdx
sfence ; Prevents potential combining in the store buffer.
add rsi, INCREMENT
sub edx, 1
jnz .inner
; Spend some time doing nothing so that all the LFBs become free for the next iteration.
mov edx, 100000
.wait:
lfence
sub edx, 1
jnz .wait
sub ecx, 1
jnz .loop
; Exit.
xor edi,edi
mov eax,231
syscall
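I recommend the following setup:
- Disable all hardware prefetchers using sudo wrmsr -a 0x1A4 0xf. This ensures that they do not interfere (or interfere only minimally) with the experiment.
- Set the CPU frequency to the maximum. This increases the probability that the main loop executes completely before the first cache line reaches the L1 and causes an LFB to be freed.
- Disable hyperthreading, because the LFBs are shared between logical cores (at least since Sandy Bridge, but not on all microarchitectures).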
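For readers wondering what the value 0xf means in the wrmsr command, the sketch below decodes it bit by bit. The bit assignments are an assumption taken from Intel's published prefetcher-control description for MSR 0x1A4 on several Core generations; they may not hold on every processor, so check the documentation for your model. The table and function names are illustrative.

```python
# Assumed bit layout of MSR 0x1A4 (a set bit disables that prefetcher).
# This matches Intel's public description for several Core generations,
# but it is not guaranteed for every CPU model.
PREFETCHER_BITS = {
    0: "L2 hardware prefetcher",
    1: "L2 adjacent cache line prefetcher",
    2: "DCU (L1D) hardware prefetcher",
    3: "DCU IP prefetcher",
}


def disabled_prefetchers(msr_value):
    """List the prefetchers that a given MSR 0x1A4 value would disable."""
    return [name for bit, name in PREFETCHER_BITS.items()
            if (msr_value >> bit) & 1]


if __name__ == "__main__":
    for name in disabled_prefetchers(0xF):
        print("disabled:", name)
```

Under that assumption, 0xf sets all four low bits and so disables all four prefetchers, while writing 0 re-enables them.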
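The L1D_PEND_MISS.FB_FULL performance counter lets us capture the effect of write combining in terms of how it affects the availability of LFBs. It is supported on Intel Core and later. It is described as follows: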
Number of times a request needed a FB (Fill Buffer) entry but there was no entry available for it. A request includes cacheable/uncacheable demands that are load, store or SW prefetch instructions.
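First, run the code without the inner loop and make sure that L1D_PEND_MISS.FB_FULL is zero, which means that the flush loop has no impact on the event count.
The following figure plots STORE_COUNT against the total L1D_PEND_MISS.FB_FULL count divided by ITERATIONS.
We can observe the following:
- It is clear that there are exactly 10 LFBs.
- When write combining or coalescing is possible, L1D_PEND_MISS.FB_FULL is zero for any number of stores.
- When the stride is 64 bytes, L1D_PEND_MISS.FB_FULL is greater than zero whenever the number of stores is greater than 10.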
Later you have that "[WC is] particularly important for writes to uncached memory", seemingly contradicting the "doesn't apply to UC" part.
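Both WC and UC are classified as uncacheable. So you can put the two statements together and deduce that write combining is particularly important for writes to WC memory.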