Why does GCC insert mfence where Clang doesn't use it?
Why do GCC and Clang generate such different asm for this code (x86_64, -O3 -std=c++17)?
#include <atomic>

int global_var = 0;

int foo_seq_cst(int a)
{
    std::atomic<int> ia;
    ia.store(global_var + a, std::memory_order_seq_cst);
    return ia.load(std::memory_order_seq_cst);
}

int foo_relaxed(int a)
{
    std::atomic<int> ia;
    ia.store(global_var + a, std::memory_order_relaxed);
    return ia.load(std::memory_order_relaxed);
}
GCC 9.1:
foo_seq_cst(int):
    add     edi, DWORD PTR global_var[rip]
    mov     DWORD PTR [rsp-4], edi
    mfence
    mov     eax, DWORD PTR [rsp-4]
    ret
foo_relaxed(int):
    add     edi, DWORD PTR global_var[rip]
    mov     DWORD PTR [rsp-4], edi
    mov     eax, DWORD PTR [rsp-4]
    ret
Clang 8.0:
foo_seq_cst(int):                       # @foo_seq_cst(int)
    mov     eax, edi
    add     eax, dword ptr [rip + global_var]
    ret
foo_relaxed(int):                       # @foo_relaxed(int)
    mov     eax, edi
    add     eax, dword ptr [rip + global_var]
    ret
I suspect the mfence here is overkill. Am I right? Or could the code Clang generates lead to errors in some situations?
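One point relevant to the first pair of functions: `ia` is a local object whose address never escapes, so no other thread can observe it, and under the as-if rule a compiler may treat it like a plain `int`. Clang's output amounts to exactly that. A minimal sketch of the only observable behavior `foo_seq_cst` has to preserve, assuming single-threaded use of `global_var`; `foo_seq_cst_as_if` and `check_equivalence` are hypothetical names used only for illustration:

```cpp
#include <atomic>
#include <cassert>

int global_var = 0;

// Function from the question: the atomic is local and never escapes.
int foo_seq_cst(int a)
{
    std::atomic<int> ia;
    ia.store(global_var + a, std::memory_order_seq_cst);
    return ia.load(std::memory_order_seq_cst);
}

// Hypothetical hand-written equivalent of what Clang emitted:
// because no other thread can see `ia`, only the return value is observable.
int foo_seq_cst_as_if(int a)
{
    return global_var + a;
}

// Sanity check: both versions agree for a small range of inputs.
void check_equivalence()
{
    for (int a = -3; a <= 3; ++a)
        assert(foo_seq_cst(a) == foo_seq_cst_as_if(a));
}
```

In this sketch the ordering guarantees of `seq_cst` only matter when another thread can access the same object; for a non-escaping local they have nothing to constrain.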
A more realistic example:
#include <atomic>

std::atomic<int> a;

void foo_seq_cst(int b) {
    a = b;
}

void foo_relaxed(int b) {
    a.store(b, std::memory_order_relaxed);
}
gcc-9.1:
foo_seq_cst(int):
    mov     DWORD PTR a[rip], edi
    mfence
    ret
foo_relaxed(int):
    mov     DWORD PTR a[rip], edi
    ret
clang-8.0:
foo_seq_cst(int):                       # @foo_seq_cst(int)
    xchg    dword ptr [rip + a], edi
    ret
foo_relaxed(int):                       # @foo_relaxed(int)
    mov     dword ptr [rip + a], edi
    ret
gcc uses mfence, while clang uses xchg, for std::memory_order_seq_cst.
xchg implies a lock prefix. Both lock and mfence satisfy the requirements of std::memory_order_seq_cst: no reordering and a single total order.
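The no-reordering requirement mentioned above is usually illustrated with the store-buffering (Dekker-style) litmus test. A minimal sketch, assuming a C++11-or-later compiler with `std::thread` support; `count_both_zero` and `check_seq_cst_forbids_both_zero` are hypothetical names. Each thread stores to its own flag and then loads the other's; with `seq_cst` on every access, the outcome `r1 == 0 && r2 == 0` is forbidden. That outcome is exactly the store-then-load reordering x86 permits for plain `mov` instructions, which is why the compiler needs either `mfence` or a locked instruction for the `seq_cst` store:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store-buffering test `iterations` times and counts how often
// both threads read 0 (the outcome seq_cst forbids).
int count_both_zero(int iterations)
{
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);  // store own flag
            r1 = y.load(std::memory_order_seq_cst); // then read the other flag
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join(); // join() makes r1 and r2 safely visible here
        t2.join();

        if (r1 == 0 && r2 == 0)
            ++both_zero;
    }
    return both_zero;
}

// With seq_cst everywhere, the forbidden outcome must never appear.
void check_seq_cst_forbids_both_zero()
{
    assert(count_both_zero(1000) == 0);
}
```

Spawning fresh threads on every iteration makes the two threads rarely overlap, so this sketch is an illustration of the guarantee rather than a reliable way to provoke the reordering with weaker memory orders.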
From the Intel 64 and IA-32 Architectures Software Developer's Manual:
MFENCE—Memory Fence
Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.
8.2.3.8 Locked Instructions Have a Total Order
The memory-ordering model ensures that all processors agree on a single execution order of all locked instructions, including those that are larger than 8 bytes or are not naturally aligned.
8.2.3.9 Loads and Stores Are Not Reordered with Locked Instructions
The memory-ordering model prevents loads and stores from being reordered with locked instructions that execute earlier or later.
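Since a locked instruction is not reordered with earlier or later loads and stores, an `xchg` whose old value is simply discarded can act as a sequentially consistent store. A hedged source-level sketch of the two strategies seen in the listings; the function names are hypothetical, and the claim that they typically compile to `xchg` and to `mov` + `mfence` on x86-64 is an assumption that depends on the compiler and version:

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> a{0};

// Strategy used by clang-8.0: a seq_cst read-modify-write whose result is
// ignored. On x86-64, exchange is typically lowered to xchg, which carries
// an implicit lock prefix.
void store_via_exchange(int b)
{
    (void)a.exchange(b, std::memory_order_seq_cst);
}

// Strategy used by gcc-9.1: a plain store followed by a full fence.
// On x86-64 a seq_cst thread fence is typically lowered to mfence.
// In the C++ memory model this pair is not formally identical to a seq_cst
// store; it only mirrors the instruction sequence gcc emitted.
void store_then_fence(int b)
{
    a.store(b, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Single-threaded sanity check: both strategies leave the new value in `a`.
void check_both_store(int b)
{
    store_via_exchange(b);
    assert(a.load() == b);
    store_then_fence(b + 1);
    assert(a.load() == b + 1);
}
```

The design trade-off is the one the question raises: both sequences prevent the store from being reordered with later loads, and the choice between them is a matter of cost rather than correctness.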
lock was benchmarked to be 2-3x faster than mfence, and Linux switched from mfence to lock where possible.