Does the C++ standard force capture-by-reference of local variables to be inefficient?
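I recently needed a lambda that captures multiple local variables by reference, so I wrote a test snippet to investigate its efficiency and compiled it with clang 3.6 at -O3: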
void do_something_with(void*);

void test()
{
    int a = 0, b = 0, c = 0;
    auto func = [&] () {
        a++;
        b++;
        c++;
    };
    do_something_with((void*)&func);
}
movl $0x0,0x24(%rsp)
movl $0x0,0x20(%rsp)
movl $0x0,0x1c(%rsp)
lea 0x24(%rsp),%rax
mov %rax,(%rsp)
lea 0x20(%rsp),%rax
mov %rax,0x8(%rsp)
lea 0x1c(%rsp),%rax
mov %rax,0x10(%rsp)
lea (%rsp),%rdi
callq ...
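Clearly the lambda only needs the address of one of the variables; all the others could be reached from it by relative addressing.
Instead, the compiler created a struct on the stack containing a pointer to each local variable, and then passed the address of that struct to the lambda. This is much the same as if I had written: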
int a = 0, b = 0, c = 0;
struct X
{
    int *pa, *pb, *pc;
};
X x = {&a, &b, &c};
auto func = [p = &x] () {
    (*p->pa)++;
    (*p->pb)++;
    (*p->pc)++;
};
This is inefficient for various reasons, but most worryingly it could lead to heap allocation if too many variables are captured.
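My questions:
The fact that both clang and gcc do this at -O3 makes me suspect that something in the standard actually forces closures to be implemented inefficiently. Is this the case?
If so, for what reason? It cannot be for binary compatibility of lambdas between compilers, because any code that knows the type of the lambda is guaranteed to lie in the same translation unit.
If not, why is this optimisation missing from two major compilers?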
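EDIT:
Here is an example of the more efficient code that I would like to see from the compiler. This code uses less stack space, the lambda now performs only one pointer indirection instead of two, and the lambda's size does not grow with the number of captured variables: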
struct X
{
    int a = 0, b = 0, c = 0;
} x;
auto func = [&x] () {
    x.a++;
    x.b++;
    x.c++;
};
movl $0x0,0x8(%rsp)
movl $0x0,0xc(%rsp)
movl $0x0,0x10(%rsp)
lea 0x8(%rsp),%rax
mov %rax,(%rsp)
lea (%rsp),%rdi
callq ...
This looks like unspecified behavior. The following paragraph from the C++14 draft standard N3936, section 5.1.2 Lambda expressions [expr.prim.lambda], makes me think so:
An entity is captured by reference if it is implicitly or explicitly
captured but not captured by copy. It is unspecified whether
additional unnamed non-static data members are declared in the closure
type for entities captured by reference. [...]
This differs from entities captured by copy:
Every id-expression within the compound-statement of a
lambda-expression that is an odr-use (3.2) of an entity captured by
copy is transformed into an access to the corresponding unnamed data
member of the closure type.
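Thanks to dyp for pointing out some relevant documents that I somehow missed. It looks like defect report 750: Implementation constraints on reference-only closure objects provides the rationale for the current wording. It says: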
Consider an example like:
void f(vector<double> vec) {
    double x, y, z;
    fancy_algorithm(vec, [&]() { /* use x, y, and z in various ways */ });
}
5.1.2 [expr.prim.lambda] paragraph 8 requires that the closure class for this lambda will have three reference members, and paragraph 12
requires that it be derived from std::reference_closure, implying two
additional pointer members. Although 8.3.2 [dcl.ref] paragraph 4
allows a reference to be implemented without allocation of storage,
current ABIs require that references be implemented as pointers. The
practical effect of these requirements is that the closure object for
this lambda expression will contain five pointers. If not for these
requirements, however, it would be possible to implement the closure
object as a single pointer to the stack frame, generating data
accesses in the function-call operator as offsets relative to the
frame pointer. The current specification is too tightly constrained.
This echoes your exact point about allowing potential optimizations. The change was adopted as part of N2927, which includes the following:
The new wording no longer specifies any rewrite or closure members for "by reference" capture.
Uses of entities captured "by reference" affect the original entities, and the mechanism to
achieve this is left entirely to the implementation.