Why is indirect increment faster than direct increment?
This question was already asked by another SO member, but was disappointingly deleted. The comments said the measurements were flawed and made no sense. However, I was able to reproduce the original problem with a small JMH benchmark:
package bench;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.*;
import org.openjdk.jmh.runner.options.*;

import java.util.concurrent.*;

@State(Scope.Benchmark)
public class LoopInc {

    private int getValue() {
        return ThreadLocalRandom.current().nextInt(2);
    }

    @Benchmark
    public int directInc() {
        int result = 0;
        for (int i = 0; i < 1000; i++) {
            switch (getValue()) {
                case 0:
                    break;
                case 1:
                    result++;
                    break;
            }
        }
        return result;
    }

    @Benchmark
    public int indirectInc() {
        int result = 0;
        for (int i = 0; i < 1000; i++) {
            boolean incr = false;
            switch (getValue()) {
                case 0:
                    break;
                case 1:
                    incr = true;
                    break;
            }
            if (incr) {
                result++;
            }
        }
        return result;
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include("bench.LoopInc.*")
                .warmupIterations(5)
                .measurementIterations(10)
                .forks(3)
                .timeUnit(TimeUnit.MILLISECONDS)
                .build();
        new Runner(options).run();
    }
}
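(Assuming the standard JMH Maven archetype layout, the same benchmarks can also be launched from a console as java -jar target/benchmarks.jar bench.LoopInc, instead of going through main().)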
The benchmark shows that indirectInc runs 3x faster, even though the "optimization" is not at all obvious. One would expect indirectInc to run a little slower, because it involves an extra intermediate operation.
Benchmark            Mode  Cnt    Score   Error   Units
LoopInc.directInc   thrpt   30  127.301 ± 0.202  ops/ms
LoopInc.indirectInc thrpt   30  378.147 ± 1.144  ops/ms
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)
What makes the JIT compile indirectInc so much better than the very similar directInc?
OK, here's how you approach these things.
Try to reproduce first. OK, it reproduces:
Benchmark Mode Cnt Score Error Units
LoopInc.directInc thrpt 15 175.678 ± 1.118 ops/ms
LoopInc.indirectInc thrpt 15 641.413 ± 9.722 ops/ms
Try to look at the generated assembly with -prof perfasm. Long story short, it generates tons of code there, so we would probably want to limit loop unrolling. But since unrolling affects performance, and may well be the cause, let's re-run with -XX:LoopUnrollLimit=1. OK, the scores are lower, but the gap is still there, excellent:
Benchmark Mode Cnt Score Error Units
LoopInc.directInc thrpt 15 161.147 ± 6.101 ops/ms
LoopInc.indirectInc thrpt 15 489.430 ± 1.698 ops/ms
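As an aside, if you drive runs from main() as in the listing above, the same re-run can be expressed through the JMH API. A minimal sketch, using OptionsBuilder.jvmArgsAppend (JMH's way of passing extra flags to the forked JVMs), which drops into the main() above:

// Same benchmarks, but with loop unrolling effectively limited
// in the forked JVMs via -XX:LoopUnrollLimit=1.
Options options = new OptionsBuilder()
        .include("bench.LoopInc.*")
        .jvmArgsAppend("-XX:LoopUnrollLimit=1") // flags for the forked JVMs
        .warmupIterations(5)
        .measurementIterations(10)
        .forks(3)
        .timeUnit(TimeUnit.MILLISECONDS)
        .build();
new Runner(options).run();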
Take another look at the generated code; still, nothing pops out. Well, this looks interesting. Let's get serious about it. Can we characterize the workload? Of course we can, with the help of -prof perfnorm, which normalizes the hardware counters per benchmark op. Let's see:
Benchmark Mode Cnt Score Error Units
LoopInc.directInc thrpt 15 161.875 ± 3.038 ops/ms
LoopInc.directInc:·CPI thrpt 3 0.967 ± 0.196 #/op
LoopInc.directInc:·L1-dcache-load-misses thrpt 3 0.394 ± 3.663 #/op
LoopInc.directInc:·L1-dcache-loads thrpt 3 2149.594 ± 228.166 #/op
LoopInc.directInc:·L1-dcache-store-misses thrpt 3 0.114 ± 1.001 #/op
LoopInc.directInc:·L1-dcache-stores thrpt 3 1073.666 ± 96.066 #/op
LoopInc.directInc:·L1-icache-load-misses thrpt 3 0.965 ± 22.984 #/op
LoopInc.directInc:·LLC-loads thrpt 3 0.204 ± 2.763 #/op
LoopInc.directInc:·LLC-stores thrpt 3 0.060 ± 0.633 #/op
LoopInc.directInc:·branch-misses thrpt 3 536.068 ± 43.293 #/op
LoopInc.directInc:·branches thrpt 3 3728.890 ± 220.539 #/op
LoopInc.directInc:·cycles thrpt 3 26219.146 ± 6287.590 #/op
LoopInc.directInc:·dTLB-load-misses thrpt 3 0.063 ± 0.124 #/op
LoopInc.directInc:·dTLB-loads thrpt 3 2136.942 ± 165.990 #/op
LoopInc.directInc:·dTLB-store-misses thrpt 3 0.022 ± 0.029 #/op
LoopInc.directInc:·dTLB-stores thrpt 3 1084.787 ± 417.281 #/op
LoopInc.directInc:·iTLB-load-misses thrpt 3 0.081 ± 0.333 #/op
LoopInc.directInc:·iTLB-loads thrpt 3 3.623 ± 19.955 #/op
LoopInc.directInc:·instructions thrpt 3 27114.052 ± 1843.720 #/op
LoopInc.indirectInc thrpt 15 489.164 ± 2.692 ops/ms
LoopInc.indirectInc:·CPI thrpt 3 0.281 ± 0.015 #/op
LoopInc.indirectInc:·L1-dcache-load-misses thrpt 3 0.503 ± 9.071 #/op
LoopInc.indirectInc:·L1-dcache-loads thrpt 3 2149.806 ± 369.040 #/op
LoopInc.indirectInc:·L1-dcache-store-misses thrpt 3 0.167 ± 1.370 #/op
LoopInc.indirectInc:·L1-dcache-stores thrpt 3 1073.895 ± 186.741 #/op
LoopInc.indirectInc:·L1-icache-load-misses thrpt 3 0.313 ± 1.275 #/op
LoopInc.indirectInc:·branch-misses thrpt 3 1.102 ± 0.375 #/op
LoopInc.indirectInc:·branches thrpt 3 2143.670 ± 228.475 #/op
LoopInc.indirectInc:·cycles thrpt 3 8701.665 ± 706.183 #/op
LoopInc.indirectInc:·dTLB-load-misses thrpt 3 0.020 ± 0.301 #/op
LoopInc.indirectInc:·dTLB-loads thrpt 3 2141.965 ± 135.852 #/op
LoopInc.indirectInc:·dTLB-store-misses thrpt 3 0.002 ± 0.029 #/op
LoopInc.indirectInc:·dTLB-stores thrpt 3 1070.376 ± 81.445 #/op
LoopInc.indirectInc:·iTLB-load-misses thrpt 3 0.007 ± 0.135 #/op
LoopInc.indirectInc:·iTLB-loads thrpt 3 0.310 ± 5.768 #/op
LoopInc.indirectInc:·instructions thrpt 3 30968.207 ± 3627.540 #/op
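(For reference, the same counters can be attached programmatically instead of via the command line; a minimal sketch, assuming JMH's bundled LinuxPerfNormProfiler is usable on the machine, since it requires Linux perf. Add the import and tweak the builder in the main() above:

import org.openjdk.jmh.profile.LinuxPerfNormProfiler;

Options options = new OptionsBuilder()
        .include("bench.LoopInc.*")
        .addProfiler(LinuxPerfNormProfiler.class) // equivalent to -prof perfnorm
        .build();
new Runner(options).run();
)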
Oh, both benchmarks execute a comparable number of instructions. The slower one just takes many more cycles (which is also why CPI is far from ideal in directInc; indirectInc, in contrast, yields a near-ideal CPI). If you comb through the possible causes: not many cache misses, not many TLB misses either, but the slow benchmark has lots of branch misses. AHA! Now we know what to look for in the generated code.
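As a rough sanity check (assuming a typical mispredict penalty on the order of 15-20 cycles on modern x86): 536 branch misses/op × 15-20 cycles ≈ 8,000-10,700 cycles, which accounts for a large share of the ≈17,500-cycle gap between directInc (≈26,219 cycles/op) and indirectInc (≈8,702 cycles/op); the remainder plausibly comes from secondary pipeline effects of all those mispredicts.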
Let's look at the generated code once more. -prof perfasm conveniently highlights the jumps, and then you see this...
directInc:
╭│ 0x00007fa0a82a50ff: jmp 0x00007fa0a82a5116
11.39% 16.90% ││ ↗ 0x00007fa0a82a5101: inc %edx ;*iinc
││ │ ; - org.openjdk.LoopInc::directInc@46 (line 18)
12.52% 23.11% ││ │↗↗ 0x00007fa0a82a5103: mov %r10,0xe8(%r11) ;*invokevirtual putLong
││ │││ ; - java.util.concurrent.ThreadLocalRandom::nextSeed@27 (line 241)
12.00% 8.14% ││ │││ 0x00007fa0a82a510a: inc %r8d ;*iinc
││ │││ ; - org.openjdk.LoopInc::directInc@46 (line 18)
0.03% 0.03% ││ │││ 0x00007fa0a82a510d: cmp $0x3e8,%r8d
│╰ │││ 0x00007fa0a82a5114: jge 0x00007fa0a82a50c7 ;*aload_0
│ │││ ; - org.openjdk.LoopInc::directInc@11 (line 19)
0.80% 0.91% ↘ │││ 0x00007fa0a82a5116: mov 0xf0(%r11),%r10d ;*invokevirtual getInt
│││ ; - java.util.concurrent.ThreadLocalRandom::current@9 (line 222)
4.28% 1.23% │││ 0x00007fa0a82a511d: test %r10d,%r10d
╭│││ 0x00007fa0a82a5120: je 0x00007fa0a82a517b ;*ifne
││││ ; - java.util.concurrent.ThreadLocalRandom::current@12 (line 222)
2.11% 0.01% ││││ 0x00007fa0a82a5122: movabs $0x9e3779b97f4a7c15,%r10
0.01% 0.07% ││││ 0x00007fa0a82a512c: add 0xe8(%r11),%r10 ;*ladd
││││ ; - java.util.concurrent.ThreadLocalRandom::nextSeed@24 (line 242)
7.73% 1.89% ││││ 0x00007fa0a82a5133: mov %r10,%r9
1.21% 1.84% ││││ 0x00007fa0a82a5136: shr $0x21,%r9
1.90% 0.03% ││││ 0x00007fa0a82a513a: xor %r10,%r9
2.02% 0.03% ││││ 0x00007fa0a82a513d: movabs $0xff51afd7ed558ccd,%rcx
0.94% 1.82% ││││ 0x00007fa0a82a5147: imul %rcx,%r9 ;*lmul
││││ ; - java.util.concurrent.ThreadLocalRandom::mix32@9 (line 182)
7.01% 2.40% ││││ 0x00007fa0a82a514b: mov %r9,%rcx
││││ 0x00007fa0a82a514e: shr $0x21,%rcx
1.89% 0.70% ││││ 0x00007fa0a82a5152: xor %r9,%rcx
3.11% 2.55% ││││ 0x00007fa0a82a5155: movabs $0xc4ceb9fe1a85ec53,%r9
0.99% 1.50% ││││ 0x00007fa0a82a515f: imul %r9,%rcx
7.66% 2.89% ││││ 0x00007fa0a82a5163: shr $0x20,%rcx
3.70% 1.97% ││││ 0x00007fa0a82a5167: mov %ecx,%r9d
0.11% ││││ 0x00007fa0a82a516a: and $0x1,%r9d ;*iand
││││ ; - java.util.concurrent.ThreadLocalRandom::nextInt@34 (line 356)
3.76% 11.13% ││││ 0x00007fa0a82a516e: cmp $0x1,%r9d
│╰││ 0x00007fa0a82a5172: je 0x00007fa0a82a5101
10.48% 16.62% │ ││ 0x00007fa0a82a5174: test %r9d,%r9d
│ ╰│ 0x00007fa0a82a5177: je 0x00007fa0a82a5103 ;*lookupswitch
│ │ ; - org.openjdk.LoopInc::directInc@15 (line 19)
│ ╰ 0x00007fa0a82a5179: jmp 0x00007fa0a82a5103 ;*aload_0
│ ; - org.openjdk.LoopInc::directInc@11 (line 19)
↘ 0x00007fa0a82a517b: mov $0xffffff5d,%esi
indirectInc:
0.01% 0.01% ↗ 0x00007f65588d8260: mov %edx,%r9d
0.01% │ 0x00007f65588d8263: nopw 0x0(%rax,%rax,1)
11.99% 11.38% │ 0x00007f65588d826c: data16 data16 xchg %ax,%ax ;*iconst_0
│ ; - org.openjdk.LoopInc::indirectInc@11 (line 34)
│ 0x00007f65588d8270: mov 0xf0(%r8),%r10d ;*invokevirtual getInt
│ ; - java.util.concurrent.ThreadLocalRandom::current@9 (line 222)
│ 0x00007f65588d8277: test %r10d,%r10d
│ 0x00007f65588d827a: je 0x00007f65588d8331 ;*ifne
│ ; - java.util.concurrent.ThreadLocalRandom::current@12 (line 222)
0.01% │ 0x00007f65588d8280: movabs $0x9e3779b97f4a7c15,%r10
11.80% 11.49% │ 0x00007f65588d828a: add 0xe8(%r8),%r10 ;*ladd
│ ; - java.util.concurrent.ThreadLocalRandom::nextSeed@24 (line 242)
0.01% 0.01% │ 0x00007f65588d8291: mov %r10,0xe8(%r8) ;*invokevirtual putLong
│ ; - java.util.concurrent.ThreadLocalRandom::nextSeed@27 (line 241)
│ 0x00007f65588d8298: mov %r9d,%edx
0.01% 0.01% │ 0x00007f65588d829b: inc %edx
11.12% 12.40% │ 0x00007f65588d829d: mov %r10,%rcx
0.01% │ 0x00007f65588d82a0: shr $0x21,%rcx
0.03% │ 0x00007f65588d82a4: xor %r10,%rcx
0.06% 0.03% │ 0x00007f65588d82a7: movabs $0xff51afd7ed558ccd,%r10
12.38% 13.94% │ 0x00007f65588d82b1: imul %r10,%rcx ;*lmul
│ ; - java.util.concurrent.ThreadLocalRandom::mix32@9 (line 182)
0.03% 0.01% │ 0x00007f65588d82b5: mov %rcx,%r10
│ 0x00007f65588d82b8: shr $0x21,%r10
0.03% │ 0x00007f65588d82bc: xor %rcx,%r10
11.43% 12.62% │ 0x00007f65588d82bf: movabs $0xc4ceb9fe1a85ec53,%rcx
0.01% │ 0x00007f65588d82c9: imul %rcx,%r10
0.34% 0.30% │ 0x00007f65588d82cd: shr $0x20,%r10
0.85% 0.76% │ 0x00007f65588d82d1: mov %r10d,%r10d
11.81% 11.51% │ 0x00007f65588d82d4: and $0x1,%r10d
2.16% 1.78% │ 0x00007f65588d82d8: cmp $0x1,%r10d
3.45% 3.00% │ 0x00007f65588d82dc: cmovne %r9d,%edx <----- HERE IT IS
17.55% 15.86% │ 0x00007f65588d82e0: inc %r11d ;*iinc
│ ; - org.openjdk.LoopInc::indirectInc@56 (line 33)
│ 0x00007f65588d82e3: cmp [=14=]x3e8,%r11d
╰ 0x00007f65588d82ea: jl 0x00007f65588d8260 ;*if_icmpge
; - org.openjdk.LoopInc::indirectInc@8 (line 33)
Notice the cmovne instead of a jump -- this is where the "better predictable" branching comes from. HotSpot profiles the branches and emits a conditional move when the branch profile is very flat. In other words, it dodges a very likely branch misprediction by paying a bit with the extra latency of a conditional move. In this case, however, the switch is special: it has more than two alternatives (0, 1, and "nothing"). That is why, I speculate, the result increment is not folded into a cmov. (Generally speaking, HotSpot could have stored zero into result on the "default" path, but it blew it, oh well.)
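For intuition, here is a minimal source-level sketch of the shape that maps naturally onto a cmov (illustration only, not what HotSpot literally pattern-matches):

// Illustration only: a flat two-way select is the natural cmov shape.
// 'result' is updated with an unconditional add, so there is no branch
// on 'incr' left to mispredict.
static int selectInc(int result, boolean incr) {
    return result + (incr ? 1 : 0);
}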
To confirm this hypothesis, let's build a directCompleteInc case, where we still use a switch, but now cover all its cases:
@Benchmark
public int directCompleteInc() {
    int result = 0;
    for (int i = 0; i < 1000; i++) {
        switch (getValue()) {
            case 1:
                result++;
                break;
            default:
                break;
        }
    }
    return result;
}
...and measure it, this time without any additional options, just like the OP did:
Benchmark Mode Cnt Score Error Units
LoopInc.directCompleteInc thrpt 5 644.414 ± 0.371 ops/ms
LoopInc.directInc thrpt 5 174.974 ± 0.103 ops/ms
LoopInc.indirectInc thrpt 5 644.015 ± 0.533 ops/ms
There you go.
Confirm with -prof perfasm that directCompleteInc is indeed using cmov. It is.
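(Programmatically, that check is one more profiler on the builder; a sketch, assuming JMH's bundled LinuxPerfAsmProfiler and a perf-capable Linux box:

import org.openjdk.jmh.profile.LinuxPerfAsmProfiler;

Options options = new OptionsBuilder()
        .include("bench.LoopInc.directCompleteInc")
        .addProfiler(LinuxPerfAsmProfiler.class) // equivalent to -prof perfasm
        .build();
new Runner(options).run();
)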
Take a sip.