Why is a direct memory 'array' slower to clear than a usual Java array?
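I have set up a JMH benchmark to measure which is faster: Arrays.fill with nulls, System.arraycopy from an array of nulls, zeroing a DirectByteBuffer, or zeroing an unsafe memory block.
Let's put aside the fact that zeroing directly allocated memory is a rare case, and discuss the results of my benchmark.
Here is the JMH benchmark snippet (full code available via a gist), including the unsafe.setMemory case suggested by @apangin in the original post, and the byteBuffer.put(byte[], offset, length) and longBuffer.put(long[], offset, length) cases suggested by @jan-schaefer: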

    @Benchmark
    @BenchmarkMode(Mode.SampleTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void arrayFill() {
        Arrays.fill(objectHolderForFill, null);
    }

    @Benchmark
    @BenchmarkMode(Mode.SampleTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void arrayCopy() {
        System.arraycopy(nullsArray, 0, objectHolderForArrayCopy, 0, objectHolderForArrayCopy.length);
    }

    @Benchmark
    @BenchmarkMode(Mode.SampleTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void directByteBufferManualLoop() {
        // Relative puts advance the buffer position, so this loop only does work
        // if the position is reset before each invocation (setup not shown here).
        while (referenceHolderByteBuffer.hasRemaining()) {
            referenceHolderByteBuffer.putLong(0);
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.SampleTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void directByteBufferBatch() {
        referenceHolderByteBuffer.put(nullBytes, 0, nullBytes.length);
    }

    @Benchmark
    @BenchmarkMode(Mode.SampleTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void directLongBufferManualLoop() {
        // Same caveat as above: the position must be reset between invocations.
        while (referenceHolderLongBuffer.hasRemaining()) {
            referenceHolderLongBuffer.put(0L);
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.SampleTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void directLongBufferBatch() {
        referenceHolderLongBuffer.put(nullLongs, 0, nullLongs.length);
    }

    @Benchmark
    @BenchmarkMode(Mode.SampleTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void unsafeArrayManualLoop() {
        long addr = referenceHolderUnsafe;
        long pos = 0;
        for (int i = 0; i < size; i++) {
            unsafe.putLong(addr + pos, 0L);
            pos += 1 << 3; // advance by 8 bytes (one long)
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.SampleTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void unsafeArraySetMemory() {
        // 8L keeps the byte count in long arithmetic
        unsafe.setMemory(referenceHolderUnsafe, size * 8L, (byte) 0);
    }

Here is what I got (Java 1.8, JMH 1.13, Core i3-6100U 2.30 GHz, Win10):
100 elements

    Benchmark                                       Mode       Cnt     Score    Error  Units
    ArrayNullFillBench.arrayCopy                    sample  5234029    39,518 ±  0,991  ns/op
    ArrayNullFillBench.directByteBufferBatch        sample  6271334    43,646 ±  1,523  ns/op
    ArrayNullFillBench.directLongBufferBatch        sample  4615974    45,252 ±  2,352  ns/op
    ArrayNullFillBench.arrayFill                    sample  4745406    76,997 ±  3,547  ns/op
    ArrayNullFillBench.unsafeArrayManualLoop        sample  5980381    78,811 ±  2,870  ns/op
    ArrayNullFillBench.unsafeArraySetMemory         sample  5985884    85,062 ±  2,096  ns/op
    ArrayNullFillBench.directLongBufferManualLoop   sample  4697023   116,242 ±  2,579  ns/op  WOW
    ArrayNullFillBench.directByteBufferManualLoop   sample  7504629   208,440 ± 10,651  ns/op  WOW

I skipped all the loop implementations (except arrayFill, kept for scale) in the further tests.

1000 elements

    Benchmark                                       Mode       Cnt     Score    Error  Units
    ArrayNullFillBench.arrayCopy                    sample  6780681   184,516 ± 14,036  ns/op
    ArrayNullFillBench.directLongBufferBatch        sample  4018778   293,325 ±  4,074  ns/op
    ArrayNullFillBench.directByteBufferBatch        sample  4063969   313,171 ±  4,861  ns/op
    ArrayNullFillBench.arrayFill                    sample  6862928   518,886 ±  6,372  ns/op

10000 elements

    Benchmark                                       Mode       Cnt     Score    Error  Units
    ArrayNullFillBench.arrayCopy                    sample  2551851  2024,543 ± 12,533  ns/op
    ArrayNullFillBench.directLongBufferBatch        sample  2958517  4469,210 ± 10,376  ns/op
    ArrayNullFillBench.directByteBufferBatch        sample  2892258  4526,945 ± 33,443  ns/op
    ArrayNullFillBench.arrayFill                    sample  5689507  5028,592 ±  9,074  ns/op

Could you please clarify the following questions:
1. Why is `unsafeArraySetMemory` a bit slower than `unsafeArrayManualLoop`?
2. Why is directByteBuffer 2.5X-5X slower than the others?
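Why is unsafeArraySetMemory a bit slower than unsafeArrayManualLoop?
My guess is that it is not as optimised for setting an exact multiple of longs. It has to check whether the length you pass is not quite a multiple of 8.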
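To make that guess concrete, here is a minimal sketch of what a general-purpose fill has to do when the length is not known to be a multiple of 8. It is not the actual implementation of unsafe.setMemory; the class and method names are hypothetical, and it uses a plain ByteBuffer with absolute puts instead of Unsafe so that it runs on any JDK.

```java
import java.nio.ByteBuffer;

public class ZeroFillSketch {
    /**
     * Zeroes the first `length` bytes of `buf`. Because `length` may be any
     * value, the routine needs a word-sized main loop plus a byte-sized tail,
     * which a caller who knows the size is a multiple of 8 can skip.
     */
    public static void zero(ByteBuffer buf, int length) {
        int wordEnd = length & ~7; // largest multiple of 8 not exceeding length
        int i = 0;
        for (; i < wordEnd; i += 8) {
            buf.putLong(i, 0L);    // absolute put: 8 bytes at a time
        }
        for (; i < length; i++) {
            buf.put(i, (byte) 0);  // tail: the remaining 0..7 bytes
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(32);
        for (int i = 0; i < 32; i++) {
            buf.put(i, (byte) 0x7f);
        }
        zero(buf, 13);
        System.out.println(buf.get(12) + " " + buf.get(13)); // prints "0 127"
    }
}
```

The manual loop in the benchmark only ever writes whole longs, so it never pays for the tail handling or the checks that decide whether a tail exists.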
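Why is directByteBuffer an order of magnitude slower than the others?
An order of magnitude would be about 10x; it is about 2.5x slower. It has to bounds check every access and update a field instead of a local variable.
Note: I have found that the JVM doesn't always loop-unroll code that uses Unsafe. You might try unrolling it yourself to see if that helps.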
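As a rough illustration of unrolling by hand, the sketch below writes four longs per iteration and keeps the index in a local variable. It is an assumption-laden example rather than the benchmark's code: it swaps Unsafe for absolute ByteBuffer.putLong(int, long) calls so it stays portable, the class and method names are made up, and whether it is actually faster would have to be measured with JMH.

```java
import java.nio.ByteBuffer;

public class UnrolledZeroSketch {
    /** Zeroes the first `longCount` longs of `buf`, four per loop iteration. */
    public static void zeroLongs(ByteBuffer buf, int longCount) {
        int unrolledEnd = longCount & ~3; // largest multiple of 4 not exceeding longCount
        int i = 0;
        for (; i < unrolledEnd; i += 4) {
            int base = i << 3;            // byte offset of long number i
            // Absolute puts do not touch the buffer's position field,
            // although each call is still bounds checked.
            buf.putLong(base, 0L);
            buf.putLong(base + 8, 0L);
            buf.putLong(base + 16, 0L);
            buf.putLong(base + 24, 0L);
        }
        for (; i < longCount; i++) {      // leftover 0..3 longs
            buf.putLong(i << 3, 0L);
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(10 * 8);
        for (int i = 0; i < 10; i++) {
            buf.putLong(i << 3, -1L);
        }
        zeroLongs(buf, 7);
        System.out.println(buf.getLong(6 << 3) + " " + buf.getLong(7 << 3)); // prints "0 -1"
    }
}
```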
Note: Native code can use XMM 128-bit instructions, and does so increasingly, which is why the copy might be so fast. Access to XMM instructions may come in Java 10.
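The comparison is a bit unfair. You are using a single operation when using Arrays.fill and System.arraycopy, but in the DirectByteBuffer case you are using a loop and multiple invocations of putLong. If you look at the implementation of putLong, you will see that there is a lot going on there, such as checking accessibility. You should try a batch operation like put(long[] src, int srcOffset, int longCount) and see what happens.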
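A minimal sketch of that batch approach, assuming a pre-allocated array of zeros at least as long as the buffer (the class and method names are hypothetical). One detail matters for repeated calls: the bulk put is relative, so it advances the position, and a second call without resetting it would throw BufferOverflowException.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.LongBuffer;

public class BatchZeroSketch {
    /**
     * Zeroes the whole `target` with a single bulk put.
     * Assumes zeros.length >= target.capacity().
     */
    public static void zero(LongBuffer target, long[] zeros) {
        target.clear(); // position = 0, limit = capacity; does not erase data
        target.put(zeros, 0, target.capacity());
    }

    public static void main(String[] args) {
        LongBuffer buf = ByteBuffer.allocateDirect(100 * 8)
                .order(ByteOrder.nativeOrder())
                .asLongBuffer();
        long[] zeros = new long[100];
        for (int i = 0; i < 100; i++) {
            buf.put(i, 42L);
        }
        zero(buf, zeros);
        zero(buf, zeros); // safe to repeat because the position is reset each time
        System.out.println(buf.get(0) + " " + buf.get(99)); // prints "0 0"
    }
}
```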