ArrayList and HashSet memory allocation strange test results

Question

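I was inspired by this thread: "Performance and Memory allocation comparision between List and Set" to actually run some tests and measure the performance difference between ArrayList and HashSet.

In that thread, the answer that interested me most (link) says: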

HashSet consumes about 5.5 times more memory than ArrayList for the same number of elements

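I wanted to verify this with the help of ScalaMeter.

I wrote two simple tests that add between 10000 and 100000 elements to an ArrayList and to a HashSet. Setting the initial size to the maximum did not change the results. I tested the collections with two element types:

Int (consecutive numbers from 0 to 100000)
String (random strings generated with Apache RandomStringUtils)

The code is available in my repository here.

Running the tests gave me these results: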

X-axis - size -> size of the collection
Y-axis - value -> amount of kB used

For collections holding Int:

For collections holding Strings of size 10:

For collections holding Strings of size 50:

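Question:

What happened to the theory mentioned in the quoted answer? Is it false? Or is there some mistake on my side?

Thanks :)!

Update after @andrzej's answer: I have updated the code (and the repository) once more. The results are getting better, but they still do not differ by a factor of 5.5. I am checking a few more things now.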

Answer 1

Please return the object you are measuring from the benchmark snippet.

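measure method "Int" in {
  using(sizes) curve listS in { i =>
    val c = new util.ArrayList[Int](i)
    // foreach instead of map: we only want the side effect of adding
    (0 until i).foreach(t => c.add(t))
    c // return c so that the collection itself is what gets measured
  }

  using(sizes) curve setS in { i =>
    val c = new util.HashSet[Int]()
    (0 until i).foreach(t => c.add(t))
    c // return c so that the collection itself is what gets measured
  }
}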
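To see why returning the collection matters, here is a rough plain-Java sketch of a "heap before vs. heap after" measurement. It is not ScalaMeter's implementation, only an illustration of the general idea that such a measurement can only see objects that are still reachable. The class name RetainedFootprint and the helpers fill and measure are made up for this example, System.gc() is only a hint to the JVM, and Reference.reachabilityFence needs Java 9 or newer, so treat the printed numbers as approximate.

```java
import java.lang.ref.Reference;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.function.Supplier;

// Hypothetical helper class, not part of ScalaMeter or the JDK.
class RetainedFootprint {

    // Used heap after requesting a GC. System.gc() is only a hint,
    // so the value is an approximation.
    static long usedHeap() {
        System.gc();
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Adds the integers 0 until n to the given collection and returns it.
    static Collection<Integer> fill(Collection<Integer> target, int n) {
        for (int i = 0; i < n; i++) {
            target.add(i);
        }
        return target;
    }

    // Approximate number of bytes retained by the collection built by the factory.
    static long measure(Supplier<Collection<Integer>> factory) {
        long before = usedHeap();
        Collection<Integer> kept = factory.get();
        long after = usedHeap();
        // Keep the collection reachable until after the second measurement.
        // Without a live reference the GC is free to reclaim it first,
        // which is the situation of a benchmark that does not return it.
        Reference.reachabilityFence(kept);
        return after - before;
    }

    public static void main(String[] args) {
        int n = 100_000;
        long list = measure(() -> fill(new ArrayList<>(n), n));
        long set = measure(() -> fill(new HashSet<>(), n));
        System.out.println("ArrayList: ~" + list / 1024 + " kB retained");
        System.out.println("HashSet:   ~" + set / 1024 + " kB retained");
    }
}
```

Note that both collections here hold boxed Integer objects, so the numbers include the boxes as well as the collection's own bookkeeping.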

Answer 2

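I think there are two problems here:

As Andrzej mentioned, you do not return the collection from the benchmark snippet. ScalaMeter measures the footprint by performing a GC before and after the benchmark execution (find the details here). If you do not return the collection, it is simply removed from memory by the GC that runs after the test, and the test results are useless. This explains why the memory footprint in your tests stays small (about four bytes per object) and why there is no difference between the two collections. But it does not explain why the footprint grows as the collection size grows, and this is where the second problem comes in.
Some garbage collectors (notably CMS and G1) do not guarantee that all dead objects are removed from memory after a garbage collection is performed. If your JVM picks one of these collectors (or if you specify one manually), this would explain the upward trend of the memory footprint. You can check which collector is in use by passing the -XX:+PrintFlagsFinal option to your test and looking at the values of the UseG1GC and UseConcMarkSweepGC flags.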
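As an alternative to reading the -XX:+PrintFlagsFinal output, the collectors that are active can also be listed from inside the JVM through the standard java.lang.management API. This is a different technique from the flag dump described above, and the class name ActiveCollectors is made up for this sketch. It only tells you something useful if it runs in the same JVM, with the same options, as the benchmark.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for this example.
class ActiveCollectors {

    // Names of the garbage collectors registered in the running JVM.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // The exact names depend on the JVM vendor and version.
        collectorNames().forEach(System.out::println);
    }
}
```

On HotSpot the names usually make the collector obvious (for example, names starting with "G1" when G1 is in use), but the exact strings are JVM-specific, so I would not rely on them beyond a quick sanity check.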

Answer 3

What happened to the theory mentioned in the quoted answer? Is it false?

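We can do some calculations to get an estimate:

Let's look at the OpenJDK source code of ArrayList and HashMap (since HashSet is just a wrapper around HashMap) for hints.

Assume you have n elements to store.

ArrayList

The elements are stored in the field transient Object[] elementData;. So the length of elementData must be at least n.
Assume you instantiated the list with new ArrayList<>(n), so elementData.length is exactly n. Then the size of the list is n*c bytes (where c is the size of an object reference). Here I ignore the size field and the object header of the list.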

HashMap

HashMap stores its elements in transient Node<K,V>[] table;, where a node has the fields

final int hash;
final K key;
V value;
Node<K,V> next;

Then, to store n elements, you need n nodes or n*(3*c + 4) bytes, i.e. each node has 3 object references (3*c bytes) and one int (4 bytes).
According to the HashMap javadoc:

When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.

From this I estimate that table.length == 2*n.
In total, a HashMap needs n*2*c + n*(3*c + 4) = n*5*c + n*4 bytes.

Summary

Now assume you have a 64-bit JVM and the size of an object reference is 8 bytes (i.e. c = 8) (let's ignore things like compressed oops). Then n*5*c + n*4 = n*5*8 + n*4 = n*44 and n*c = n*8.
Finally, n*44 / n*8 = 5.5

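So the original theory that a HashSet consumes about 5.5 times more memory than an ArrayList seems quite plausible, and it looks like something is wrong with your measurements.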
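The arithmetic above can be written down as a small Java sketch so that other values of c are easy to try. It encodes only the simplified model from this answer (n*c bytes for the list, n*2*c + n*(3*c + 4) bytes for the map) and, like the answer, ignores object headers, alignment and the element objects themselves. The class name FootprintEstimate and its method names are made up for this example, and the c = 4 call in main is my own addition to show what the same model gives for 4-byte references.

```java
// Hypothetical helper class encoding the back-of-the-envelope model above.
class FootprintEstimate {

    // ArrayList: one reference per element in elementData.
    static long arrayListBytes(long n, int refSize) {
        return n * refSize;
    }

    // HashSet (backed by HashMap): a table of about 2*n references
    // plus n nodes with 3 references and one 4-byte int each.
    static long hashSetBytes(long n, int refSize) {
        long table = 2 * n * refSize;
        long nodes = n * (3L * refSize + 4);
        return table + nodes;
    }

    // Estimated HashSet footprint divided by estimated ArrayList footprint.
    static double ratio(long n, int refSize) {
        return (double) hashSetBytes(n, refSize) / arrayListBytes(n, refSize);
    }

    public static void main(String[] args) {
        long n = 100_000;
        System.out.println("c = 8: ratio " + ratio(n, 8));
        System.out.println("c = 4: ratio " + ratio(n, 4));
    }
}
```

The ratio does not depend on n in this model, which is why the answer can state a single factor. It does depend on c, so the 5.5 figure is specific to the 8-byte reference assumption.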


