HashSet vs ArrayList CPU 使用率很高

Question

我有 104k 个字符串值，其中 89k 个是唯一的。我想检查这个列表中是否存在一个字符串。

这是我的 class 及其保存所有这些记录的方法。

public class TestClass {
    private static TestClass singletonObj = null;
    private List<String> stringList= null;

    public static synchronized TestClass getInstance() {
        if(singletonObj == null) {
            singletonObj = new TestClass();
        }
        return singletonObj;
    }


    public boolean isValidString(String token) {
        if(stringList == null) {
            init();
        }
        if(stringList != null && token != null && !token.isEmpty())
            return stringList.contains(token.toLowerCase());
        return false;
    }

    private init() {
     stringList = new ArrayList<String>();
     // put all 104k values in this data structure.
    }
}

我的应用程序尝试同时使用此 isValidString() 方法，每秒大约 20 个请求。这工作正常，但是当我尝试将数据结构更改为 HashSet 时，CPU 使用率非常高。根据我的理解，Hashset 应该比 ArrayList[o(n)] 表现得更好[o(1)]。谁能解释一下为什么会这样？

Answer 1

JDK HashSet 建立在 HashMap<T, Object> 之上，其中值是一个单例“present” object。这意味着 HashSet 的内存消耗与 HashMap 相同：为了存储 SIZE 值，您需要 32 * SIZE + 4 * CAPACITY 字节（加上您的值的大小）。

对于ArrayList，它是java.util.ArrayList的容量乘以参考尺寸(4 bytes on 32bit, 8bytes on 64bit) + [Object header + one int and one references]。

所以HashSet绝对不是memory-friendlycollection.

取决于您使用的是 32-bit 还是 64-bit 虚拟机。也就是说，HashSet 比 ArrayList 受到 8-byte 引用的伤害更严重——根据链接的内存消耗图表，为每个引用添加一个额外的 4 bytes，使 ArrayList 每个元素达到 ~12 字节, 和 HashSet 每个元素最多 ~52 个字节。）

ArrayList 是使用 Object 数组实现的。下图显示了 ArrayList 在 32 位 Java 运行时的内存使用情况和布局：

ArrayList 在 32 位 Java 运行时的内存使用和布局

上图显示，当创建 ArrayList 时，结果是使用 32 bytes 内存的 ArrayList object，以及 Object 数组，默认大小为 10，一个空 ArrayList 的内存总计 88 bytes。这意味着 ArrayList 的大小不准确，因此具有默认容量，恰好是 10 entries.

ArrayList 的属性

Default capacity - 10

Empty size - 88 字节

Overhead - 48 字节加上每个条目 4 字节

Overhead 10K collection - ~40K

Search/insert/delete performance- O(n) — 所用时间与元素数量呈线性关系

HashSet 的功能少于 HashMap，因为它不能包含多个空条目并且不能有重复的条目。该实现是 HashMap 的包装器，HashSet object 管理允许放入 HashMap object 的内容。限制HashMap能力的附加功能意味着HashSets有稍微高一点的内存开销。

HashSet 在 32 位 Java 运行时的内存使用和布局

上图以字节为单位显示了浅堆（个体object的内存使用），以及保留堆（个体object及其child的内存使用objects) java.util.HashSet object 的字节数。浅堆大小为 16 bytes，保留堆大小为 144 bytes。创建 HashSet 时，其默认容量（可以放入集合中的条目数）为 16 entries。当以默认容量创建 HashSet 并且没有条目放入集合中时，它占用 144 bytes。这比 HashMap 的内存使用多了 16 bytes。

Table下面显示了一个HashSet的属性：

HashSet 的属性

Default capacity -16 条记录

Empty size -144 字节

Overhead -16 字节加 HashMap 开销

Overhead for a 10K collection -16 字节加 HashMap 开销

Search/insert/delete performance - O(1) — 所用时间为常数时间，与元素个数无关（假设没有散列冲突）

Answer 2

我的猜测是 HashSet 作为基于散列的结构，从将每个字符串 插入 HashSet 的那一刻起计算每个字符串的散列码 ，即在方法 init 中。这可能是 CPU 走高的时期，它是我们在迭代结构值时为获得更好的吞吐量而付出的代价的一部分。

如果我是对的，方法init结束后，CPU应该下降，程序的速度应该会大大提高，这就是使用HashSet的好处。

顺便说一下：一种可靠的优化方法是 预调整 结构：

ArrayList 的初始大小应等于将包含的最大元素数。
并且 HashSet 的初始大小比最大值大 1.7。

顺便说一句：String.hash 的标准哈希算法计算字符串的所有字符。例如，也许您可能只满足于计算前 100 个字符（当然取决于您正在处理的数据的性质）。然后，你可以将你的字符串封装到你自己的 class 中，用你自己的哈希算法覆盖 hashCode 方法，并覆盖 equals 方法进行严格比较。

Answer 3

我创建了一个简单的 class 来生成 20 个线程，根据此 post.

的底部每秒访问您的字典检查器

我无法复制您的结果 - 但这可能是因为我有权访问输入数据。我已经使用您的 TestClass 实施从英语开放单词列表 (EOWL) 中导入约 130,000 个单词。 ArrayList 或 HashSet 作为 stringList.

的类型，没有看到持续的高 CPU 使用率

我猜你的问题是由于你输入的数据。我尝试添加我的输入字典两次以创建重复 - 显然使用 ArrayList 这只会使列表长度增加一倍，但是使用 HashSet，这意味着代码正在丢弃重复项。您注意到大约 1/5 的输入数据是重复的。在我的测试中有 1/2 个重复项，我确实看到 slight 增加了 CPU 大约 2 秒然后又回落到stringList 初始化后几乎什么都没有。

如果您输入的字符串比我使用的单个单词更复杂，这个“信号”可能会持续更长时间。所以也许那是你的问题。或者 - 也许你有一些其他代码来包装这部分占用 CPU.

N.B. 我会提醒您，正如其他人对您实施 init 的评论一样。在我的实验中，我看到许多线程可以在字典完全初始化之前调用字典检查，从而为相同的测试单词提供不一致的结果。为什么不从你的构造函数中调用它，如果它是一个单例对象？

代码清单

您的带有一些输入数据代码的 TestClass：

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;

public class TestClass {
    private static TestClass singletonObj = null;
    //private List<String> stringList = null;

    private HashSet<String> stringList = null;

    public static synchronized TestClass getInstance() {
        if (singletonObj == null) {
            singletonObj = new TestClass();
        }
        return singletonObj;
    }

    public boolean isValidString(String token) {
        if (stringList == null) {
            init();
        }
        if (stringList != null && token != null && !token.isEmpty())
            return stringList.contains(token.toLowerCase());
        return false;
    }

    private void init() {
        String dictDir = "C:\Users\Richard\Documents\EOWL_CSVs";
        File[] csvs = (new File(dictDir)).listFiles();
        stringList = new HashSet<String>();
        Scanner inFile = null;

        for (File f : csvs) {
            try {
                inFile = new Scanner(new FileReader(f));
            } catch (FileNotFoundException e) {
                e.printStackTrace();
                System.exit(1);
            }

            while (inFile.hasNext()) {
                stringList.add(inFile.next().toLowerCase()
                        .replaceAll("[^a-zA-Z ]", ""));
            }
            inFile.close();
        }

        System.out.println("Dictionary initialised with " + stringList.size()
                + " members");
    }
}

访问它的线程：

import java.io.FileNotFoundException;

public class DictChecker extends Thread {

    TestClass t = null;
    public static int classId = 0;
    String className = null;
    
    
    public void doWork()
    {
        String testString = "Baby";
        if (t.isValidString(testString))
        {
            System.out.println("Got a valid string " + testString + " in class " + className);
        }
        else
        {
            System.out.println(testString + " not in the dictionary");
        }
    }
    
    public void run()
    {
        while (true)
        {
            try {
                DictChecker.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            doWork();
        }
    }
    
    public DictChecker()
    {
        t = TestClass.getInstance();
        className = "dChecker" + classId;
        classId += 1;
        System.out.println("Initialised " + className + " in thread " + this.getName());
    }
    
    public static void main(String[] args) throws FileNotFoundException
    {
        for (int i = 0; i < 20; i++)
        {
             (new DictChecker()).start();
             try {
                DictChecker.sleep(50);//simply to distribute load over the second
            } catch (InterruptedException e) {
                e.printStackTrace();
            } 
        }
    }
}

HashSet vs ArrayList CPU 使用率很高

HashSet vs ArrayList CPU usage is high

java

performance

arraylist

hashset

ArrayList 的属性

HashSet 的属性

代码清单