Most efficient way to make a histogram from an array in Java

Question

I want to count the frequency of numbers in a double array by binning (example array below), which is essentially what Python numpy's histogram() does. I'm in a constrained environment with access to basic Java Math and the jblas library, but nothing else: no other third-party libraries such as colt can be installed.

double[] x1 = {1, 1, 2, 2, 1, 3, 2};

I have a separate sorted array that marks where each bin starts and ends (binEdges), as follows:

binEdges = [4.9E-324, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 4.0, 4.0, 5.0, 5.0, 7.0, 1.7976931348623157E308]

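Note that the binEdges array may contain duplicate elements, and I want to keep it that way. For the binEdges array above, the resulting frequency counts would therefore be: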

binCounts = [0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Read from left to right, the binCounts array combined with binEdges means the following (note the brackets on the bin intervals):

Bin interval         Frequency
[4.9E-324, 1.0)      0
[1.0, 1.0)           0
[1.0, 1.0)           0
[1.0, 2.0)           3   (since we have 3 ones in x1)
. . .                . . .

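I currently have the implementation below, which runs in O(n lg n), assuming the sort takes O(n lg n). I would like to know whether this can be brought below O(n lg n). I have also looked around in jblas but am not aware of a library function for binning. If anyone has insight into other native Java tricks or clever indexing schemes, please point me to them. Other suggestions for improving the code to reduce the running time are welcome too.

Reducing the running time matters because the amount of data at hand is large.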

import java.util.Arrays;

public static double[] binCounts(double[] x, double[] binEdges) {
    double[] ret = new double[binEdges.length - 1];
    Arrays.sort(x); // takes O(n lg n) and sorts the caller's array in place; the loop below is effectively O(n)
    int k = 0; // number of elements binned so far; each scan of x resumes here
    for (int i = 0; i < binEdges.length - 1; i++) {
        if (binEdges[i] == binEdges[i + 1])
            continue; // the interval [e, e) is empty
        for (int j = k; j < x.length; j++) {
            if (x[j] >= binEdges[i + 1])
                break; // x[j] belongs to a later bin
            else if (x[j] >= binEdges[i] && x[j] < binEdges[i + 1]) {
                ret[i] += 1;
                k++;
            }
        }
    }
    return ret;
}

Answer 1

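If you look at your data, try to identify whether it follows any pattern. You can then pick a sorting algorithm whose best case fits that pattern, or take a cue from how image compression exploits structure in its input.

Consider video game objects: the coordinate update on each frame is often only a small adjustment, so the data stays nearly sorted. A plain bubble sort is then close to its best-case time complexity most of the time.

If your data shows that the possible values form a small set of numbers, consider a single pass that counts on the fly. Then you do not need a sorting step at all.

Side note: in my experience, large data volumes are mostly a space-complexity problem. Consider a machine with limited RAM but a large hard disk. In that case I would expect the bottleneck to be disk reads and writes or, on a distributed system, network storage. Something like your new double[binEdges.length - 1] could cause an OutOfMemoryError.

Also, try using a HashSet or something similar.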
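As a rough illustration of the single-pass idea, the sketch below counts each distinct value in one pass with a HashMap, without sorting. The class and method names (ValueCounter, countValues) are made up for this example and are not part of the question's code. It assumes the number of distinct values is small enough for the map to fit comfortably in memory.

```java
import java.util.HashMap;
import java.util.Map;

final class ValueCounter {
    // One pass over x, no sorting: maps each distinct value to its number of occurrences.
    // Hypothetical helper, named for illustration only.
    public static Map<Double, Integer> countValues(double[] x) {
        Map<Double, Integer> counts = new HashMap<>();
        for (double v : x) {
            // Inserts 1 for a new value, otherwise adds 1 to the existing count.
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling ValueCounter.countValues(x1) on the question's array yields a count of 3 for 1.0, 3 for 2.0 and 1 for 3.0. The per-value counts still have to be assigned to bins afterwards, but that step only touches the distinct values rather than every element of x.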

Answer 2

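You can use a TreeMap to do a binary search over binEdges:

import java.util.Map.Entry;
import java.util.NavigableMap;
import java.util.TreeMap;

public static double[] binCounts(double[] x, double[] binEdges) {
    int binEdgesSize = binEdges.length;
    // Duplicate edges overwrite each other, so each key maps to its last index.
    NavigableMap<Double, Integer> binEdgesMap = new TreeMap<>();
    for (int i = 0; i < binEdgesSize; ++i)
        binEdgesMap.put(binEdges[i], i);
    double[] ret = new double[binEdgesSize - 1];
    for (double d : x) {
        // Bins are [lower, upper), so look up the greatest edge <= d.
        Entry<Double, Integer> e = binEdgesMap.floorEntry(d);
        // Skip values below the first edge or at/above the last edge.
        if (e != null && e.getValue() < ret.length)
            ++ret[e.getValue()];
    }
    return ret;
}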
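The same lookup can also be done with a hand-written binary search on the sorted binEdges array itself, which avoids boxing every value into a Double. The sketch below is one possible variant, not the answerer's code; the names BinarySearchHistogram and lastEdgeAtOrBelow are invented for illustration. It assumes binEdges is sorted ascending with at least two elements, treats bins as half-open [lower, upper) as in the question, and ignores values outside the edge range.

```java
final class BinarySearchHistogram {
    // Returns the largest index i with binEdges[i] <= d, or -1 if d is below every edge.
    // With duplicate edges this lands on the last duplicate, i.e. the non-empty bin.
    static int lastEdgeAtOrBelow(double[] binEdges, double d) {
        int lo = 0, hi = binEdges.length; // search window is [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (binEdges[mid] <= d) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo - 1;
    }

    // O(n log m) for n values and m edges; x is neither sorted nor modified.
    public static double[] binCounts(double[] x, double[] binEdges) {
        double[] ret = new double[binEdges.length - 1];
        for (double d : x) {
            int i = lastEdgeAtOrBelow(binEdges, d);
            // i == ret.length means d is at or above the last edge; skip it.
            if (i >= 0 && i < ret.length) {
                ret[i] += 1;
            }
        }
        return ret;
    }
}
```

When the number of edges m is much smaller than the number of values n, O(n log m) is below the O(n lg n) of sorting x first, and the caller's array is left untouched.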

Answer 3

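@saka1029 Thank you for showing the NavigableMap container class (I did not know about it). It seems this can be simplified by eliminating the ret object and using the keys directly. Since the value in the binCount map is an Integer, we can increment it: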

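import java.util.NavigableMap;
import java.util.TreeMap;

public static double[] binCounts(double[] x, double[] binEdges) {
    int binEdgesSize = binEdges.length;
    // binCount: Key = lower edge of bin; Value = item count
    NavigableMap<Double, Integer> binCount = new TreeMap<>();
    for (int i = 0; i < binEdgesSize; ++i)
        binCount.put(binEdges[i], 0);  // Initialize count to zero
    for (double item : x) {
        Double edge = binCount.floorKey(item);
        if (edge != null)
            binCount.merge(edge, 1, Integer::sum);  // get(edge)++ does not compile
    }
    // Duplicate edges collapse into one key, so this yields one count per distinct edge,
    // in ascending edge order; the last entry counts items at or above the last edge.
    return binCount.values().stream().mapToDouble(Integer::doubleValue).toArray();
}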
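One thing to keep in mind with a map keyed by edge: a TreeMap stores each key once, so repeated edges collapse into a single entry and the counts no longer line up with the question's binCounts layout. The sketch below shows one way to expand per-edge counts back to one count per consecutive edge pair. The names ExpandBinCounts, countPerEdge and expand are hypothetical, and it assumes binEdges is sorted ascending with at least two elements.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

final class ExpandBinCounts {
    // Counts items per distinct lower edge (hypothetical helper for this sketch).
    public static NavigableMap<Double, Integer> countPerEdge(double[] x, double[] binEdges) {
        NavigableMap<Double, Integer> binCount = new TreeMap<>();
        for (double edge : binEdges) {
            binCount.put(edge, 0);
        }
        for (double item : x) {
            Double edge = binCount.floorKey(item); // greatest edge <= item, or null
            if (edge != null) {
                binCount.merge(edge, 1, Integer::sum);
            }
        }
        return binCount;
    }

    // Expands to one count per consecutive edge pair. Empty bins [e, e) keep 0,
    // and the count stored under the last edge (items at or above it) is dropped.
    public static double[] expand(NavigableMap<Double, Integer> binCount, double[] binEdges) {
        double[] ret = new double[binEdges.length - 1];
        for (int i = 0; i < ret.length; i++) {
            if (binEdges[i] < binEdges[i + 1]) {
                ret[i] = binCount.get(binEdges[i]);
            }
        }
        return ret;
    }
}
```

For the question's x1 and binEdges, expand(countPerEdge(x1, binEdges), binEdges) reproduces the binCounts array given in the question.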

java

algorithm

histogram