有没有更好的方法来实现直方图？

Question

我有一个 2D uint16 numpy 数组，我想计算这个数组的直方图。我使用的函数是：

def calc_hist(source):
    hist = np.zeros(2**16, dtype='uint16')
    for i in range(source.shape[0]):
       for j in range(source.shape[1]):
           hist[source[i, j]] = hist[source[i, j]] + 1
    return hist

此函数执行时间过长。据我了解，numpy 模块中有一个直方图函数，但我不知道如何使用它。我试过：

hist,_ = np.histogram(source.flatten(), bins=range(2**16))

但我得到的结果与我自己的函数不同。我怎样才能调用 numpy.histogram 来获得相同的结果？或者还有其他选择吗？

Answer 1

正如 Corley Brigman 指出的那样，通过 bins=range(x) 确定 bin 边缘 [1]。因此，您最终会得到具有对应边 [0, 1), [1, 2), ..., [x-1, x].

的 x-1 个 bin

在您的情况下，您将有 2^16-1 个垃圾箱。要修复它，只需使用 range(2**16+1).

[1] https://numpy.org/doc/stable/reference/generated/numpy.histogram.html?highlight=histogram#numpy.histogram

Answer 2

对于数据类型为 uint16 的输入，numpy.bincount 应该工作正常：

hist = np.bincount(source.ravel(), minlength=2**16)

您的函数几乎与 bincount 完全相同，但 bincount 是在 C 中实现的。

例如，以下检查 bincount 的使用是否与 calc_hist 函数产生相同的结果：

In [159]: rng = np.random.default_rng()

In [160]: x = rng.integers(0, 2**16, size=(1000, 1000))                                      

In [161]: h1 = calc_hist(x)

In [162]: h2 = np.bincount(x.ravel(), minlength=2**16)

In [163]: (h1 == h2).all()  # Verify that they are the same.
Out[163]: True

使用ipython %timeit 命令检查性能。您可以看到使用 bincount 比 快得多 。

In [164]: %timeit calc_hist(x)
2.66 s ± 21.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [165]: %timeit np.bincount(x.ravel(), minlength=2**16)
3.13 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

有没有更好的方法来实现直方图？

Is there a better way of implementing a histogram?

python

performance

numpy

histogram