Numpy.digitize() gives more bins than expected when using bin edges from numpy.histogram

Question

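My goal is to assign to each pixel of an image (in my case a NumPy array) the frequency of the bin it falls into. For example, say I have values like [0.5, 1, 2, 2, 4] and bin edges like [0, 1.5, 2.5, 3.5, 4.5]. The frequency of the first bin is then 2, the second 2, the third 0, and the fourth 1, so the per-value result should be [2, 2, 2, 2, 1].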
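To make the target concrete, here is a minimal brute-force sketch of the toy example above. The names `values`, `edges`, and `expected` are illustrative only and do not appear in my real code. The loop assumes the convention documented for np.histogram: every bin is half-open, [left, right), except the last one, which also includes its right edge.

```python
import numpy as np

# Toy data from the example above (names are illustrative).
values = np.array([0.5, 1, 2, 2, 4])
edges = np.array([0, 1.5, 2.5, 3.5, 4.5])

# Frequencies per bin, using the given edges.
hist, _ = np.histogram(values, bins=edges)
print(hist)  # [2 2 0 1]

# Brute-force reference: for each value, find its bin and look up the count.
expected = []
n_bins = len(hist)
for v in values:
    for i in range(n_bins):
        is_last = i == n_bins - 1
        in_bin = edges[i] <= v < edges[i + 1]
        # The last bin is closed on the right as well.
        if in_bin or (is_last and v == edges[-1]):
            expected.append(int(hist[i]))
            break

print(expected)  # [2, 2, 2, 2, 1]
```

This is far too slow for a 200 x 200 image, which is why I want a vectorized approach instead.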

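My plan is to first use numpy.histogram() to get the frequencies and bin edges, then use numpy.digitize() with those bin edges to assign each pixel the index of the bin it falls into. I then want to use those indices to look up the corresponding frequency in hist. However, numpy.digitize() gives me more bins than there are in hist, and I don't know why.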

My code looks like this:

First, I have an image (a NumPy array) like this:

import numpy as np
a_noise = np.random.normal(0, 1, 40000).reshape((200, 200))

Next, I compute the histogram:

hist, bin_edges = np.histogram(a_noise, bins='fd')

Now I use np.digitize to assign a bin index to each pixel:

a_binidx = np.digitize(a_noise, bin_edges, right=True)

As a result, I get the following:

hist.shape

returns (109,), so there are 109 bins in total and the possible indices range from 0 to 108.

bin_edges.shape

returns (110,), so there are 110 bin edges in total, which makes sense to me. But when I check which bin indices were actually assigned, I get:

np.unique(a_binidx)

array([ 0, 6, 7, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109])

The highest index is 109, but the highest possible index for hist is 108.

Why do I get a highest index of 109 instead of 108?

Answer 1

I solved this using pandas.cut() (with pandas imported as pd):

a_binidx = pd.cut(a_noise.flatten(), bins=bin_edges, labels=np.arange(hist.shape[0]), include_lowest=True)
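For comparison, here is a NumPy-only sketch that avoids the extra index. As far as I understand the documentation, np.digitize returns indices from 0 to len(bins), where 0 means "below the first edge" and len(bins) means "above the last edge". With right=True the rule is bins[i-1] < x <= bins[i], so the minimum pixel, which equals bin_edges[0], gets index 0 and every other pixel gets a 1-based bin number from 1 to 109. Passing only the interior edges with the default right=False yields 0-based indices that follow np.histogram's half-open bins. The seeded generator and the name `a_freq` are illustrative additions, not part of the original code.

```python
import numpy as np

# Reproducible stand-in for the image (the seed is arbitrary).
rng = np.random.default_rng(0)
a_noise = rng.normal(0, 1, 40000).reshape((200, 200))

hist, bin_edges = np.histogram(a_noise, bins='fd')

# Digitize against the interior edges only. Values below bin_edges[1]
# get index 0 and values at or above bin_edges[-2] get len(hist) - 1,
# so the result indexes hist directly.
a_binidx = np.digitize(a_noise, bin_edges[1:-1])

# Per-pixel frequency of the bin each pixel falls into.
a_freq = hist[a_binidx]

# Sanity check: recounting the indices should reproduce the histogram.
recount = np.bincount(a_binidx.ravel(), minlength=hist.shape[0])
print(np.array_equal(recount, hist))
```

Note that pd.cut uses right-closed intervals by default, so a value lying exactly on an interior edge can land in a different bin than with np.histogram. For continuous data such as this noise image that should rarely matter.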

Tags: python, arrays, numpy, histogram