Is there a faster way than looping over numpy arrays?

Question

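If I have two numpy arrays of values, how can I quickly build a third array that tells me how many times each pair of values from the first two arrays occurs?

For example: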

import numpy as np

x = np.round(np.random.random(2500),2)
xIndex = np.linspace(0, 1, 100)

y = np.round(np.random.random(2500)*10,2)
yIndex = np.linspace(0, 10, 1000)

z = np.zeros((100,1000))

Right now I am running the following loop, which is very slow:

for m in x:
    for n in y:
        q = np.where(xIndex == m)[0][0]
        l = np.where(yIndex == n)[0][0]
        z[q][l] += 1

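I can then draw a contour plot (or heat map, or whatever) of xIndex, yIndex, and z. But I know this is not a "Pythonic" way to solve the problem, and there is no way I can run it over hundreds of millions of data points in anything approaching a reasonable time.

How do I do this properly? Thanks for reading!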

Answer 1

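You can shorten your code dramatically.

First, since you are binning onto a linear scale, you can eliminate the explicit arrays xIndex and yIndex entirely. You can express the exact indices into z as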

xi = np.round(np.random.random(2500) * 100).astype(int)
yi = np.round(np.random.random(2500) * 1000).astype(int)

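Second, you don't need the loop. The problem with the normal + operator (a.k.a. np.add) is that it's buffered. A consequence of that is that you won't get the right count when the same index occurs multiple times. Fortunately, ufuncs have an at method that handles this, and add is a ufunc.

Third, and finally, broadcasting lets you specify how to mesh the index arrays for fancy indexing: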
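To make the buffering point concrete, here is a minimal sketch with a made-up index array (`idx` and the array sizes are just for illustration) that compares in-place `+=` through fancy indexing with `np.add.at`:

```python
import numpy as np

idx = np.array([0, 0, 0, 2])   # index 0 appears three times

buffered = np.zeros(3)
buffered[idx] += 1             # repeated indices are only counted once

unbuffered = np.zeros(3)
np.add.at(unbuffered, idx, 1)  # every occurrence is accumulated

print(buffered)    # [1. 0. 1.]
print(unbuffered)  # [3. 0. 1.]
```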

np.add.at(z, (xi[:, None], yi), 1)
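As a sanity check, the sketch below compares the broadcast call with the explicit double loop on a small toy grid. The grid size, the seed, and the names `z_loop`, `z_at` and `z_outer` are assumptions for illustration only. It also shows that, because every x is paired with every y, the same result can be written as an outer product of per-axis counts.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed is arbitrary
xi = rng.integers(0, 4, size=6)  # toy row indices into a 4x5 grid
yi = rng.integers(0, 5, size=7)  # toy column indices

# Reference: the explicit double loop over every (x, y) pair.
z_loop = np.zeros((4, 5))
for q in xi:
    for l in yi:
        z_loop[q, l] += 1

# Broadcasting: shape (6, 1) against (7,) expands to all 6*7 index pairs.
z_at = np.zeros((4, 5))
np.add.at(z_at, (xi[:, None], yi), 1)

# All-pairs counting factors into per-axis counts.
z_outer = np.outer(np.bincount(xi, minlength=4),
                   np.bincount(yi, minlength=5))

assert np.array_equal(z_loop, z_at)
assert np.array_equal(z_at, z_outer)
```

The outer-product form never materializes the full grid of index pairs, so it may be worth trying if the broadcast version uses too much memory on very large inputs.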

If you are building a 2D histogram, you don't need to round the raw data. You can just round the indices instead:

x = np.random.random(2500)
y = np.random.random(2500) * 10

z = np.zeros((100,1000))
np.add.at(z, (np.round(100 * x).astype(int), np.round(100 * y).astype(int)), 1)
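If the goal really is a 2D histogram of paired samples, `np.histogram2d` is another option. This is a swapped-in alternative rather than the method used above: it assigns samples to half-open bins between edges instead of rounding to the nearest index, so counts near bin boundaries can differ from the rounding version. The seed and the explicit `range` are assumptions chosen to match the 100 x 1000 grid. Passing the range also sidesteps the case where rounding produces an index equal to the axis length (for example `x >= 0.995` rounds to index 100), which would be out of bounds for `z`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is arbitrary
x = rng.random(2500)            # values in [0, 1)
y = rng.random(2500) * 10       # values in [0, 10)

# 100 bins over [0, 1] for x and 1000 bins over [0, 10] for y.
H, x_edges, y_edges = np.histogram2d(
    x, y, bins=(100, 1000), range=[[0, 1], [0, 10]]
)

print(H.shape)  # (100, 1000)
print(H.sum())  # 2500.0
```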
