NumPy filling values inside given bounding box coordinates for a large array

Question

I have a very large 3D array:

large = np.zeros((2000, 1500, 700))

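Actually, large is an image, but it holds 700 values per coordinate. I also have 400 bounding boxes, and they do not have a fixed shape. For each box, I store a tuple of lower and upper coordinates, as follows: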

boxes_y = [(y_lower0, y_upper0), (y_lower1, y_upper1), ..., (y_lower399, y_upper399)]
boxes_x = [(x_lower0, x_upper0), (x_lower1, x_upper1), ..., (x_lower399, x_upper399)]

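Then, for each box, I want to fill the corresponding region of the large array with a vector of size 700. Specifically, I have an embeddings array with one vector per box: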

embeddings = np.random.rand(400, 700) # In real case, these are not random. Just consider the shape

What I want to do is:

for i in range(400):
   large[boxes_y[i][0]: boxes_y[i][1], boxes_x[i][0]: boxes_x[i][1]] = embeddings[i]

This works, but it is too slow for such a large array. I am looking for a way to vectorize this computation.

Answer 1

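One big issue is that the input is really huge (~15.6 GiB). Another is that, in the worst case, it is written up to 400 times (resulting in up to ~6240 GiB written to RAM). The problem is that overlapping regions are written multiple times.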
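
The sizes above follow from simple arithmetic. The sketch below reproduces them, assuming float64 items (the default dtype of np.zeros) and the worst case where every one of the 400 boxes covers the whole image. Multiplying the unrounded ~15.65 GiB by 400 gives about 6258 GiB; the ~6240 GiB figure comes from the rounded 15.6 GiB.

```python
import math

import numpy as np

GIB = 1024 ** 3

shape = (2000, 1500, 700)  # shape of `large` in the question
n_boxes = 400

elements = math.prod(shape)  # Python int, so no risk of integer overflow
size_f64 = elements * np.dtype(np.float64).itemsize / GIB
size_f32 = elements * np.dtype(np.float32).itemsize / GIB
# Worst case: every box covers the whole image, so the array is rewritten per box
worst_case_written = n_boxes * size_f64

print(f"float64 array: {size_f64:.1f} GiB")
print(f"float32 array: {size_f32:.1f} GiB")
print(f"worst case with {n_boxes} overlapping writes: {worst_case_written:.0f} GiB")
```

This prints 15.6 GiB, 7.8 GiB and 6258 GiB respectively.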

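A better solution is to iterate over the pixels of the "image" (the first two dimensions of large) to find which bounding box must be copied into each one, as suggested by @dankal444. This is similar to what Z-buffer-based algorithms do in computer graphics.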
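
The core data structure of this idea is a per-pixel index buffer: instead of writing 700 values per pixel for every box, first record which box owns each pixel. A minimal pure-NumPy sketch, where the function name build_index_buffer is hypothetical; it paints boxes in order, so (as in the original loop) the last box wins where boxes overlap.

```python
import numpy as np


def build_index_buffer(shape_2d, boxes_y, boxes_x):
    """Return an (n, m) int16 array holding, for each pixel, the index of the
    last box that covers it, or -1 if no box covers it."""
    index = np.full(shape_2d, -1, dtype=np.int16)  # 400 box ids fit in int16
    for k, ((y0, y1), (x0, x1)) in enumerate(zip(boxes_y, boxes_x)):
        # Cheap compared with the original loop: one int16 per pixel, not 700 floats
        index[y0:y1, x0:x1] = k
    return index
```

For example, build_index_buffer((6, 5), [(0, 3), (2, 5)], [(0, 4), (1, 3)]) marks the overlap (row 2, columns 1 and 2) with 1, the later box, and leaves uncovered pixels at -1.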

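Building on this, an even better solution is to use a scanline rendering algorithm. In your case, it is much simpler than the classical algorithm because you use bounding boxes rather than complex polygons; the classical algorithm is a bit too complex for such a simple case. For each scanline (2000 here), you can quickly filter the bounding boxes that write to that scanline, then iterate over the filtered boxes and overwrite the box index stored in each pixel. That is enough. This operation can be done in parallel with Numba, and it is very fast because it mostly works on data that stays in the CPU cache.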
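
A sequential NumPy sketch of this scanline filtering step (the function name build_index_buffer_scanline is hypothetical). For each row y it keeps only the boxes whose [lower, upper) y range contains y, then paints their x ranges; boxes are visited in increasing index order, so later boxes still overwrite earlier ones.

```python
import numpy as np


def build_index_buffer_scanline(shape_2d, boxes_y, boxes_x):
    """Scanline construction of the per-pixel box-index buffer (-1 = no box)."""
    n, m = shape_2d
    by = np.asarray(boxes_y)  # shape (num_boxes, 2): (lower, upper) per box
    bx = np.asarray(boxes_x)
    index = np.full((n, m), -1, dtype=np.int16)
    for y in range(n):
        # Filtering step: indices of the boxes active on this scanline, ascending
        active = np.nonzero((by[:, 0] <= y) & (y < by[:, 1]))[0]
        row = index[y]  # a view, so writes go straight into `index`
        for k in active:
            row[bx[k, 0]:bx[k, 1]] = k
    return index
```

Each iteration of the outer loop only writes to its own row, which is why this loop can be parallelized (the answer uses nb.prange for it) without races.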

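The final step performs the actual data writes based on the previously computed indices (still in parallel with Numba). This step is still memory-bound, but the output array is written only once (at worst 15.6 GiB of RAM is written, or 7.8 GiB with float32 items). On most machines, this should take only a fraction of a second. If this is still not fast enough, you can try a dedicated GPU, since GPU RAM is usually much faster than main RAM (often by about an order of magnitude).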
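
The write step can also be sketched in plain NumPy with fancy indexing, assuming an index buffer where -1 marks pixels covered by no box (the function name fill_from_index is hypothetical). Each covered pixel of large is written exactly once. Unlike the Numba version, this sketch builds a temporary of shape (covered_pixels, 700) from embeddings, which at full size can be as large as large itself, so it mainly illustrates the semantics rather than being a memory-efficient replacement.

```python
import numpy as np


def fill_from_index(large, embeddings, index):
    """Write embeddings[index[y, x]] into large[y, x, :] for every pixel
    with index[y, x] >= 0; other pixels are left untouched."""
    ys, xs = np.nonzero(index >= 0)  # coordinates of covered pixels
    # Gather one embedding vector per covered pixel, then scatter them in one go
    large[ys, xs] = embeddings[index[ys, xs]]
```

The result matches the original per-box loop, because the index buffer already encodes which box wrote last to each pixel.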

Here is the implementation:

import numba as nb
import numpy as np

# Assume the last dimension of `large` and `embeddings` is contiguous in memory
@nb.njit('void(float32[:,:,::1], float32[:,::1], int_[:,::1], int_[:,::1])', parallel=True)
def fastFill(large, embeddings, boxes_y, boxes_x):
    n, m, l = large.shape
    boxCount = embeddings.shape[0]
    assert embeddings.shape == (boxCount, l)
    assert boxes_y.shape == (boxCount, 2)
    assert boxes_x.shape == (boxCount, 2)
    imageBoxIds = np.full((n, m), -1, dtype=np.int16)
    for y in nb.prange(n):
        # Filtering -- A sort is not required since the number of bounding-box is small
        boxIds = np.where((boxes_y[:,0] <= y) & (y < boxes_y[:,1]))[0]
        for k in boxIds:
            lower, upper = boxes_x[k]
            imageBoxIds[y, lower:upper] = k
    # Actual filling
    for y in nb.prange(n):
        for x in range(m):
            boxId = imageBoxIds[y, x]
            if boxId >= 0:
                large[y, x, :] = embeddings[boxId]
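
fastFill is compiled eagerly for one signature only, so the question's inputs (lists of tuples, and a float64 large created by np.zeros) will not match it as-is. A sketch of the conversion, with the hypothetical helper name to_fastfill_inputs and assuming Numba's int_ corresponds to NumPy's np.int_ (the platform default integer); large itself must also be created as a C-contiguous float32 array, e.g. np.zeros((2000, 1500, 700), dtype=np.float32).

```python
import numpy as np


def to_fastfill_inputs(boxes_y, boxes_x, embeddings):
    """Convert box lists and embeddings to the C-contiguous dtypes
    expected by fastFill's signature (int_[:,::1] and float32[:,::1])."""
    by = np.ascontiguousarray(boxes_y, dtype=np.int_)  # shape (num_boxes, 2)
    bx = np.ascontiguousarray(boxes_x, dtype=np.int_)
    emb = np.ascontiguousarray(embeddings, dtype=np.float32)  # (num_boxes, 700)
    return by, bx, emb
```

The converted arrays can then be passed as fastFill(large, emb, by, bx).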

Here is the benchmark:

large = np.zeros((1000, 750, 700), dtype=np.float32)  # 8 times smaller in memory
boxes_y = np.cumsum(np.random.randint(0, large.shape[0]//2, size=(400, 2)), axis=1)
boxes_x = np.cumsum(np.random.randint(0, large.shape[1]//2, size=(400, 2)), axis=1)
embeddings = np.random.rand(400, 700).astype(np.float32)

# Called many times
for i in range(400):
   large[boxes_y[i][0]:boxes_y[i][1], boxes_x[i][0]:boxes_x[i][1]] = embeddings[i]

# Called many times
fastFill(large, embeddings, boxes_y, boxes_x)

Here are the results on my machine:

Initial code:        2.71 s
Numba (sequential):  0.13 s
Numba (parallel):    0.12 s   (~22x faster than the initial code)

Note that the first run is slower. Even in that case, the Numba version is still about 10 times faster.
