Can one vectorize the summing of a large data set onto record-specific elements of an array?

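I have a very large data set (billions of records) that I need to sum onto a two-dimensional array. For each value there are indices specifying which element of the array the value should be added to: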

import numpy as np

I = [0, 2, 1, 2, 1]
J = [1, 2, 1, 2, 1]
X = [2., 5., 0., 6., 4.]

A = np.zeros((3,3), dtype = 'f')

for i in range(len(I)) :
    A[I[i], J[i]] += X[i]

Result:

> print(A)
[[ 0.  2.  0.]
 [ 0.  4.  0.]
 [ 0.  0. 11.]]

My question: is there a way to vectorize the operation above and so eliminate the for loop?

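Your index arrays are well suited to fancy indexing. In the simplest case, where no (i, j) pair occurs more than once, you can write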

A[I, J] += X

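If any index pair is repeated, i.e. you want to increment the same position in A more than once (as in the example data, where (1, 1) and (2, 2) each appear twice), the in-place form applies only one update per position. The more reliable approach is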

np.add.at(A, (I, J), X)
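
As a quick illustration of why the distinction matters, here is a minimal sketch using the arrays from the question. The names `A_fancy` and `A_at` are only for this example.

```python
import numpy as np

I = np.array([0, 2, 1, 2, 1])
J = np.array([1, 2, 1, 2, 1])
X = np.array([2., 5., 0., 6., 4.])

# Fancy-index in-place add: the right-hand side A[I, J] + X is computed
# first and then assigned, so a repeated (i, j) pair keeps only one of
# its contributions instead of their sum.
A_fancy = np.zeros((3, 3), dtype='f')
A_fancy[I, J] += X

# Unbuffered ufunc.at: every element of X is accumulated, duplicates included.
A_at = np.zeros((3, 3), dtype='f')
np.add.at(A_at, (I, J), X)

print(A_at)
print(np.array_equal(A_fancy, A_at))  # False for this data, because of the duplicates
```

Only `A_at` reproduces the result of the original loop; `A_fancy` matches it only when every (i, j) pair is unique.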

Here is a fully vectorized solution using pandas:

import numpy as np
import pandas as pd

X = [
     [(0, 1), 2.],
     [(2, 2), 5.],
     [(1, 1), 0.],
     [(2, 2), 6.],
     [(1, 1), 4.],
     ]
# Create a dataframe with x, y, and val. I'm not doing it very
# efficiently here, but since you control the data structure
# you can just start from this kind of dataframe.
records = [(r[0], r[1], t) for r, t in X]
df = pd.DataFrame.from_records(records, columns=["x", "y", "val"])

A = np.zeros((3, 3), dtype='float64')

# Sum all values that share the same (x, y), leaving one row per unique pair.
df = df.groupby(["x", "y"], as_index=False).sum()
# The pairs are now unique, so plain fancy-index assignment is safe.
A[df.x.to_numpy(), df.y.to_numpy()] = df.val.to_numpy()

Output:

array([[ 0.,  2.,  0.],
       [ 0.,  4.,  0.],
       [ 0.,  0., 11.]])
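
Another option worth benchmarking at this scale, not taken from the answers above, is to flatten the index pairs and let `np.bincount` do the summing. The sketch below assumes the data can be read in chunks; the helper name `scatter_sum` and the two-slice "chunking" are hypothetical stand-ins for however the records are actually loaded.

```python
import numpy as np

def scatter_sum(I, J, X, shape):
    """Sum X into an array of the given 2-D shape at positions (I, J)."""
    # Map each (i, j) pair to its index in the flattened (row-major) array.
    flat = np.ravel_multi_index((I, J), shape)
    # bincount with weights adds up all weights that share a flat index.
    sums = np.bincount(flat, weights=X, minlength=shape[0] * shape[1])
    return sums.reshape(shape)

I = np.array([0, 2, 1, 2, 1])
J = np.array([1, 2, 1, 2, 1])
X = np.array([2., 5., 0., 6., 4.])

# Accumulate chunk by chunk so the full data set never has to sit in memory.
A = np.zeros((3, 3))
for chunk in (slice(0, 3), slice(3, 5)):  # stand-in for real chunked reads
    A += scatter_sum(I[chunk], J[chunk], X[chunk], A.shape)

print(A)
```

Note that `np.bincount` returns float64 sums when weights are given, so the accumulator here is float64 rather than the question's `'f'` (float32).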