Can one vectorize the summing of a large data set onto record-specific elements of an array?
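I have a very large data set (billions of records) that I need to sum onto a two-dimensional array. For each value, there are indices specifying which element of the array the value should be added to: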
import numpy as np
I = [0, 2, 1, 2, 1]
J = [1, 2, 1, 2, 1]
X = [2., 5., 0., 6., 4.]
A = np.zeros((3,3), dtype = 'f')
for i in range(len(I)):
    A[I[i], J[i]] += X[i]
Result:
> print(A)
[[ 0.  2.  0.]
 [ 0.  4.  0.]
 [ 0.  0. 11.]]
My question: is there a way to vectorize the above operation so that the for loop is eliminated?
Your index arrays are a natural fit for fancy indexing. In the simplest case, you can do
A[I, J] += X
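If any indices are repeated (as they are in your example data), i.e. you want to increment some location in A more than once, the fancy-indexed += applies only one of the increments to that location. The more reliable approach is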
np.add.at(A, (I, J), X)
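To make the difference concrete, here is a small sketch that runs both forms on the data from the question and compares them with the explicit loop. The variable names A_buffered, A_at and A_loop are mine, chosen only for illustration.

```python
import numpy as np

I = [0, 2, 1, 2, 1]
J = [1, 2, 1, 2, 1]
X = [2., 5., 0., 6., 4.]

# Fancy-indexed +=: the right-hand side is computed once and then written
# back, so a repeated (i, j) pair keeps only one of its contributions.
A_buffered = np.zeros((3, 3))
A_buffered[I, J] += X

# Unbuffered np.add.at: every contribution is accumulated, duplicates included.
A_at = np.zeros((3, 3))
np.add.at(A_at, (I, J), X)

# Reference result: the explicit loop from the question.
A_loop = np.zeros((3, 3))
for i, j, x in zip(I, J, X):
    A_loop[i, j] += x

print(np.array_equal(A_at, A_loop))        # np.add.at matches the loop
print(np.array_equal(A_buffered, A_loop))  # plain += does not, because (2, 2) and (1, 1) repeat
```

On this data only np.add.at reproduces the loop result, since the cells (2, 2) and (1, 1) each receive two values.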
Here is a fully vectorized solution using pandas:
import numpy as np
import pandas as pd

X = [
    [(0, 1), 2.],
    [(2, 2), 5.],
    [(1, 1), 0.],
    [(2, 2), 6.],
    [(1, 1), 4.],
]
# Create a dataframe with x, y, and val. I'm not doing it very
# efficiently here, but since you control the data structure
# you can just start from this kind of dataframe.
records = [(r[0], r[1], t) for r, t in X]
df = pd.DataFrame.from_records(records, columns=["x", "y", "val"])
A = np.zeros((3, 3), dtype='float64')
# Sum the values that share the same (x, y), so each cell appears only once.
df = df.groupby(["x", "y"], as_index=False).sum()
# With unique index pairs, a plain fancy-indexed assignment is safe.
A[df["x"].to_numpy(), df["y"].to_numpy()] = df["val"].to_numpy()
Output:
array([[ 0.,  2.,  0.],
       [ 0.,  4.,  0.],
       [ 0.,  0., 11.]])
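The same "sum the duplicates first, then place the totals" idea can also be sketched in NumPy alone. This is my own substitution, not something the answers above propose: it flattens each (i, j) pair to a single index with np.ravel_multi_index and lets np.bincount add up the weights per index. The names shape and flat are illustrative, and the result is float64 rather than the float32 array in the question.

```python
import numpy as np

I = np.array([0, 2, 1, 2, 1])
J = np.array([1, 2, 1, 2, 1])
X = np.array([2., 5., 0., 6., 4.])
shape = (3, 3)

# Map each (i, j) pair to its position in the flattened (row-major) array.
flat = np.ravel_multi_index((I, J), shape)

# bincount sums the weights that share a flat index; minlength makes sure
# cells that receive no value are still present as zeros.
A = np.bincount(flat, weights=X, minlength=shape[0] * shape[1]).reshape(shape)

print(A)
```

Because the per-cell sums are additive, a data set too large for memory could in principle be processed in chunks, adding each chunk's bincount result to a running total. Whether that is faster than np.add.at for a given workload would need to be measured.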