Fastest way to compute a large number of 3x3 dot products

I have to compute a large number of 3x3 linear transformations (e.g. rotations). This is what I have so far:

import numpy as np
from scipy import sparse
from numba import jit

n = 100000 # number of transformations
k = 100 # number of vectors for each transformation

A = np.random.rand(n, 3, k) # vectors
Op = np.random.rand(n, 3, 3) # operators
sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n+1))) # same as Op but as block-diag

def dot1():
    """ naive approach: many times np.dot """
    return np.stack([np.dot(o, a) for o, a in zip(Op, A)])

@jit(nopython=True)
def dot2():
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in range(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot3():
    """ using einsum """
    return np.einsum("ijk,ikl->ijl", Op, A)

def dot4():
    """ using sparse block diag matrix """
    return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)

On a 2012 MacBook Pro, this gives me:

In [62]: %timeit dot1()
783 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [63]: %timeit dot2()
261 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [64]: %timeit dot3()
293 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [65]: %timeit dot4()
281 ms ± 6.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Apart from the naive approach, all approaches perform similarly. Is there a way to accelerate this significantly?
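For readers unfamiliar with the block-sparse variant used in dot4, here is a minimal sketch on a tiny instance (n = 2, k = 4, sizes chosen only for illustration). It checks that the BSR matrix is the block-diagonal matrix of the operators and that it gives the same result as the einsum version. The explicit shape argument and the dense reference built with scipy.linalg.block_diag are additions for this check and are not part of the code above.

```python
import numpy as np
from scipy import sparse
from scipy.linalg import block_diag

n, k = 2, 4  # tiny sizes, for illustration only
rng = np.random.default_rng(0)
Op = rng.random((n, 3, 3))  # operators
A = rng.random((n, 3, k))   # vectors

# Block i of the BSR matrix sits in block-row i, block-column i.
sOp = sparse.bsr_matrix(
    (Op, np.arange(n), np.arange(n + 1)), shape=(3 * n, 3 * n)
)

# Dense reference: the operators laid out along the diagonal.
dense = block_diag(*Op)
bsr_matches_dense = np.allclose(sOp.toarray(), dense)

# Same computation as dot4 and dot3, on the small instance.
via_sparse = sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)
via_einsum = np.einsum("ijk,ikl->ijl", Op, A)
results_agree = np.allclose(via_sparse, via_einsum)

print(bsr_matches_dense, results_agree)
```

Reshaping A to (3 * n, k) stacks the vectors of each transformation in consecutive rows, so each 3x3 diagonal block only ever multiplies its own three rows.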

Edit

(The CUDA approach is the best one when it is available. What follows is a comparison of the non-CUDA versions.)

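Following the various suggestions, I modified dot2, added the Op@A method, and added a version based on #59356461.

from numba import njit, prange

@njit(fastmath=True, parallel=True)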
def dot2(Op, A):
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in prange(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot5(Op, A):
    """ using matmul """
    return Op@A

@njit(fastmath=True, parallel=True)
def dot6(Op, A):
    """ another numba.jit with parallel (based on #59356461) """
    new = np.empty_like(A)
    for i_n in prange(A.shape[0]):
        for i_k in range(A.shape[2]):
            for i_x in range(3):
                acc = 0.0j
                for i_y in range(3):
                    acc += Op[i_n, i_x, i_y] * A[i_n, i_y, i_k]
                new[i_n, i_x, i_k] = acc
    return new


This is what I get (on a different machine) using benchit:

def gen(n, k):
    Op = np.random.rand(n, 3, 3) + 1j * np.random.rand(n, 3, 3)
    A = np.random.rand(n, 3, k) + 1j * np.random.rand(n, 3, k)
    return Op, A

# benchit
import benchit
funcs = [dot1, dot2, dot3, dot4, dot5, dot6]
inputs = {n: gen(n, 100) for n in [100,1000,10000,100000,1000000]}

t = benchit.timings(funcs, inputs, multivar=True, input_name='Number of operators')
t.plot(logy=True, logx=True)

Use Op@A, as suggested by @hpaulj in the comments.

Here is a comparison using benchit:

import numpy as np
from scipy import sparse
from numba import jit

def dot1(A,Op):
    """ naive approach: many times np.dot """
    return np.stack([np.dot(o, a) for o, a in zip(Op, A)])

@jit(nopython=True)
def dot2(A,Op):
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in range(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot3(A,Op):
    """ using einsum """
    return np.einsum("ijk,ikl->ijl", Op, A)

def dot4(A,Op):
    """ using sparse block diag matrix """
    n = A.shape[0]
    sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n+1))) # same as Op but as block-diag
    return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)

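def dot5(A,Op):
    """ using matmul """
    return Op@A

k = 100 # number of vectors for each transformation (assumed to be the value used in the question)
in_ = {n:[np.random.rand(n, 3, k), np.random.rand(n, 3, 3)] for n in [100,1000,10000,100000,1000000]}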

Their performance seems to be close over the larger sizes, with dot5 being slightly faster.
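If benchit is not installed, a rough version of the same comparison can be sketched with the standard-library timeit module. This is only an illustration: the helper name compare, the sizes, and the repeat count are assumptions made here, and the timings will differ from machine to machine. The functions keep the (A, Op) argument order used above.

```python
import timeit
import numpy as np

def dot3(A, Op):
    """ using einsum """
    return np.einsum("ijk,ikl->ijl", Op, A)

def dot5(A, Op):
    """ using matmul, which broadcasts over the leading axis """
    return Op @ A

def compare(n, k=100, repeat=3):
    """Return the best wall-clock time in seconds for each function (hypothetical helper)."""
    rng = np.random.default_rng(0)
    A = rng.random((n, 3, k))
    Op = rng.random((n, 3, 3))
    # Both approaches must compute the same stacked products.
    assert np.allclose(dot3(A, Op), dot5(A, Op))
    timings = {}
    for f in (dot3, dot5):
        timings[f.__name__] = min(
            timeit.repeat(lambda: f(A, Op), number=1, repeat=repeat)
        )
    return timings

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, compare(n))
```

Taking the minimum over a few repeats reduces noise from other processes; it does not replace a proper benchmark over the full range of sizes.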

You have received some great suggestions, but I wanted to add one more because of this specific goal:

Is there a way to accelerate this significantly?

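Realistically, if you need these operations to be significantly faster (which often means > 10x), you would probably want to use a GPU for the matrix multiplication. As a quick example: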

import numpy as np
import cupy as cp

n = 100000 # number of transformations
k = 100 # number of vectors for each transformation

# CPU version
A = np.random.rand(n, 3, k) # vectors
Op = np.random.rand(n, 3, 3) # operators

def dot5(): # the suggested, best CPU approach
    return Op@A


# GPU version using a V100
gA = cp.asarray(A)
gOp = cp.asarray(Op)

# run once to ignore JIT overhead before benchmarking
gOp@gA;

%timeit dot5()
%timeit gOp@gA; cp.cuda.Device().synchronize() # need to sync for a fair benchmark
112 ms ± 546 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.19 ms ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In one of the answers, Nick mentioned using the GPU, which is of course the best solution.

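But, as a general rule, what you are doing is likely CPU-bound. Therefore (GPU approach aside), the best you can do is to use all the cores on your machine and work in parallel.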

For that, you would want to use multiprocessing (not Python's multithreading!) to split the job into pieces that run on each core in parallel.

This is not trivial, but it is not too hard either, and there are many good examples and guides online.

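But if you have an 8-core machine, it would probably give you an almost 8x speed increase, as long as you are careful to avoid memory bottlenecks: do not pass many small objects between processes, but group them all together at the start.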