Jetson Nano Numba GPU Vector Add benchmarking

I am trying to add random vectors on the GPU rather than the CPU, using Numba's vectorize.

Here is my example:

import numpy as np
from timeit import default_timer as timer
from numba import vectorize

TARGET = 'cpu'
#TARGET = 'cuda'

@vectorize(["float64(float64, float64)"], target=TARGET)
def VectorAdd(a, b):
    return a + b

def main():
    N = 32_000_000

    A = np.random.randn(N)
    B = np.random.randn(N)
    C = np.zeros(N, dtype=np.float64)

    print("Target unit: {}, number: {}".format(TARGET, N))
    start = timer()
    C = VectorAdd(A, B)
    vADD_time = timer() - start

    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))

    print("Time: {}".format(vADD_time))


if __name__ == "__main__":
    main()

The CPU turns out to be about 30x faster than CUDA. What am I doing wrong? I expected CUDA to be faster.

Target unit: cuda, number: 32000000
C[:5] = [ 1.90362553 -2.6426849  -1.84243752 -0.00806387  0.63785922]
C[-5:] = [ 0.93794028  0.98118905  0.80945834  0.64350251 -1.62342203]
Time: 17.02285827000003

Target unit: cpu, number: 32000000
C[:5] = [ 0.77441334  0.35994057 -0.15359408 -0.20547891 -2.04108084]
C[-5:] = [1.47338646 3.01013048 0.71417303 1.62773266 2.80878941]
Time: 0.5268858470000168

The operation you are performing is far too simple to exploit the parallelism the GPU offers; instead, you mostly pay for the overhead of transferring memory between host and device.

Try running the code below: by moving the data to the device by hand, it keeps the time spent on data transfer out of the measurement.

import numpy as np
from timeit import default_timer as timer
from numba import (vectorize, cuda)

# TARGET = 'cpu'
TARGET = 'cuda'


@vectorize(["float64(float64, float64)"], target=TARGET)
def VectorAdd(a, b):
    return a + b


def main():
    N = 32_000_000
    A = np.random.randn(N)
    B = np.random.randn(N)
    C = np.zeros(N, dtype=np.float64)

    # Copy the inputs to the device up front so the transfer is not timed
    A = cuda.to_device(A)
    B = cuda.to_device(B)

    print("Target unit: {}, number: {}".format(TARGET, N))
    start = timer()
    C = VectorAdd(A, B)
    cuda.synchronize()  # wait for the kernel to finish before stopping the clock
    vADD_time = timer() - start
    C = C.copy_to_host()
    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))
    print("Time: {}".format(vADD_time))


if __name__ == "__main__":
    main()
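
An element-wise add is also memory-bound, so even with the inputs already on the device there is very little arithmetic for the GPU to accelerate. As a rough, hypothetical illustration of the "too simple" point, the sketch below times a deliberately compute-heavy element-wise function (the name HeavyOp, the 50-iteration inner loop, and the warm-up call are made-up choices, not part of the original benchmark):

import math
import numpy as np
from timeit import default_timer as timer
from numba import vectorize, cuda


@vectorize(["float64(float64, float64)"], target='cuda')
def HeavyOp(a, b):
    # Many transcendental operations per element, so compute
    # dominates memory traffic instead of the other way around.
    x = a
    for _ in range(50):
        x = math.sin(x) * math.cos(b) + math.exp(-x * x)
    return x


def main():
    N = 32_000_000
    A = cuda.to_device(np.random.randn(N))
    B = cuda.to_device(np.random.randn(N))

    HeavyOp(A, B)        # warm-up: one-time CUDA context / kernel setup
    cuda.synchronize()

    start = timer()
    C = HeavyOp(A, B)
    cuda.synchronize()   # make sure the kernel has finished before reading the clock
    print("Time: {}".format(timer() - start))
    print("C[:5] = " + str(C.copy_to_host()[:5]))


if __name__ == "__main__":
    main()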

In addition, I suggest increasing the number of iterations executed in order to see the GPU speedup.
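
For instance, a sketch of one way to do that: time the same call in a loop so the average reflects steady-state kernel time rather than one-off launch and setup costs. NITER is an arbitrary repeat count, and A and B are the device arrays from main() above, so this would replace the single timed call there:

    NITER = 100  # arbitrary number of repetitions

    start = timer()
    for _ in range(NITER):
        C = VectorAdd(A, B)
    cuda.synchronize()  # ensure all launched kernels have completed
    vADD_time = (timer() - start) / NITER
    print("Average time per call: {}".format(vADD_time))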