Jetson Nano Numba GPU Vector Add benchmarking
I am trying to add random vectors on the GPU rather than the CPU, using Numba vectorize.
Here is my example:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

TARGET = 'cpu'
# TARGET = 'cuda'

@vectorize(["float64(float64, float64)"], target=TARGET)
def VectorAdd(a, b):
    return a + b

def main():
    N = 32_000_000
    A = np.random.randn(N)
    B = np.random.randn(N)
    C = np.zeros(N, dtype=np.float64)

    print("Target unit: {}, number: {}".format(TARGET, N))
    start = timer()
    C = VectorAdd(A, B)
    vADD_time = timer() - start

    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))
    print("Time: {}".format(vADD_time))

if __name__ == "__main__":
    main()
The CPU is 30 times faster than CUDA. What am I doing wrong? I expected CUDA to be faster.
Target unit: cuda, number: 32000000
C[:5] = [ 1.90362553 -2.6426849 -1.84243752 -0.00806387 0.63785922]
C[-5:] = [ 0.93794028 0.98118905 0.80945834 0.64350251 -1.62342203]
Time: 17.02285827000003
Target unit: cpu, number: 32000000
C[:5] = [ 0.77441334 0.35994057 -0.15359408 -0.20547891 -2.04108084]
C[-5:] = [1.47338646 3.01013048 0.71417303 1.62773266 2.80878941]
Time: 0.5268858470000168
The operation you are performing is too simple to take advantage of the parallelism the GPU offers; instead, you only lose performance to the memory-transfer overhead.
Try running the code below, which moves the data to the device manually so that the transfer time is not included in the measurement.
import numpy as np
from timeit import default_timer as timer
from numba import (vectorize, cuda)

# TARGET = 'cpu'
TARGET = 'cuda'

@vectorize(["float64(float64, float64)"], target=TARGET)
def VectorAdd(a, b):
    return a + b

def main():
    N = 32_000_000
    A = np.random.randn(N)
    B = np.random.randn(N)
    C = np.zeros(N, dtype=np.float64)

    # Copy the inputs to the device up front so the host-to-device
    # transfer is not part of the timed region.
    A = cuda.to_device(A)
    B = cuda.to_device(B)

    print("Target unit: {}, number: {}".format(TARGET, N))
    start = timer()
    C = VectorAdd(A, B)
    vADD_time = timer() - start

    # The result is a device array; copy it back to the host for printing.
    C = C.copy_to_host()

    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))
    print("Time: {}".format(vADD_time))

if __name__ == "__main__":
    main()
In addition, I suggest increasing the number of iterations you execute in order to see the GPU speedup.
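For instance, a minimal sketch of such a repeated benchmark, assuming a hypothetical ITERATIONS count and keeping every array on the device so that only the kernel launches are timed (the warm-up call and cuda.synchronize() are additions of mine, not part of the code above):

import numpy as np
from timeit import default_timer as timer
from numba import vectorize, cuda

@vectorize(["float64(float64, float64)"], target='cuda')
def VectorAdd(a, b):
    return a + b

def main():
    N = 32_000_000
    ITERATIONS = 100  # hypothetical repeat count, tune for your board

    # Move the inputs to the device once, outside the timed region.
    A = cuda.to_device(np.random.randn(N))
    B = cuda.to_device(np.random.randn(N))

    # Warm-up call so JIT compilation is not included in the timing.
    C = VectorAdd(A, B)
    cuda.synchronize()

    start = timer()
    for _ in range(ITERATIONS):
        C = VectorAdd(A, B)  # inputs and output stay on the device
    cuda.synchronize()       # kernel launches are asynchronous; wait before reading the timer
    elapsed = timer() - start

    print("Average time per VectorAdd: {}".format(elapsed / ITERATIONS))
    print("C[:5] = " + str(C.copy_to_host()[:5]))

if __name__ == "__main__":
    main()

Averaging over many launches hides the one-off launch and allocation costs, which otherwise dominate a single 32M-element add.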