How to properly use anaconda accelerate for GPU

Question

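I am trying to get fast matrix computations with anaconda accelerate. I started with a very basic example: multiplying 2 matrices.

My goal is to somehow get a GPU multiplication that is faster than the usual numpy.dot.

Here is my basic example, based on this documentation.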

from numbapro import guvectorize
from numpy import arange

@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'], '(m,n),(n,p)->(m,p)', target='gpu')
def matmul(A, B, C):
    m, n = A.shape
    n, p = B.shape
    for i in range(m):
        for j in range(p):
            C[i, j] = 0
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]

import numpy as np
import time

for dim in [50, 100, 200]:
    rnd = np.random.RandomState(0)
    a = rnd.rand(dim, dim).astype(np.float32)
    b = rnd.rand(dim, dim).astype(np.float32)
    resgpu = np.zeros_like(a)

    start = time.time()
    rescpu = np.dot(a, b)
    print('CPU:', time.time() - start)

    start = time.time()
    resgpu = matmul(a, b)
    print('GPU:', time.time() - start)

    print(np.allclose(rescpu, resgpu))
    print(np.allclose(resgpu, rescpu))

The results are terrible: the GPU is incredibly slower than the CPU.

CPU: 0.00011801719665527344
GPU: 0.05677294731140137
True
True
CPU: 0.00011205673217773438
GPU: 0.3881375789642334
True
True
CPU: 0.00038933753967285156
GPU: 3.018171787261963
True
True

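Of course I understand that numpy's internal implementation is well optimized, but I expected the official anaconda example to be good. I'm using python 3.4.3 and got errors when using these two helper libraries: http://www.cs.toronto.edu/~tijmen/gnumpy.html and https://github.com/rctn/gpupy

I should say that with gpupy I did get a successful speedup on python 2.7.

So my question is: how can I get matrix multiplication on the GPU that is faster than numpy on the CPU? What is wrong with the official anaconda example, and is there a working library for python3 that allows using the GPU in a numpy-like way?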

===

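Result

Unfortunately, there is no simple and good way for python 3; use 2.7 instead.

Thanks to @rth for recommending the awesome library scikits.cuda

Available functions

Some benchmarks (tested using anaconda mkl, so numpy is fast too)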

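# Python 2.7 (print statements), as stated above
import time
import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.gpuarray as gpuarray
import scikits.cuda.linalg as culinalg

culinalg.init()  # must be called before using culinalg.dot

dim = 10000
rnd = np.random.RandomState(0)
a = rnd.rand(dim, dim).astype(np.float32)
b = rnd.rand(dim, dim).astype(np.float32)
# Copy the inputs to the GPU before timing starts
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)

start = time.time()
rescpu = np.dot(a, b)
print 'CPU:', time.time() - start

start = time.time()
resgpu = culinalg.dot(a_gpu, b_gpu)
print 'GPU:', time.time() - start

# Copy the result back to the host and compare with the CPU result
resgpu = resgpu.get()
print np.allclose(rescpu, resgpu)
print np.allclose(resgpu, rescpu)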

And the results

CPU: 16.4765479565
GPU: 0.000520944595337
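
A note on reading such timings: GPU libraries commonly queue work asynchronously, so a timer stopped right after the call may not include the computation itself, and a single run can also include one-off setup costs. The following is only a minimal sketch of a more careful timing helper, not the method used for the numbers above. The names `best_time` and `sync` are hypothetical; for a GPU library, `sync` would wrap whatever call that library offers to wait for pending work. As written, the example runs on the CPU only.

```python
import time

import numpy as np


def best_time(fn, sync=None, warmup=1, repeats=5):
    """Return the best wall-clock time of fn() over several runs.

    sync is an optional callable (hypothetical hook) that blocks until
    all pending work has finished, so asynchronous work is timed too.
    """
    # Warm-up runs absorb one-off costs such as lazy initialization.
    for _ in range(warmup):
        fn()
        if sync is not None:
            sync()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for completion before stopping the clock
        best = min(best, time.perf_counter() - start)
    return best


rnd = np.random.RandomState(0)
a = rnd.rand(200, 200).astype(np.float32)
b = rnd.rand(200, 200).astype(np.float32)
print("CPU best of 5:", best_time(lambda: np.dot(a, b)))
```

Taking the minimum over several runs is one common choice; a median would be equally reasonable.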

Answer 1

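You should have a look at BLAS implementations, which provide highly optimized routines for classical linear algebra operations. The multiplication of dense matrices is performed with the gemm function.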
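
To make the gap between a hand-written loop and an optimized routine concrete, here is a small CPU-only sketch. It compares a triple loop (the same algorithm as the kernel in the question, run here as plain Python) with `np.dot`, which typically dispatches to the BLAS gemm that numpy was built against. The helper name `naive_matmul` and the matrix size are illustrative assumptions; timings will vary by machine, so none are quoted.

```python
import time

import numpy as np


def naive_matmul(a, b):
    """Triple-loop matrix product, for comparison purposes only."""
    m, n = a.shape
    n2, p = b.shape
    assert n == n2, "inner dimensions must match"
    c = np.zeros((m, p), dtype=a.dtype)
    for i in range(m):
        for j in range(p):
            acc = 0.0
            for k in range(n):
                acc += a[i, k] * b[k, j]
            c[i, j] = acc
    return c


rnd = np.random.RandomState(0)
a = rnd.rand(40, 40).astype(np.float32)
b = rnd.rand(40, 40).astype(np.float32)

start = time.perf_counter()
slow = naive_matmul(a, b)
t_naive = time.perf_counter() - start

start = time.perf_counter()
fast = np.dot(a, b)
t_dot = time.perf_counter() - start

print("naive loops:", t_naive)
print("np.dot     :", t_dot)
print("same result:", np.allclose(slow, fast, rtol=1e-4))
```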

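For instance, matrix multiplication in numpy is significantly improved if it is compiled against an optimized BLAS implementation (OpenBLAS, ATLAS, MKL, etc.).

For GPU, NVIDIA provides the cuBLAS implementation. According to this answer, it can be called with numpy arrays using the scikits.cuda module. Anaconda accelerate, which you are using, also provides direct bindings to cuBLAS.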
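
As a rough illustration of the scikits.cuda route, the sketch below wraps the GPU path in a function that falls back to `np.dot` when no CUDA stack is available, so it can be run anywhere. It assumes PyCUDA and scikit-cuda are installed and that the package is importable as `skcuda` (the name used by newer releases; older ones used `scikits.cuda`). The function name `gpu_dot` is hypothetical.

```python
import numpy as np


def gpu_dot(a, b):
    """Multiply two float32 matrices on the GPU if possible, else on the CPU.

    Returns a (result, backend_name) tuple.
    """
    try:
        # Assumption: PyCUDA, scikit-cuda and a CUDA device are available.
        import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
        import pycuda.gpuarray as gpuarray
        import skcuda.linalg as culinalg

        culinalg.init()
        a_gpu = gpuarray.to_gpu(a)  # host -> device copy
        b_gpu = gpuarray.to_gpu(b)
        c_gpu = culinalg.dot(a_gpu, b_gpu)  # matrix product via cuBLAS
        return c_gpu.get(), "cuBLAS"  # device -> host copy
    except Exception:
        # No usable GPU stack: fall back to numpy on the CPU.
        return np.dot(a, b), "numpy"


rnd = np.random.RandomState(0)
a = rnd.rand(64, 64).astype(np.float32)
b = rnd.rand(64, 64).astype(np.float32)
c, backend = gpu_dot(a, b)
print(backend, np.allclose(c, np.dot(a, b), rtol=1e-4))
```

For small matrices the two copies can easily outweigh the multiplication itself, so this wrapper is meant to show the call sequence rather than to be fast.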

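By the way, if you want to benchmark CPU vs GPU performance for matrix multiplication, you should also specify the BLAS that Numpy uses for the CPU calculations, since the results can differ by an order of magnitude (see this benchmark).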
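
One way to find out which BLAS a given numpy build uses is `np.show_config()`, which prints the build configuration. The small sketch below captures that output as text so it can be stored next to benchmark results; the helper name `numpy_build_info` is hypothetical, and the exact format of the output depends on the numpy version.

```python
import contextlib
import io

import numpy as np


def numpy_build_info():
    """Return numpy's build configuration (including BLAS/LAPACK) as text."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        np.show_config()
    return buf.getvalue()


info = numpy_build_info()
print("numpy", np.__version__)
print(info)
```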


Tags: numpy, anaconda, python-3.4, numba-pro