How to make Alea faster?
After doing some work implementing various ML algorithms in Alea, I tried benchmarking some simple but fundamental routines in it. To my surprise, Alea takes roughly 3x longer than an equivalent cuBLAS call to sgeam doing the same thing. That would be understandable if I were doing something more complex, like matrix multiplication where I have to juggle shared memory, but the code below is just a simple array transformation.
let dmat = createRandomUniformMatrix 100 1000 1.0f 0.0f
let dmat2 = createRandomUniformMatrix 100 1000 1.0f 0.0f
let rmat = createEmptyMatrixLike dmat
let m = new DeviceUnaryTransformModule<float32> <@ fun x -> x*2.0f @>
#time
//4.85s/100k
for i=1 to 100000 do
    m.Apply(dmat, rmat) |> ignore
#time
#time
//1.8s/100k
for i=1 to 100000 do
    sgeam2 nT nT 2.0f dmat 0.0f dmat2 rmat |> ignore
#time
The kernel of the DeviceUnaryTransformModule transform module is the same as the one in the basic transform example; the only difference is that afterwards, instead of gathering the result to the host, it keeps the data on the device.
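For reference, the module itself is not listed here; the following is only a minimal sketch of what such a device-resident transform module might look like, modeled on the GPUModule pattern used in the answer's test code further below (the module name and the pointer-based Apply signature are assumptions, not the actual code behind DeviceUnaryTransformModule):

// Hypothetical sketch only, following the GPUModule pattern shown later in this post.
type UnaryTransformModuleSketch(target, op:Expr<float32 -> float32>) =
    inherit GPUModule(target)
    [<Kernel;ReflectedDefinition>]
    member this.Kernel (output:deviceptr<float32>) (input:deviceptr<float32>) (n:int) =
        // grid-stride loop: each thread processes elements start, start+stride, ...
        let start = blockIdx.x * blockDim.x + threadIdx.x
        let stride = gridDim.x * blockDim.x
        let mutable i = start
        while i < n do
            output.[i] <- __eval(op) input.[i]
            i <- i + stride
    member this.Apply(input:deviceptr<float32>, output:deviceptr<float32>, n:int) =
        // the result stays in device memory; nothing is copied back to the host here
        this.GPULaunch <@ this.Kernel @> (LaunchParam(64, 256)) output input n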
Also, Unbound's reduce performs really badly for me; in fact, there must be something wrong with the way I have been using it. It is roughly 20x slower than summing a matrix by calling sgeamv twice.
let makeReduce (op:Expr<'T -> 'T -> 'T>) =
    let compileReductionKernel (op:Expr<'T -> 'T -> 'T>) =
        worker.LoadProgram(
            DeviceReduceImpl.DeviceReduce(op, worker.Device.Arch, PlatformUtil.Instance.ProcessBitness).Template
            )
    let prog = compileReductionKernel op
    let runReduceProgram (sumProg : Program<DeviceReduceImpl.IDeviceReduceFactory<'A>>) (x: DeviceMemory<'A>) =
        sumProg.Entry.Create(blob, x.Length)
               .Reduce(None, x.Ptr, x.Length)
    let reduceProg (x: DeviceMemory<'T>) = runReduceProgram prog x
    reduceProg
let sumReduce: DeviceMemory<float32> -> float32 = makeReduce <@ fun (a:float32) b -> a + b @>
#time
//3.5s/10k
for i=1 to 10000 do
    sumReduce dmat.dArray |> ignore
#time
I have not tried comparing it with CUDA C++, but for simple things like this I think it should be on par with cuBLAS. I thought the optimization flags might be turned off, but then found out they are on by default. Are there any optimization tips I am missing here?
I think there are a few problems in your test code:
In your map module, you should pre-load the GPUModule. A GPUModule is JIT-compiled on its first launch, so your timing actually includes the GPU code compilation time;
In your map module, for the Alea code and the cublas code alike, you should synchronize the worker (synchronize the CUDA context). CUDA programming is asynchronous in style, so when you launch a kernel it returns immediately without waiting for the kernel to complete. If you do not synchronize the worker, you are actually measuring the kernel launch time, not the kernel execution time. The launch time from Alea GPU will be slower than from native C code, because it does some marshaling of the kernel arguments. There are also some other issues related to kernel launch time, which I will show you in the sample code below.
Your reduce test actually loads the reduce module every time! That means every time you reduce, you are measuring a time that includes the GPU compilation time! It is recommended that you keep instances of GPU modules or programs long-lived, since they represent compiled GPU code.
So, I made a test based on your usage. Here I first list the full test code:
#r @"packages\Alea.CUDA.2.1.2.3274\lib\net40\Alea.CUDA.dll"
#r @"packages\Alea.CUDA.IL.2.1.2.3274\lib\net40\Alea.CUDA.IL.dll"
#r @"packages\Alea.CUDA.Unbound.2.1.2.3274\lib\net40\Alea.CUDA.Unbound.dll"
#r "System.Configuration"
open System.IO
Alea.CUDA.Settings.Instance.Resource.AssemblyPath <- Path.Combine(@"packages\Alea.CUDA.2.1.2.3274", "private")
Alea.CUDA.Settings.Instance.Resource.Path <- Path.GetTempPath()
open Alea.CUDA
open Alea.CUDA.Utilities
open Alea.CUDA.CULib
open Alea.CUDA.Unbound
open Microsoft.FSharp.Quotations
type MapModule(target, op:Expr<float32 -> float32>) =
    inherit GPUModule(target)
    [<Kernel;ReflectedDefinition>]
    member this.Kernel (C:deviceptr<float32>) (A:deviceptr<float32>) (B:deviceptr<float32>) (n:int) =
        let start = blockIdx.x * blockDim.x + threadIdx.x
        let stride = gridDim.x * blockDim.x
        let mutable i = start
        while i < n do
            C.[i] <- __eval(op) A.[i] + __eval(op) B.[i]
            i <- i + stride
    member this.Apply(C:deviceptr<float32>, A:deviceptr<float32>, B:deviceptr<float32>, n:int) =
        let lp = LaunchParam(64, 256)
        this.GPULaunch <@ this.Kernel @> lp C A B n
let inline mapTemplate (op:Expr<'T -> 'T>) = cuda {
    let! kernel =
        <@ fun (C:deviceptr<'T>) (A:deviceptr<'T>) (B:deviceptr<'T>) (n:int) ->
            let start = blockIdx.x * blockDim.x + threadIdx.x
            let stride = gridDim.x * blockDim.x
            let mutable i = start
            while i < n do
                C.[i] <- (%op) A.[i] + (%op) B.[i]
                i <- i + stride @>
        |> Compiler.DefineKernel
    return Entry(fun program ->
        let worker = program.Worker
        let kernel = program.Apply kernel
        let lp = LaunchParam(64, 256)
        let run C A B n =
            kernel.Launch lp C A B n
        run ) }
let test1 (worker:Worker) m n sync iters =
    let n = m * n
    use m = new MapModule(GPUModuleTarget.Worker(worker), <@ fun x -> x * 2.0f @>)
    let rng = System.Random(42)
    use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use C = worker.Malloc<float32>(n)
    let timer = System.Diagnostics.Stopwatch.StartNew()
    for i = 1 to iters do
        m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
    if sync then worker.Synchronize()
    timer.Stop()
    printfn "%f ms / %d %s (no pre-load module)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")

let test2 (worker:Worker) m n sync iters =
    let n = m * n
    use m = new MapModule(GPUModuleTarget.Worker(worker), <@ fun x -> x * 2.0f @>)
    // we pre-load the module, this will JIT compile the GPU code
    m.GPUForceLoad()
    let rng = System.Random(42)
    use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use C = worker.Malloc<float32>(n)
    let timer = System.Diagnostics.Stopwatch.StartNew()
    for i = 1 to iters do
        m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
    if sync then worker.Synchronize()
    timer.Stop()
    printfn "%f ms / %d %s (pre-loaded module)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
let test3 (worker:Worker) m n sync iters =
    let n = m * n
    use m = new MapModule(GPUModuleTarget.Worker(worker), <@ fun x -> x * 2.0f @>)
    // we pre-load the module, this will JIT compile the GPU code
    m.GPUForceLoad()
    let rng = System.Random(42)
    use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use C = worker.Malloc<float32>(n)
    // since the worker is running in a background thread
    // each cuda api will switch to that thread
    // use eval() to avoid the many thread switching
    worker.Eval <| fun _ ->
        let timer = System.Diagnostics.Stopwatch.StartNew()
        for i = 1 to iters do
            m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
        if sync then worker.Synchronize()
        timer.Stop()
        printfn "%f ms / %d %s (pre-loaded module + worker.eval)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
let test4 (worker:Worker) m n sync iters =
    use program = worker.LoadProgram(mapTemplate <@ fun x -> x * 2.0f @>)
    let n = m * n
    let rng = System.Random(42)
    use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use C = worker.Malloc<float32>(n)
    let timer = System.Diagnostics.Stopwatch.StartNew()
    for i = 1 to iters do
        program.Run C.Ptr A.Ptr B.Ptr n
    if sync then worker.Synchronize()
    timer.Stop()
    printfn "%f ms / %d %s (template usage)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")

let test5 (worker:Worker) m n sync iters =
    use program = worker.LoadProgram(mapTemplate <@ fun x -> x * 2.0f @>)
    let n = m * n
    let rng = System.Random(42)
    use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use C = worker.Malloc<float32>(n)
    worker.Eval <| fun _ ->
        let timer = System.Diagnostics.Stopwatch.StartNew()
        for i = 1 to iters do
            program.Run C.Ptr A.Ptr B.Ptr n
        if sync then worker.Synchronize()
        timer.Stop()
        printfn "%f ms / %d %s (template usage + worker.Eval)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
let test6 (worker:Worker) m n sync iters =
    use cublas = new CUBLAS(worker)
    let rng = System.Random(42)
    use dmat1 = worker.Malloc(Array.init (m * n) (fun _ -> rng.NextDouble() |> float32))
    use dmat2 = worker.Malloc(Array.init (m * n) (fun _ -> rng.NextDouble() |> float32))
    use dmatr = worker.Malloc<float32>(m * n)
    let timer = System.Diagnostics.Stopwatch.StartNew()
    for i = 1 to iters do
        cublas.Sgeam(cublasOperation_t.CUBLAS_OP_N, cublasOperation_t.CUBLAS_OP_N, m, n, 2.0f, dmat1.Ptr, m, 2.0f, dmat2.Ptr, m, dmatr.Ptr, m)
    if sync then worker.Synchronize()
    timer.Stop()
    printfn "%f ms / %d %s (cublas)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")

let test7 (worker:Worker) m n sync iters =
    use cublas = new CUBLAS(worker)
    let rng = System.Random(42)
    use dmat1 = worker.Malloc(Array.init (m * n) (fun _ -> rng.NextDouble() |> float32))
    use dmat2 = worker.Malloc(Array.init (m * n) (fun _ -> rng.NextDouble() |> float32))
    use dmatr = worker.Malloc<float32>(m * n)
    worker.Eval <| fun _ ->
        let timer = System.Diagnostics.Stopwatch.StartNew()
        for i = 1 to iters do
            cublas.Sgeam(cublasOperation_t.CUBLAS_OP_N, cublasOperation_t.CUBLAS_OP_N, m, n, 2.0f, dmat1.Ptr, m, 2.0f, dmat2.Ptr, m, dmatr.Ptr, m)
        if sync then worker.Synchronize()
        timer.Stop()
        printfn "%f ms / %d %s (cublas + worker.eval)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
let test worker m n sync iters =
    test6 worker m n sync iters
    test7 worker m n sync iters
    test1 worker m n sync iters
    test2 worker m n sync iters
    test3 worker m n sync iters
    test4 worker m n sync iters
    test5 worker m n sync iters
let testReduce1 (worker:Worker) n iters =
    let rng = System.Random(42)
    use input = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use reduceModule = new DeviceReduceModule<float32>(GPUModuleTarget.Worker(worker), <@ (+) @>)
    // JIT compile and load GPU code for this module
    reduceModule.GPUForceLoad()
    // create a reducer which will allocate temp memory for maxNum=n
    let reduce = reduceModule.Create(n)
    let timer = System.Diagnostics.Stopwatch.StartNew()
    for i = 1 to 10000 do
        reduce.Reduce(input.Ptr, n) |> ignore
    timer.Stop()
    printfn "%f ms / %d (pre-load gpu code)" timer.Elapsed.TotalMilliseconds iters

let testReduce2 (worker:Worker) n iters =
    let rng = System.Random(42)
    use input = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use reduceModule = new DeviceReduceModule<float32>(GPUModuleTarget.Worker(worker), <@ (+) @>)
    // JIT compile and load GPU code for this module
    reduceModule.GPUForceLoad()
    // create a reducer which will allocate temp memory for maxNum=n
    let reduce = reduceModule.Create(n)
    worker.Eval <| fun _ ->
        let timer = System.Diagnostics.Stopwatch.StartNew()
        for i = 1 to 10000 do
            reduce.Reduce(input.Ptr, n) |> ignore
        timer.Stop()
        printfn "%f ms / %d (pre-load gpu code and avoid thread switching)" timer.Elapsed.TotalMilliseconds iters

let testReduce worker n iters =
    testReduce1 worker n iters
    testReduce2 worker n iters
let workerDefault = Worker.Default
let workerNoThread = Worker.CreateOnCurrentThread(Device.Default)
In Alea GPU, a worker represents a CUDA context. Currently we use a pattern where each GPU has one dedicated thread, with the CUDA context attached to that thread. We call this "worker with dedicated thread". This also means that every time you call a CUDA API, such as a kernel launch, we have to switch to the worker thread. If you are doing many kernel launches, it is recommended to use the Worker.Eval function to execute your code inside the worker thread, which avoids the thread switching. There is also an experimental feature to create a worker on the current thread, which avoids thread switching altogether, but we are still optimizing that usage. For details please refer here.
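As a minimal sketch of that pattern (reusing the names from test3 above): enter the worker thread once, issue all launches from inside it, and synchronize at the end.

// batch many launches inside a single Worker.Eval so each launch
// does not pay the cost of switching to the worker thread
worker.Eval <| fun _ ->
    for i = 1 to iters do
        m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
    worker.Synchronize()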
Now let's first run a test with the default worker, without synchronizing the worker (which means we are only comparing kernel launch times). The default worker is a worker with a dedicated thread, so you can see that it performs better when we use Worker.Eval. But overall, launching a kernel from .NET is slower than a native C kernel launch:
> test workerDefault 10000 10000 false 100;;
4.487300 ms / 100 nosync (cublas)
0.560600 ms / 100 nosync (cublas + worker.eval)
304.427900 ms / 100 nosync (no pre-load module)
18.517000 ms / 100 nosync (pre-loaded module)
12.579100 ms / 100 nosync (pre-loaded module + worker.eval)
27.023800 ms / 100 nosync (template usage)
16.007500 ms / 100 nosync (template usage + worker.Eval)
val it : unit = ()
> test workerDefault 10000 10000 false 100;;
3.288600 ms / 100 nosync (cublas)
0.647300 ms / 100 nosync (cublas + worker.eval)
29.129100 ms / 100 nosync (no pre-load module)
18.874700 ms / 100 nosync (pre-loaded module)
12.285000 ms / 100 nosync (pre-loaded module + worker.eval)
20.452300 ms / 100 nosync (template usage)
14.903500 ms / 100 nosync (template usage + worker.Eval)
val it : unit = ()
Also, you may have noticed that I ran this test twice: the first time, the test without the pre-loaded module took 304 ms, but the second time it took only 29 ms. The reason is that we use LLVM P/Invoke to compile the kernel, and those P/Invoke functions are lazy, so they do some initialization the first time you use them; after that, it becomes faster.
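One way to keep that one-time cost out of a measurement is a warm-up pass before starting the stopwatch; this is only a sketch based on the code above, not something the tests here do explicitly:

// warm-up sketch: JIT-compile the module and issue one launch before timing,
// so the first-use LLVM P/Invoke initialization is not included in the measurement
m.GPUForceLoad()
m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
worker.Synchronize()
let timer = System.Diagnostics.Stopwatch.StartNew()
// ... timed iterations go here ...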
Now we synchronize the worker, which actually measures the real kernel execution time, and now the numbers are similar. The kernel I created here is very simple, but it operates on both matrix A and matrix B:
> test workerDefault 10000 10000 true 100;;
843.695000 ms / 100 sync (cublas)
841.452400 ms / 100 sync (cublas + worker.eval)
919.244900 ms / 100 sync (no pre-load module)
912.348000 ms / 100 sync (pre-loaded module)
908.909000 ms / 100 sync (pre-loaded module + worker.eval)
914.834100 ms / 100 sync (template usage)
914.170100 ms / 100 sync (template usage + worker.Eval)
Now, if we test them on the thread-less worker, they are a little faster, since there is no thread switching:
> test workerNoThread 10000 10000 true 100;;
842.132100 ms / 100 sync (cublas)
841.627200 ms / 100 sync (cublas + worker.eval)
918.007800 ms / 100 sync (no pre-load module)
908.575900 ms / 100 sync (pre-loaded module)
908.770100 ms / 100 sync (pre-loaded module + worker.eval)
913.405300 ms / 100 sync (template usage)
913.942600 ms / 100 sync (template usage + worker.Eval)
Now the tests for reduce:
> testReduce workerDefault 10000000 100;;
7691.335300 ms / 100 (pre-load gpu code)
6448.782500 ms / 100 (pre-load gpu code and avoid thread switching)
val it : unit = ()
> testReduce workerNoThread 10000000 100;;
6467.105300 ms / 100 (pre-load gpu code)
6426.296900 ms / 100 (pre-load gpu code and avoid thread switching)
val it : unit = ()
Please note that in this reduce test, each reduce involves a memory copy (memcpyDtoH) to get the result from the device to the host. That memory-copy API call automatically synchronizes the worker, because the value would be meaningless if the kernel had not finished. So if you want to compare the performance with C code, you should also copy the result scalar from device to host there. Although it is just one CUDA API call, when you do many iterations (100 in this example) it accumulates some time.
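In code terms, a brief sketch of that point, reusing the reducer from the tests above:

// each call returns a host-side float32, so Alea must copy the result from
// device to host (memcpyDtoH), and that copy implicitly synchronizes the worker
let total : float32 = reduce.Reduce(input.Ptr, n)
// a fair native CUDA C comparison should therefore also copy its scalar result
// back to the host inside the timed loop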
Hope this answers your question.