Why does Torch use ~700mb of GPU memory when predicting with a 1.5mb network

I am new to Torch/CUDA, and I am trying to test the small (~1.5 MB) binary network from https://github.com/1adrianb/binary-face-alignment, but I keep running into 'out of memory' errors.

I am using a relatively weak GPU (NVIDIA Quadro K600) with about 900 MB of video memory, on Ubuntu 16.04 with CUDA 10.0 and cuDNN 5.1. So I am not too concerned about performance, but I thought I should at least be able to run a small network for prediction, one image at a time (especially one supposedly aimed at those "with Limited Resources").

I managed to run the code in headless mode and checked that memory consumption is around 700 MB, which would explain why it fails immediately when I have an X server running that takes up about 250 MB of GPU memory.
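One way to check this from inside Torch (a rough sketch, assuming cutorch is available and the model was saved with CUDA tensors) is to query the free and total device memory around the model load:

    require 'torch'
    require 'cutorch'

    -- cutorch.getMemoryUsage returns free and total memory (in bytes)
    -- for the given device; compare the values before and after loading.
    local dev = cutorch.getDevice()
    local freeBefore, total = cutorch.getMemoryUsage(dev)
    local model = torch.load('models/facealignment_binary_aflw.t7')
    local freeAfter = cutorch.getMemoryUsage(dev)
    print(string.format('model load used ~%.1f MB of %.1f MB total',
                        (freeBefore - freeAfter) / 2^20, total / 2^20))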

I also added some logging to see how far through main.lua I get, and it is the call output:copy(model:forward(img)) on the very first image that runs out of memory.

For reference, here is the main.lua code up to the crash:

    require 'torch'
    require 'nn'
    require 'cudnn'
    require 'paths'

    require 'bnn'
    require 'optim'

    require 'gnuplot'
    require 'image'
    require 'xlua'
    local utils = require 'utils'
    local opts = require('opts')(arg)

    print("Starting heap tracking")
    torch.setheaptracking(true)

    torch.setdefaulttensortype('torch.FloatTensor')
    torch.setnumthreads(1)
    -- torch.

    local model
    if opts.dataset == 'AFLWPIFA' then
        print('Not available for the moment. Support will be added soon')
        os.exit()
        model = torch.load('models/facealignment_binary_pifa.t7')
    else
        print("Loading model")
        model = torch.load('models/facealignment_binary_aflw.t7')
    end
    model:evaluate()

    local fileLists = utils.getFileList(opts)
    local predictions = {}
    local noPoints = 68
    if opts.dataset == 'AFLWPIFA' then noPoints = 34; end
    local output = torch.CudaTensor(1,noPoints,64,64)
    for i = 1, #fileLists do

        local img = image.load(fileLists[i].image)
        local originalSize = img:size()

        img = utils.crop(img, fileLists[i].center, fileLists[i].scale, 256)
        img = img:cuda():view(1,3,256,256)
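        -- this is the call that runs out of memory on the very first image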
        output:copy(model:forward(img))

So I have two main questions:

  1. What tools are there to debug memory usage in Torch?
  2. What could be causing the memory to blow up like this?

It can't just be the network and the image loaded onto the GPU. My best guess is that it has something to do with the LoadFileLists function, but I don't know enough about Torch or Lua to get any further from there. Other answers indicate that showing how much memory a variable takes up is indeed not supported.

What usually consumes most of the memory are the activation maps (and the gradients, when training). I am not familiar with this particular model and implementation, but I would say that you are using a "fake" binary network; by fake I mean that they still use floating-point numbers to represent binary values, since most users will run their code on GPUs that do not fully support genuinely binary operations. The authors even write in Section 5:

Performance. In theory, by replacing all floating-point multiplications with bitwise XOR and making use of the SWAR (Single instruction, multiple data within a register) [5], [6], the number of operations can be reduced up to 32x when compared against the multiplication-based convolution. However, in our tests, we observed speedups of up to 3.5x, when compared against cuBLAS, for matrix multiplications, a result being in accordance with those reported in [6]. We note that we did not conduct experiments on CPUs. However, given the fact that we used the same method for binarization as in [5], similar improvements in terms of speed, of the order of 58x, are to be expected: as the real-valued network takes 0.67 seconds to do a forward pass on a i7-3820 using a single core, a speedup close to x58 will allow the system to run in real-time. In terms of memory compression, by removing the biases, which have minimum impact (or no impact at all) on performance, and by grouping and storing every 32 weights in one variable, we can achieve a compression rate of 39x when compared against the single precision counterpart of Torch.

In this case, a small model (with respect to the number of parameters, or the model size in MiB) does not necessarily mean a low memory footprint. It is likely that most of that memory is being used to store the activation maps in single or double precision.
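If you want to confirm that, a rough sketch along these lines (not from the repository; it assumes every nn/cudnn module caches its activations in module.output as 4-byte floats) can estimate how much memory the activation maps occupy after a single forward pass:

    -- Walk all modules after one forward pass and sum the elements of each
    -- cached output tensor; multiply by 4 bytes for single precision.
    local function activationMemoryMB(net)
        local bytes = 0
        for _, m in ipairs(net:listModules()) do
            if torch.isTensor(m.output) then
                bytes = bytes + m.output:nElement() * 4
            end
        end
        return bytes / 2^20
    end

    model:forward(img)  -- img prepared exactly as in main.lua above
    print(string.format('activation maps: ~%.1f MB', activationMemoryMB(model)))

Keep in mind that cuDNN may also allocate sizeable workspace buffers for its convolution algorithms, which would not show up in this estimate.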