Why does a single 10x10x3 Conv2d take up 850 MB of GPU memory?

Question

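I am optimizing a model in PyTorch. If I run the following code, nvidia-smi shows that I am using 850MiB / 7979MiB of memory on my GPU. Why is that?

with torch.no_grad():
    A = nn.Conv2d(10,10,3).cuda()

I imagine there is some overhead or default allocation size specified somewhere, but I could not find any documentation for it. I do remember that TensorFlow has a setting to limit the amount of memory allocated.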

Answer 1

The convolution does not take up that much memory. You can verify this with torch.cuda.memory_allocated, which reports, in bytes, the memory occupied by all tensors:

import torch
import torch.nn as nn

torch.cuda.memory_allocated()  # => 0

A = nn.Conv2d(10, 10, 3).cuda()

torch.cuda.memory_allocated()  # => 4608

The convolution uses only 4608 bytes.
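As a rough check on where 4608 comes from, the arithmetic can be sketched in plain Python. This assumes float32 parameters and that the caching allocator rounds each small allocation up to a multiple of 512 bytes; the rounding is an implementation detail rather than a documented guarantee, and the helper name `round_up` is just for illustration.

```python
# Assumption: each allocation is rounded up to a multiple of 512 bytes.
BLOCK_BYTES = 512
BYTES_PER_FLOAT32 = 4


def round_up(nbytes, block=BLOCK_BYTES):
    """Round nbytes up to the next multiple of block."""
    return -(-nbytes // block) * block


# Conv2d(10, 10, 3): weight has shape (10, 10, 3, 3), bias has shape (10,).
weight_bytes = 10 * 10 * 3 * 3 * BYTES_PER_FLOAT32
bias_bytes = 10 * BYTES_PER_FLOAT32

raw_total = weight_bytes + bias_bytes
rounded_total = round_up(weight_bytes) + round_up(bias_bytes)

print(raw_total)      # 3640
print(rounded_total)  # 4608
```

Under these assumptions the parameters themselves need 3640 bytes, and the per-allocation rounding accounts for the rest of the 4608.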

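nvidia-smi shows higher memory usage for two separate reasons.

Caching memory allocator

PyTorch uses a caching memory allocator, which means it holds on to more memory than is strictly necessary in order to avoid device synchronizations.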

From PyTorch CUDA Semantics - Memory Management:

PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as if used in nvidia-smi. You can use memory_allocated() and max_memory_allocated() to monitor memory occupied by tensors, and use memory_reserved() and max_memory_reserved() to monitor the total amount of memory managed by the caching allocator.
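The two counters named in the quote can be compared directly. The following is a minimal sketch, assuming a PyTorch build recent enough to provide torch.cuda.memory_reserved; the helper `cuda_memory_report` is a hypothetical name, and the code does nothing useful on a machine without a CUDA device.

```python
import torch
import torch.nn as nn


def cuda_memory_report():
    """Return (allocated, reserved) in bytes, or None without a CUDA device."""
    if not torch.cuda.is_available():
        return None
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()


if torch.cuda.is_available():
    conv = nn.Conv2d(10, 10, 3).cuda()
    allocated, reserved = cuda_memory_report()
    # allocated: bytes occupied by live tensors.
    # reserved: bytes held by the caching allocator, including unused cache.
    print(f"allocated={allocated} reserved={reserved}")

    # Dropping the module and emptying the cache hands unused cached blocks
    # back to the driver; the CUDA context itself stays resident.
    del conv
    torch.cuda.empty_cache()
    print(cuda_memory_report())
```

The reserved figure is normally at least as large as the allocated one, and the gap between them is the cached memory that nvidia-smi still counts as used.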

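CUDA context

When CUDA is first initialized, it creates a context that manages control of the device. Most notably, the context contains the code for all the different CUDA kernels, of which PyTorch has a large number. The size of the context also varies between GPU architectures. Some details are discussed in Issue #20532 - Couple hundred MB are taken just by initializing cuda.

The memory you are observing is almost entirely attributable to the CUDA context.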
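One way to get a rough feel for the size of the context is to subtract what the caching allocator holds from what the driver reports as used. This is only a sketch: it assumes a PyTorch version that provides torch.cuda.mem_get_info, the function name `estimate_non_allocator_bytes` is hypothetical, and the result also includes memory used by any other process on the same GPU.

```python
import torch


def estimate_non_allocator_bytes(device=0):
    """Bytes in use on the device that PyTorch's caching allocator does not hold.

    Returns None without a CUDA device. The figure includes the CUDA context,
    but also anything other processes have allocated on the same GPU.
    """
    if not torch.cuda.is_available():
        return None
    # Touch the device so the CUDA context definitely exists.
    torch.zeros(1, device=torch.device("cuda", device))
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    used_bytes = total_bytes - free_bytes
    return used_bytes - torch.cuda.memory_reserved(device)


estimate = estimate_non_allocator_bytes()
if estimate is not None:
    print(f"outside the caching allocator: {estimate / 2**20:.0f} MiB")
```

On an otherwise idle GPU this number should land close to what nvidia-smi reports for the process, which is consistent with the context dominating the 850 MiB.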
