Running out of memory with PyTorch
I'm trying to train a model for audio classification using Hugging Face's wav2vec. I keep getting this error:
The following columns in the training set don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: name, emotion, path.
***** Running training *****
Num examples = 2708
Num Epochs = 1
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 2
Total optimization steps = 42
[ 2/42 : < :, Epoch 0.02/1]
Step Training Loss Validation Loss
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "<ipython-input-81-dd9fe3ea0f13>", line 77, in forward
return_dict=return_dict,
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1073, in forward
return_dict=return_dict,
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 732, in forward
hidden_states, attention_mask=attention_mask, output_attentions=output_attentions
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 574, in forward
hidden_states = hidden_states + self.feed_forward(self.final_layer_norm(hidden_states))
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 510, in forward
hidden_states = self.intermediate_act_fn(hidden_states)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/functional.py", line 1555, in gelu
return torch._C._nn.gelu(input)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 11.44 MiB free; 10.68 GiB reserved in total by PyTorch)
I'm on an AWS Ubuntu Deep Learning AMI EC2 instance.
I've been researching this. So far I've tried:
- Reducing the batch size (I want 4, but I've gone down to 1 and the error doesn't change; the relevant TrainingArguments are sketched after this list)
- Adding:
import gc
gc.collect()
torch.cuda.empty_cache()
- Removing every wav file longer than 6 seconds from my dataset
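For reference, the batch-size and accumulation settings live in the notebook's TrainingArguments; reconstructed roughly from the log above (the exact argument list in the notebook may differ), they look like this:

from transformers import TrainingArguments

# Rough reconstruction from the training log: per-device batch size 4,
# gradient accumulation 2, eval batch size 32, checkpoints under trainingArgs/.
training_args = TrainingArguments(
    output_dir="trainingArgs",
    num_train_epochs=1,
    per_device_train_batch_size=4,   # dropping this to 1 did not change the OOM
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    eval_steps=10,
    save_steps=10,
)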
Is there anything else I can do? I'm on a p2.8xlarge with 105 GiB of storage mounted for the dataset.
Running torch.cuda.memory_summary(device=None, abbreviated=False) gives me:
|===========================================================================|
|                 PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 3           |        cudaMalloc retries: 4         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    7550 MB |   10852 MB |  209624 MB |  202073 MB |
|   from large pool     |    7544 MB |   10781 MB |  209325 MB |  201780 MB |
|   from small pool     |       5 MB |      87 MB |     298 MB |     293 MB |
|---------------------------------------------------------------------------|
| Active memory         |    7550 MB |   10852 MB |  209624 MB |  202073 MB |
|   from large pool     |    7544 MB |   10781 MB |  209325 MB |  201780 MB |
|   from small pool     |       5 MB |      87 MB |     298 MB |     293 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   10936 MB |   10960 MB |   63236 MB |   52300 MB |
|   from large pool     |   10928 MB |   10954 MB |   63124 MB |   52196 MB |
|   from small pool     |       8 MB |      98 MB |     112 MB |     104 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |  443755 KB |    1309 MB |  155426 MB |  154992 MB |
|   from large pool     |  443551 KB |    1306 MB |  155081 MB |  154648 MB |
|   from small pool     |     204 KB |      12 MB |     344 MB |     344 MB |
|---------------------------------------------------------------------------|
| Allocations           |       1940 |       2622 |      32288 |      30348 |
|   from large pool     |       1036 |       1618 |      21855 |      20819 |
|   from small pool     |        904 |       1203 |      10433 |       9529 |
|---------------------------------------------------------------------------|
| Active allocs         |       1940 |       2622 |      32288 |      30348 |
|   from large pool     |       1036 |       1618 |      21855 |      20819 |
|   from small pool     |        904 |       1203 |      10433 |       9529 |
|---------------------------------------------------------------------------|
| GPU reserved segments |        495 |        495 |       2169 |       1674 |
|   from large pool     |        491 |        491 |       2113 |       1622 |
|   from small pool     |          4 |         49 |         56 |         52 |
|---------------------------------------------------------------------------|
| Non-releasable allocs |        179 |        335 |      15998 |      15819 |
|   from large pool     |        165 |        272 |      12420 |      12255 |
|   from small pool     |         14 |         63 |       3578 |       3564 |
|===========================================================================|
After reducing the data to only inputs under 2 seconds long, it trains further but still errors out:
The following columns in the training set don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: path, emotion, name.
***** Running training *****
Num examples = 1411
Num Epochs = 1
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 2
Total optimization steps = 22
/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(self, other)
/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
[11/22 01:12 < 01:28, 0.12 it/s, Epoch 0.44/1]
Step Training Loss Validation Loss Accuracy
10 2.428100 2.257138 0.300283
The following columns in the evaluation set don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: path, emotion, name.
***** Running Evaluation *****
Num examples = 353
Batch size = 32
Saving model checkpoint to trainingArgs/checkpoint-10
Configuration saved in trainingArgs/checkpoint-10/config.json
Model weights saved in trainingArgs/checkpoint-10/pytorch_model.bin
Configuration saved in trainingArgs/checkpoint-10/preprocessor_config.json
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
378 with _open_zipfile_writer(opened_file) as opened_zipfile:
--> 379 _save(obj, opened_zipfile, pickle_module, pickle_protocol)
380 return
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in _save(obj, zip_file, pickle_module, pickle_protocol)
498 num_bytes = storage.size() * storage.element_size()
--> 499 zip_file.write_record(name, storage.data_ptr(), num_bytes)
500
OSError: [Errno 28] No space left on device
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-25-3435b262f1ae> in <module>
----> 1 trainer.train()
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1334 self.control = self.callback_handler.on_step_end(args, self.state, self.control)
1335
-> 1336 self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
1337 else:
1338 self.control = self.callback_handler.on_substep_end(args, self.state, self.control)
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval)
1441
1442 if self.control.should_save:
-> 1443 self._save_checkpoint(model, trial, metrics=metrics)
1444 self.control = self.callback_handler.on_save(self.args, self.state, self.control)
1445
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in _save_checkpoint(self, model, trial, metrics)
1531 elif self.args.should_save and not self.deepspeed:
1532 # deepspeed.save_checkpoint above saves model/optim/sched
-> 1533 torch.save(self.optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
1534 with warnings.catch_warnings(record=True) as caught_warnings:
1535 torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
378 with _open_zipfile_writer(opened_file) as opened_zipfile:
379 _save(obj, opened_zipfile, pickle_module, pickle_protocol)
--> 380 return
381 _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
382
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in __exit__(self, *args)
257
258 def __exit__(self, *args) -> None:
--> 259 self.file_like.write_end_of_file()
260 self.buffer.flush()
261
RuntimeError: [enforce fail at inline_container.cc:298] . unexpected pos 1849920000 vs 1849919888
When I run !free in the notebook, I get:
The history saving thread hit an unexpected error (OperationalError('database or disk is full')).History will not be written to the database.
total used free shared buff/cache available
Mem: 503392908 6223452 478499292 346492 18670164 492641984
Swap: 0 0 0
For the training code, I'm basically running this Colab notebook as an example:
https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb#scrollTo=6M8bNvLLJnG1
All I'm changing is the data/labels passed in, which I've deliberately arranged into the same directory structure the tutorial notebook uses. For some reason the tutorial notebook runs fine, even though my data is comparable in size and number of classes.
You are probably using PyTorch's `DataParallel` or `DistributedDataParallel` framework:
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = Model(input_size, output_size)  # Model/input_size/output_size come from your own code
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0: a batch of [30, xxx] is split into [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)
model.to(device)
With this approach the model is replicated on every device (GPU) and the data is distributed across the devices:
DataParallel splits your data automatically and sends job orders to
multiple models on several GPUs. After each model finishes their job,
DataParallel collects and merges the results before returning it to
you.
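A minimal usage sketch, reusing the placeholder names from the snippet above (dummy random data; the batch lands on the primary device and DataParallel handles the scattering and gathering):

from torch.utils.data import DataLoader, TensorDataset

# Dummy data just to illustrate the flow; replace with your real dataset.
dataset = TensorDataset(torch.randn(30, input_size))

for (batch,) in DataLoader(dataset, batch_size=30):
    inputs = batch.to(device)   # batch goes to the primary device (cuda:0)
    outputs = model(inputs)     # DataParallel scatters it across GPUs and gathers the outputs
    print("Outside: input size", inputs.size(), "output size", outputs.size())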
More examples here: https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html.
If the model does not fit into the memory of a single GPU, you should use a model-parallel approach instead. Starting from your existing model, you specify which layer lives on which GPU with `.to('cuda:0')`, `.to('cuda:1')`, and so on:
import torch.nn as nn
from torchvision.models.resnet import ResNet, Bottleneck

num_classes = 1000  # set this to the number of classes in your task

class ModelParallelResNet50(ResNet):
    def __init__(self, *args, **kwargs):
        super(ModelParallelResNet50, self).__init__(
            Bottleneck, [3, 4, 6, 3], num_classes=num_classes, *args, **kwargs)

        # First half of the network lives on the first GPU ...
        self.seq1 = nn.Sequential(
            self.conv1,
            self.bn1,
            self.relu,
            self.maxpool,
            self.layer1,
            self.layer2
        ).to('cuda:0')

        # ... and the second half on the second GPU.
        self.seq2 = nn.Sequential(
            self.layer3,
            self.layer4,
            self.avgpool,
        ).to('cuda:1')

        self.fc.to('cuda:1')

    def forward(self, x):
        # Activations are moved from cuda:0 to cuda:1 between the two halves.
        x = self.seq2(self.seq1(x).to('cuda:1'))
        return self.fc(x.view(x.size(0), -1))
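The training step then only needs one extra detail: the labels have to be moved to the device that holds the outputs (cuda:1 here). A rough sketch with dummy data (loss function and optimizer are placeholders):

import torch
import torch.optim as optim

model = ModelParallelResNet50()
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

inputs = torch.randn(20, 3, 224, 224)           # dummy image batch
labels = torch.randint(0, num_classes, (20,))   # dummy targets

optimizer.zero_grad()
outputs = model(inputs.to('cuda:0'))             # inputs enter the network on cuda:0
# outputs live on cuda:1, so the labels must be moved there before computing the loss
loss = loss_fn(outputs, labels.to(outputs.device))
loss.backward()
optimizer.step()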
Since this can cost you performance (only one GPU is busy at a time), a pipelining approach may be used, where the input data is further chunked into micro-batches that run in parallel on the different devices:
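A minimal sketch of that idea, following the pattern of the PyTorch model-parallel tutorial (the split_size of 20 is arbitrary and worth tuning):

import torch

class PipelineParallelResNet50(ModelParallelResNet50):
    def __init__(self, split_size=20, *args, **kwargs):
        super(PipelineParallelResNet50, self).__init__(*args, **kwargs)
        self.split_size = split_size

    def forward(self, x):
        # Split the batch into micro-batches so both GPUs can work at the same time.
        splits = iter(x.split(self.split_size, dim=0))
        s_next = next(splits)
        s_prev = self.seq1(s_next).to('cuda:1')
        ret = []

        for s_next in splits:
            # A. s_prev runs through the second half on cuda:1
            s_prev = self.seq2(s_prev)
            ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))

            # B. s_next runs through the first half on cuda:0, concurrently with A
            s_prev = self.seq1(s_next).to('cuda:1')

        s_prev = self.seq2(s_prev)
        ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))

        return torch.cat(ret)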