Classification with PyTorch is much slower than Tensorflow: 42min vs. 11min

I have been a Tensorflow user and am just starting to use PyTorch. As a trial, I implemented a simple classification task with both libraries.
However, PyTorch is much slower than Tensorflow: PyTorch takes 42 minutes while TensorFlow takes 11 minutes. I followed the PyTorch official tutorial and made only minimal changes.

Could anyone share some advice on this issue?

Here is a summary of what I tried.

environment: Colab Pro+
dataset: Cifar10
classifier: VGG16
optimizer: Adam
loss: crossentropy
batch size: 32

PyTorch
Code:

import torch, torchvision
from torch import nn
from torchvision import transforms, models
from tqdm import tqdm
import time, copy

trans = transforms.Compose([transforms.Resize((224, 224)),
                            transforms.ToTensor(),])

data = {phase: torchvision.datasets.CIFAR10('./', train = (phase=='train'),  transform=trans, download=True) for phase in ['train', 'test']}
dataloaders = {phase: torch.utils.data.DataLoader(data[phase], batch_size=32, shuffle=True) for phase in ['train', 'test']}

def train_model(model, criterion, optimizer, dataloaders, device, num_epochs=5):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'test']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in tqdm(iter(dataloaders[phase])):
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

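            # Note: len(dataloaders[phase]) is the number of batches, not samples, so the
            # values below are roughly batch_size times the per-sample loss/accuracy
            # (which is why Acc in the logs can exceed 1).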
            epoch_loss = running_loss / len(dataloaders[phase])
            epoch_acc = running_corrects.double() / len(dataloaders[phase])

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'test' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = models.vgg16(pretrained=False)
model = model.to(device)

model = train_model(model=model,
                    criterion=nn.CrossEntropyLoss(), 
                    optimizer=torch.optim.Adam(model.parameters(), lr=0.001),
                    dataloaders=dataloaders,
                    device=device,
                    )

Result:

Epoch 0/4
----------
  0%|          | 0/1563 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
100%|██████████| 1563/1563 [07:50<00:00,  3.32it/s]
train Loss: 75.5199 Acc: 3.2809
100%|██████████| 313/313 [00:38<00:00,  8.11it/s]
test Loss: 73.7274 Acc: 3.1949

Epoch 1/4
----------
100%|██████████| 1563/1563 [07:50<00:00,  3.33it/s]
train Loss: 73.8162 Acc: 3.2514
100%|██████████| 313/313 [00:38<00:00,  8.13it/s]
test Loss: 73.6114 Acc: 3.1949

Epoch 2/4
----------
100%|██████████| 1563/1563 [07:49<00:00,  3.33it/s]
train Loss: 73.7741 Acc: 3.1369
100%|██████████| 313/313 [00:38<00:00,  8.11it/s]
test Loss: 73.5873 Acc: 3.1949

Epoch 3/4
----------
100%|██████████| 1563/1563 [07:49<00:00,  3.33it/s]
train Loss: 73.7493 Acc: 3.1331
100%|██████████| 313/313 [00:38<00:00,  8.12it/s]
test Loss: 73.6191 Acc: 3.1949

Epoch 4/4
----------
100%|██████████| 1563/1563 [07:49<00:00,  3.33it/s]
train Loss: 73.7289 Acc: 3.1939
100%|██████████| 313/313 [00:38<00:00,  8.13it/s]
test Loss: 73.5955 Acc: 3.1949

Training complete in 42m 22s
Best val Acc: 3.194888

Tensorflow
Code:

import tensorflow_datasets as tfds
from tensorflow.keras import applications, models
import tensorflow as tf
import time

ds_test, ds_train = tfds.load('cifar10', split=['test', 'train'])

def resize(ip):
    image = ip['image']
    label = ip['label']
    image = tf.image.resize(image, (224, 224))
    image = tf.expand_dims(image,0)
    label = tf.one_hot(label,10)
    label = tf.expand_dims(label,0)
    return (image, label)

ds_train_ = ds_train.map(resize)
ds_test_ = ds_test.map(resize)


model = applications.vgg16.VGG16(input_shape = (224, 224, 3), weights=None, classes=10)
model.compile(optimizer='adam', loss = 'categorical_crossentropy', metrics= ['accuracy'])

batch_size = 32
since = time.time()
history = model.fit(ds_train_,
                    batch_size = batch_size,
                    steps_per_epoch = len(ds_train)//batch_size,
                    epochs = 5,
                    validation_steps = len(ds_test),
                    validation_data = ds_test_,
                    shuffle = True,)
time_elapsed = time.time() - since
print('Training complete in {:.0f}m {:.0f}s'.format( time_elapsed // 60, time_elapsed % 60 ))

Result:

Epoch 1/5
1562/1562 [==============================] - 125s 69ms/step - loss: 36.9022 - accuracy: 0.1069 - val_loss: 2.3031 - val_accuracy: 0.1000
Epoch 2/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3031 - accuracy: 0.1005 - val_loss: 2.3033 - val_accuracy: 0.1000
Epoch 3/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3035 - accuracy: 0.1069 - val_loss: 2.3031 - val_accuracy: 0.1000
Epoch 4/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3038 - accuracy: 0.1024 - val_loss: 2.3030 - val_accuracy: 0.1000
Epoch 5/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3028 - accuracy: 0.1024 - val_loss: 2.3033 - val_accuracy: 0.1000
Training complete in 11m 23s

This is because in your tensorflow code, the data pipeline feeds the model a batch of 1 image at each step rather than a batch of 32 images.

The batch_size passed to model.fit does not actually control the batch size when the data comes as a tf.data Dataset. The logs show a seemingly correct number of steps per epoch only because you pass steps_per_epoch to model.fit.
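You can make the problem visible by inspecting the element spec of the mapped dataset. The snippet below is a minimal sketch of such a check (it reuses the resize logic from the question, trimmed to the two fields it actually uses); because of the expand_dims calls, every element already carries a leading axis of size 1, and Keras consumes each dataset element as one whole batch:

import tensorflow as tf
import tensorflow_datasets as tfds

ds_train = tfds.load('cifar10', split='train')

def resize(ip):
    # Same mapping as in the question: expand_dims adds a leading axis of
    # size 1, so every element becomes a "batch" of a single image.
    image = tf.expand_dims(tf.image.resize(ip['image'], (224, 224)), 0)
    label = tf.expand_dims(tf.one_hot(ip['label'], 10), 0)
    return (image, label)

print(ds_train.map(resize).element_spec)
# Expected:
# (TensorSpec(shape=(1, 224, 224, 3), dtype=tf.float32, name=None),
#  TensorSpec(shape=(1, 10), dtype=tf.float32, name=None))

Since each step consumes exactly one element, steps_per_epoch = 1562 means an "epoch" only touches 1562 of the 50000 training images, which is also why the original tensorflow run looked so much faster.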

To set the batch size correctly:

ds_test, ds_train = tfds.load('cifar10', split=['test', 'train'])

def resize(ip):
    image = ip['image']
    label = ip['label']
    image = tf.image.resize(image, (224, 224))
    label = tf.one_hot(label,10)
    return (image, label)

train_size=len(ds_train)
test_size=len(ds_test)
ds_train_ = ds_train.shuffle(train_size).batch(32).map(resize)
ds_test_ = ds_test.shuffle(test_size).batch(32).map(resize)

And the model.fit call:

history = model.fit(ds_train_,
                    epochs = 1,
                    validation_data = ds_test_)
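As a quick sanity check (my own addition, not part of the original fix), the element spec of the rebuilt pipeline should now show a real batch dimension, reported as None because the last batch can be smaller than 32:

print(ds_train_.element_spec)
# Expected:
# (TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name=None),
#  TensorSpec(shape=(None, 10), dtype=tf.float32, name=None))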

After fixing this issue, tensorflow achieves speed performance similar to pytorch's. On my machine, pytorch takes about 27 minutes per epoch while tensorflow takes about 24 minutes per epoch.

According to NVIDIA's benchmarks, pytorch and tensorflow show similar speed performance in most popular deep learning applications with real-world datasets and problem sizes. (Reference: https://developer.nvidia.com/deep-learning-performance-training-inference)