未从混洗数据集中选择 Keras ImageDataGenerator 验证拆分

Keras ImageDataGenerator validation split not selected from shuffled dataset

如何将我的图像数据集随机拆分为训练和验证数据集?更具体地说,Keras ImageDataGenerator 函数中的 validation_split 参数不是将我的图像随机拆分为训练和验证,而是从未改组的数据集中切分验证样本。

在 Keras 的 ImageDataGenerator 中指定 validation_split 参数时,会在数据打乱之前执行拆分,以便仅获取最后的 x 个样本。问题是最后一个被选作验证的数据样本可能不代表训练数据,因此它可能会失败。当您的图像数据存储在每个子文件夹都由 class 命名的公共目录中时,这是一个特别常见的死胡同。已在多个帖子中注明:

As you mentioned, Keras simply takes the last x samples of the dataset, so if you want to keep using it, you need to shuffle your dataset in advance.

The training accuracy is very high, while the validation accuracy is very low?

please check if you have shuffled the data before training. Because the validation splitting in keras is performed before shuffle, so maybe you have chosen an unbalanced dataset as your validation set, thus you got the low accuracy.

Does 'validation split' randomly choose validation sample?

The validation data is picked as the last 10% (for instance, if validation_split=0.9) of the input. The training data (the remainder) can optionally be shuffled at every epoch (shuffle argument in fit). That doesn't affect the validation data, obviously, it has to be the same set from epoch to epoch.

指向 sklearn train_test_split() 作为解决方案,但我想提出一个不同的解决方案,以保持 keras 工作流程的一致性。

使用 split-folders 包,您可以将主数据目录随机拆分为训练、验证和测试(或只是训练和验证)目录。 class 特定的子文件夹会自动复制。

输入文件夹应具有以下格式:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

为了给你这个:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/            # optional
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...

来自文档:

import split_folders

# Split with a ratio.
# To only split into training and validation set, set a tuple to `ratio`, i.e, `(.8, .2)`.
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1)) # default values

# Split val/test with a fixed number of items e.g. 100 for each set.
# To only split into training and validation set, use a single number to `fixed`, i.e., `10`.
split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False) # default values

通过这种新的文件夹排列,您可以轻松地使用 keras 数据生成器将您的数据划分为训练和验证,并最终训练您的模型。

import tensorflow as tf
import split_folders
import os

main_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/Data'
output_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/output'

split_folders.ratio(main_dir, output=output_dir, seed=1337, ratio=(.7, .3))

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./224)

train_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'train'),
                                                    class_mode='categorical',
                                                    batch_size=32,
                                                    target_size=(224,224),
                                                    shuffle=True)

validation_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'val'),
                                                        target_size=(224, 224),
                                                        batch_size=32,
                                                        class_mode='categorical',
                                                        shuffle=True) # set as validation data

base_model = tf.keras.applications.ResNet50V2(
    input_shape=IMG_SHAPE,
    include_top=False,
    weights=None)

maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
prediction_layer = tf.keras.layers.Dense(4, activation='softmax')

model = tf.keras.Sequential([
    base_model,
    maxpool_layer,
    prediction_layer
])

opt = tf.keras.optimizers.Adam(lr=0.004)
model.compile(optimizer=opt,
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

model.fit(
    train_generator,
    steps_per_epoch = train_generator.samples // 32,
    validation_data = validation_generator,
    validation_steps = validation_generator.samples // 32,
    epochs = 20)