Balancing several imbalanced classes of an image dataset

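I have a dataset with 12 classes in its base directory, and each class contains a different number of images. Because the class sizes are uneven, the overall accuracy suffers. Should I therefore apply data augmentation only to the classes that have fewer images?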

Number of images per class (dummy class names):

    [AAAA: 713
    ABCD: 274
    ACBD: 335
    ADBC: 576
    BBBB: 538
    BACD: 607
    BCAD: 253
    BDAD: 257
    CCCC: 463
    CABD: 309
    CBAD: 452
    CDAB: 762]
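
As a rough guide to how much augmentation each class would need, the sketch below computes the gap between each class and the largest one (762 images for CDAB). The dictionary simply copies the counts listed above; using the largest class as the target is an assumption, not a requirement.

```python
# Image counts per class, copied from the list above
class_counts = {
    "AAAA": 713, "ABCD": 274, "ACBD": 335, "ADBC": 576,
    "BBBB": 538, "BACD": 607, "BCAD": 253, "BDAD": 257,
    "CCCC": 463, "CABD": 309, "CBAD": 452, "CDAB": 762,
}

# Assumption: bring every class up to the size of the largest class
target = max(class_counts.values())

# Number of extra (augmented) images each class would need
deficits = {name: target - count for name, count in class_counts.items()}

# Print the classes that need the most extra images first
for name, extra in sorted(deficits.items(), key=lambda item: -item[1]):
    print(f"{name}: {extra} more images needed")
```

A smaller target (for example the median class size) would reduce how heavily the smallest classes rely on augmented copies.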

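So, should I apply data augmentation to increase the number of images in the smaller classes? I did try applying augmentation, but it does not produce any additional images. In addition, I want the augmented images to be generated alongside the original data, which means the input and output directories are the same.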

Augmentation code for one specific (individual) class:

from keras.preprocessing.image import ImageDataGenerator


datagen = ImageDataGenerator(
    rotation_range=45,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range = 0.2,
    zoom_range = 0.2, 
    horizontal_flip=True,
    fill_mode = 'reflect', cval = 125)

i = 0

for batch in datagen.flow_from_directory(directory = ('/content/dataset/ABCD'),
                                         batch_size = 317,
                                         target_size = (256, 256),
                                         color_mode = ('rgb'),
                                         save_to_dir = ('/content/dataset/ABCD'),
                                         save_prefix = ('aug'),
                                         save_format = ('png')):
  i += 1
  if i > 100:
    break

Output: Found 0 images belonging to 0 classes.
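
The `Found 0 images belonging to 0 classes` message is consistent with how `flow_from_directory` treats its `directory` argument: each subdirectory is taken as one class, and `/content/dataset/ABCD` holds image files but no subdirectories. The sketch below is a simplified, dependency-free imitation of that discovery step, run against a throwaway directory tree. The helper `discover_classes` and the extension list are hypothetical and not part of Keras.

```python
import os
import tempfile

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".bmp")

def discover_classes(directory):
    """Hypothetical helper: map each subdirectory (class) to its image count."""
    classes = {}
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        if os.path.isdir(path):
            images = [f for f in os.listdir(path)
                      if f.lower().endswith(IMAGE_EXTENSIONS)]
            classes[entry] = len(images)
    return classes

# Build a tiny stand-in for /content/dataset with a single class folder
root = tempfile.mkdtemp()
class_dir = os.path.join(root, "ABCD")
os.makedirs(class_dir)
for i in range(3):
    open(os.path.join(class_dir, f"img_{i}.png"), "wb").close()

leaf_result = discover_classes(class_dir)  # pointing at the class folder itself
parent_result = discover_classes(root)     # pointing at the parent folder

print(leaf_result)    # {}
print(parent_result)  # {'ABCD': 3}
```

Under this reading, one option is to pass the parent directory (`/content/dataset`) as `directory` and restrict the generator with `classes=['ABCD']`, while still writing the augmented files to `/content/dataset/ABCD` through `save_to_dir`.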

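As I mentioned, I use flow_from_dataframe, so you may want to create a CSV file for your dataset first. My idea is to repeat the samples of each label until every label reaches a fixed size, for example 762 samples per label. Here is my approach, using a dummy dataset: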

import numpy as np
import pandas as pd
from keras.preprocessing.image import ImageDataGenerator
import cv2

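# Create a dummy image so that flow_from_dataframe has a valid file to load later.
# PNG stores integer pixel values, so scale the random floats to 8-bit first.
cv2.imwrite('temp.png', (np.random.rand(3, 3) * 255).astype(np.uint8))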

labels = [] # Create some unbalanced dataset
for i in range(10):
    labels.append('a')

for i in range(5):
    labels.append('b')

for i in range(3):
    labels.append('c') 

# Create a dataframe
df = pd.DataFrame({'img_path':['./temp.png']*len(labels),'label':labels})

# print(df.head())

def balance_data(df, target_size=12):
    """
    Oversample every label with fewer than target_size rows up to target_size.
    Labels that already have at least target_size rows are kept unchanged.

        Example:
        Current size of label a: 10
        Target size: 23

        repeat, mod = divmod(target_size, current_size)
        2, 3 = divmod(23, 10)

        Target size = current size * repeat + mod

    The same step is applied to every label in the dataset.
    """
    balanced_groups = []

    for _, df_group in df.groupby('label'):
        df_label = df_group.sample(frac=1)  # shuffle the rows of this label
        current_size = len(df_label)

        if current_size >= target_size:
            # Already big enough: keep the rows as they are
            df_label_new = df_label
        else:
            # Repeat the whole group `repeat` times, then top up with
            # `mod` randomly chosen rows to reach target_size exactly
            repeat, mod = divmod(target_size, current_size)
            df_label_new = pd.concat([df_label] * repeat, ignore_index=True, axis=0)
            df_label_remainder = df_group.sample(n=mod)
            df_label_new = pd.concat([df_label_new, df_label_remainder], ignore_index=True, axis=0)

        balanced_groups.append(df_label_new)

    return pd.concat(balanced_groups, ignore_index=True, axis=0)

df_balanced = balance_data(df)
# print(df_balanced)

# A particular image will be transformed to its various versions within the augmentation step 
image_datagen = ImageDataGenerator(
    rotation_range=45,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range = 0.2,
    zoom_range = 0.2, 
    horizontal_flip=True,
    fill_mode = 'reflect', cval = 125)

image_generator = image_datagen.flow_from_dataframe(
            dataframe=df_balanced,
            x_col="img_path",
            y_col="label",
            class_mode="categorical",
            batch_size=4,
            shuffle=True
            )

# x,y=next(image_generator)
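
For the CSV file mentioned at the start of this answer, a minimal sketch using only the standard library could look like the following. It assumes the layout from the question (one subdirectory per class under a base directory). The helper name `write_dataset_csv`, the extension list, and the output file name are hypothetical; the column names match the `img_path` and `label` columns used in the dataframe above.

```python
import csv
import os
import tempfile

def write_dataset_csv(base_dir, csv_path, extensions=(".png", ".jpg", ".jpeg")):
    """Hypothetical helper: write one (img_path, label) row per image file."""
    rows = []
    for label in sorted(os.listdir(base_dir)):
        class_dir = os.path.join(base_dir, label)
        if not os.path.isdir(class_dir):
            continue  # skip stray files in the base directory
        for name in sorted(os.listdir(class_dir)):
            if name.lower().endswith(extensions):
                rows.append((os.path.join(class_dir, name), label))
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["img_path", "label"])
        writer.writerows(rows)
    return rows

# Demo on a throwaway tree standing in for the real dataset directory
base_dir = tempfile.mkdtemp()
for label, count in {"ABCD": 2, "CDAB": 3}.items():
    os.makedirs(os.path.join(base_dir, label))
    for i in range(count):
        open(os.path.join(base_dir, label, f"{label}_{i}.png"), "wb").close()

csv_path = os.path.join(base_dir, "dataset.csv")
rows = write_dataset_csv(base_dir, csv_path)
print(len(rows), "rows written to", csv_path)
```

The resulting file can be loaded with `pd.read_csv` and passed to `balance_data` before calling `flow_from_dataframe`.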

I hope the code is self-explanatory. Let me know if you need further help.