Keras preprocessing: number of samples

I have been using the Keras preprocessing function keras.preprocessing.image_dataset_from_directory().

This is how I build my training and validation datasets:

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    train_path,
    label_mode = 'categorical', # one-hot encoded labels, for multiclass classification
    validation_split = 0.2,     # fraction of the data reserved for validation
    subset = "training",        # this subset is used for training
    seed = 1337,                # fixed seed so the split is reproducible
    image_size = img_size,      # size the images are resized to
    batch_size = batch_size,    # number of images per batch
)

valid_ds = tf.keras.preprocessing.image_dataset_from_directory(
    train_path,
    label_mode ='categorical',
    validation_split = 0.2,
    subset = "validation",      #this subset is used for validation
    seed = 1337,
    image_size = img_size,
    batch_size = batch_size,
)

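I would like to know whether there is a way to collect the same number of samples for each class.

Below you can see the number of sample images for each class in the target directory: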

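To recap what was said in the comments: the question is about an imbalanced dataset, and training a model on an imbalanced dataset without any countermeasure will most likely produce a model that is biased toward the majority classes.

To address this, the Keras Model.fit() method has a parameter named class_weight. I quote the description given in the documentation: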

class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.

To compute your class weights, you can use the following formula and work them out by hand, for each class j:

w_j = total_number_samples / (n_classes * n_samples_j)

Example, with three classes and 350 samples in total:

A: 50
B: 100
C: 200

wa = 350/(3*50) = 2.33
wb = 350/(3*100) = 1.17
wc = 350/(3*200) = 0.58
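
The same arithmetic can be checked with a few lines of plain Python. This is only a minimal sketch of the formula above; the helper name balanced_class_weights and the counts dictionary are illustrative assumptions, not part of Keras or scikit-learn.

```python
def balanced_class_weights(counts):
    """Apply w_j = total_number_samples / (n_classes * n_samples_j).

    `counts` maps each class name to its number of samples
    (hypothetical helper, written only to illustrate the formula).
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


# Sample counts taken from the example above
weights = balanced_class_weights({"A": 50, "B": 100, "C": 200})

for name, w in weights.items():
    print(f"{name}: {w:.2f}")
# A: 2.33
# B: 1.17
# C: 0.58
```

The rarest class gets the largest weight, so its samples contribute more to the loss.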

Alternatively, you can use scikit-learn:

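# Import NumPy and the scikit-learn helper
import numpy as np
from sklearn.utils import class_weight

# Compute balanced class weights; y_train must hold integer class labels,
# not one-hot vectors
classes = np.unique(y_train)
weights = class_weight.compute_class_weight(class_weight='balanced',
                                            classes=classes,
                                            y=y_train)

# Keras expects a dictionary mapping class index to weight
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}

# Use the class weights for training
model.fit(X_train, y_train, class_weight=class_weights)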