Validation accuracy capped at a certain value and graph shows rapid rises and falls

I am trying to build a model with Keras that predicts whether a user will like a movie, based on the user's IMDb data. My dataset is a list of movie ratings with roughly 900 samples. The model classifies each sample into one of three classes based on the rating (1-4 bad, 5-7 good, 8-10 great). The model's accuracy caps out at around 0.6; however I tweak the settings, it never gets past that value. The accuracy graph also concerns me, because it shows very rapid rises and falls. My question is basically: does anyone have suggestions for what I could do to make my model more accurate and more consistent?

My code:

import csv
import math

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras import models, layers
from tensorflow.keras.utils import to_categorical

data = {'Name': [],
    'Rating': [],
    'Running': [],
    'Year': [],
    'Genre': [],
    'Votes': [],
    'Director': [],
    'Writer': [],
    'Production Company': [],
    'Actor1': [],
    'Actor2': [],
    'Actor3': [],
    'Actor4' : []}

labels = []

with open('ratings_expanded.csv', 'r', encoding='ISO-8859-1') as file:
    reader = csv.reader(file, delimiter=',')
    try:
        for row in reader:
            #data['Name'].append(row[0])
            data['Rating'].append(float(row[1]))
            data['Running'].append(int(row[2]))
            data['Year'].append(int(row[3]))
            data['Genre'].append(row[4])
            data['Votes'].append(int(row[5]))
            data['Director'].append(row[6])
            data['Writer'].append(row[7])
            data['Production Company'].append(row[8])
            actors = row[9].split(',')
            data['Actor1'].append(actors[0])
            data['Actor2'].append(actors[1])
            data['Actor3'].append(actors[2])
            data['Actor4'].append(actors[3])

            labels.append(int(row[10]))

    except Exception as e:
        print(str(e))

labels_clean = []
for l in labels:
    if l >= 1 and l < 5:
        labels_clean.append(1)
    elif l >= 5 and l < 8:
        labels_clean.append(2)
    else:
        labels_clean.append(3)


df = pd.DataFrame(data, columns=['Rating', 'Running', 'Year', 'Genre', 'Director', 'Writer', 'Production Company', 'Actor1', 'Actor2', 'Actor3', 'Actor4'])

def Encoder(df):
    columnsToEncode = list(df.select_dtypes(include = ['category', 'object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            df[feature] = le.fit_transform(df[feature])
        except:
            print('Error encoding ' + feature)
    return df

df_processed = Encoder(df)
dataset = df_processed.values


labels = to_categorical(np.asarray(labels_clean)-1)


l = len(dataset)
x_train = dataset[:math.floor(l * 0.75)]
y_train = labels[:math.floor(l * 0.75)]


x_val = dataset[math.floor(l * 0.25):]
y_val = labels[math.floor(l * 0.25):]

model = models.Sequential()
model.add(layers.Dense(128, activation = 'relu', input_dim = 11))
model.add(layers.Dense(64, activation = 'relu'))
model.add(layers.Dense(32, activation = 'relu'))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(3, activation = 'softmax'))



model.compile(optimizer = 'rmsprop',
          loss = 'categorical_crossentropy',
          metrics = ['accuracy'])

history = model.fit(x_train, y_train, epochs = 20,
                batch_size = 64,
                validation_data = (x_val, y_val))

Here is the accuracy graph I am talking about:

Here is an example of one row of my dataset:

Misery,7.8,107,1990,"Drama, Thriller",194775,Rob Reiner,"Stephen King, William Goldman",Castle Rock Entertainment,"James Caan, Kathy Bates, Richard Farnsworth, Frances Sternhagen, Lauren Bacall, Graham Jarvis, Jerry Potter, Thomas Brunelle, June Christopher, Julie Payne, Archie Hahn, Gregory Snegoff, Wendy Bowers, Misery the Pig",8

This is how my dataset looks after encoding:

So, as I already said, my question is why the model behaves this way and whether there is a good course of action for improving its accuracy. Thanks in advance.

Try using an adjustable learning rate with the ReduceLROnPlateau callback. The documentation is here. Set it up to monitor the validation loss. The recommended code is shown below:

rlronp=tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, 
                                             patience=2, verbose=1)

Also use the EarlyStopping callback. The documentation is here. The recommended code is shown below:

es=tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=4, verbose=1,
                                  restore_best_weights=True)
callbacks=[rlronp, es]

In model.fit, add the code callbacks=callbacks. Set epochs to a large value so that early stopping can actually trigger. With restore_best_weights=True, your model will end up with the weights from the epoch that had the lowest validation loss (see the sketch at the end of this answer). I believe your model overfits easily, so I would add at least one or two dropout layers of the form

model.add(layers.Dropout(rate=.4, seed=123))

If you still have overfitting, you may want to add regularizers to the dense layers, of the form

Dense(256, kernel_regularizer=regularizers.l2(l=0.016), activity_regularizer=regularizers.l1(0.006),
      bias_regularizer=regularizers.l1(0.006), activation='relu')
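Putting these suggestions together, the revised model and training call might look roughly like the sketch below. This is only an illustration: the layer sizes, dropout rate, and regularizer strengths are copied from the snippets above rather than tuned for your data, and x_train / x_val are assumed to be the arrays from your question.

import tensorflow as tf
from tensorflow.keras import models, layers, regularizers

# Model with dropout and regularized dense layers to reduce overfitting
model = models.Sequential()
model.add(layers.Dense(128, activation='relu', input_dim=11))
model.add(layers.Dropout(rate=.4, seed=123))
model.add(layers.Dense(64, activation='relu',
                       kernel_regularizer=regularizers.l2(0.016),
                       activity_regularizer=regularizers.l1(0.006),
                       bias_regularizer=regularizers.l1(0.006)))
model.add(layers.Dropout(rate=.4, seed=123))
model.add(layers.Dense(3, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Reduce the learning rate on plateaus and stop on the best epoch
rlronp = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                              patience=2, verbose=1)
es = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=4, verbose=1,
                                      restore_best_weights=True)
callbacks = [rlronp, es]

# epochs is deliberately large; EarlyStopping decides when to stop
history = model.fit(x_train, y_train, epochs=200,
                    batch_size=64,
                    validation_data=(x_val, y_val),
                    callbacks=callbacks)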

I solved my problem by standardizing the data:

l = len(dataset)
x_train = dataset[:math.floor(l * 0.75)]
y_train = labels[:math.floor(l * 0.75)]


x_val = dataset[math.floor(l * 0.75):]
y_val = labels[math.floor(l * 0.75):]


mean = x_train.mean(axis = 0)
x_train -= mean
std = x_train.std(axis = 0)
x_train /= std

x_val -= mean
x_val /= std

This brought the accuracy up to ~80%.
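For reference, the same train-statistics-only standardization can be done with scikit-learn's StandardScaler. This is just an equivalent sketch, not part of the original solution; dataset, labels, l and math are assumed from the snippets above.

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training split only, then apply it to both splits;
# this matches the manual mean/std normalization above
scaler = StandardScaler()
x_train = scaler.fit_transform(dataset[:math.floor(l * 0.75)])
x_val = scaler.transform(dataset[math.floor(l * 0.75):])
y_train = labels[:math.floor(l * 0.75)]
y_val = labels[math.floor(l * 0.75):]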