Why does the validation loss keep increasing?
I am new to deep learning models and am trying to train a multi-label text classification model using an LSTM. I have around 2,600 records across 4 categories, with 80% used for training and the rest for validation.
There is nothing complicated in the code: it reads the CSV, tokenizes the data, and feeds it to the model.
However, after 3-4 epochs the validation loss rises above 1 while the training loss tends towards zero. As far as I have searched, this is a case of overfitting. To overcome it I tried different layers and changed the number of units, but the problem is still there.
If I stop after 1-2 epochs, the predictions come out wrong.
Below is my model-creation code:
import tensorflow as tf
from tensorflow.keras import models
from tensorflow.keras.layers import Embedding, GRU, Dropout, LSTM, Dense

ACCURACY_THRESHOLD = 0.75

class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        print(logs.get('val_accuracy'))
        fname = 'Arabic_Model_' + str(logs.get('val_accuracy')) + '.h5'
        if logs.get('val_accuracy') > ACCURACY_THRESHOLD:
            #print("\nWe have reached %2.2f%% accuracy, so we will stopping training." %(acc_thresh*100))
            #self.model.stop_training = True
            self.model.save(fname)
            #from google.colab import files
            #files.download(fname)
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = vocab_len
# Max number of words in each complaint.
MAX_SEQUENCE_LENGTH = 50
# This is fixed.
EMBEDDING_DIM = 100
callbacks = myCallback()
def create_model(vocabulary_size, seq_len):
    model = models.Sequential()
    model.add(Embedding(input_dim=MAX_NB_WORDS+1, output_dim=EMBEDDING_DIM,
                        input_length=seq_len, mask_zero=True))
    model.add(GRU(units=64, return_sequences=True))
    model.add(Dropout(0.4))
    model.add(LSTM(units=50))
    #model.add(LSTM(100))
    #model.add(Dropout(0.4))
    #Bidirectional(tf.keras.layers.LSTM(embedding_dim))
    #model.add(Bidirectional(LSTM(128)))
    model.add(Dense(50, activation='relu'))
    #model.add(Dense(200, activation='relu'))
    model.add(Dense(4, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    model.summary()
    return model

model = create_model(MAX_NB_WORDS, MAX_SEQUENCE_LENGTH)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_4 (Embedding)      (None, 50, 100)           2018600
_________________________________________________________________
gru_2 (GRU)                  (None, 50, 64)            31680
_________________________________________________________________
dropout_10 (Dropout)         (None, 50, 64)            0
_________________________________________________________________
lstm_6 (LSTM)                (None, 14)                4424
_________________________________________________________________
dense_7 (Dense)              (None, 50)                750
_________________________________________________________________
dropout_11 (Dropout)         (None, 50)                0
_________________________________________________________________
dense_8 (Dense)              (None, 4)                 204
=================================================================
Total params: 2,055,658
Trainable params: 2,055,658
Non-trainable params: 0
_________________________________________________________________
model.fit(sequences, y_train, validation_data=(sequences_test, y_test),
          epochs=25, batch_size=5, verbose=1,
          callbacks=[callbacks])
It would be very helpful if I could get a definite way to overcome this overfitting. You can refer to the Colab below to see the full code:
https://colab.research.google.com/drive/13N94kBKkHIX2TR5B_lETyuH1QTC5VuRf?usp=sharing
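Not used in the code above, but a common mitigation directly tied to this symptom (the model is at its best after only a few epochs) is early stopping on the validation loss, which halts training and rolls back to the best weights instead of relying on a fixed 1-2 epoch cut-off. A minimal sketch, assuming the same model and data variables as in the fit call above:

from tensorflow.keras.callbacks import EarlyStopping

# Stop when val_loss has not improved for 2 consecutive epochs and restore the best weights seen
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)

model.fit(sequences, y_train, validation_data=(sequences_test, y_test),
          epochs=25, batch_size=5, verbose=1,
          callbacks=[callbacks, early_stop])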
EDIT:
I am now using a pre-trained embedding layer that I created with gensim, but the accuracy has decreased. Also, my record count is now 4,643.
The code is attached below; 'English_dict.p' is the dictionary I created with gensim.
from pickle import load
from numpy import zeros

embeddings_index = load(open('English_dict.p', 'rb'))
vocab_size = len(embeddings_index) + 1
embedding_model = zeros((vocab_size, 100))
for word, i in embedding_matrix.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_model[i] = embedding_vector

model.add(Embedding(input_dim=MAX_NB_WORDS, output_dim=EMBEDDING_DIM,
                    weights=[embedding_model], trainable=False,
                    input_length=seq_len, mask_zero=True))
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 50, 100)           2746300
_________________________________________________________________
gru_2 (GRU)                  (None, 50, 64)            31680
_________________________________________________________________
dropout_2 (Dropout)          (None, 50, 64)            0
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               98816
_________________________________________________________________
dense_3 (Dense)              (None, 50)                6450
_________________________________________________________________
dense_4 (Dense)              (None, 4)                 204
=================================================================
Total params: 2,883,450
Trainable params: 137,150
Non-trainable params: 2,746,300
_________________________________________________________________
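For reference, the embedding-matrix loop in the edit is usually driven by the fitted Keras Tokenizer's word_index, and the Embedding layer's input_dim has to match the number of rows of the weight matrix. A minimal sketch under those assumptions; `tokenizer` here is an assumption standing for the Tokenizer used to build `sequences` (the posted loop iterates `embedding_matrix.word_index`, but `embedding_matrix` is not defined at that point, so the Tokenizer looks like what was meant):

from pickle import load
from numpy import zeros

embeddings_index = load(open('English_dict.p', 'rb'))   # word -> 100-dim vector, built with gensim
vocab_size = len(tokenizer.word_index) + 1               # +1 keeps index 0 free for padding
embedding_model = zeros((vocab_size, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_model[i] = vector

# input_dim must equal the number of rows in the weight matrix; passing MAX_NB_WORDS
# while the matrix has vocab_size rows makes Keras reject the weights when the layer is built.
model.add(Embedding(input_dim=vocab_size, output_dim=EMBEDDING_DIM,
                    weights=[embedding_model], trainable=False,
                    input_length=seq_len, mask_zero=True))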
Please let me know if I am doing anything wrong. You can refer to the Colab linked above for reference.
That's right, this is classic overfitting. Why it happens: the network has over 2 million trainable parameters (2,055,658), while you have only about 2,600 records (of which 80% are used for training). The network is too large, so instead of generalizing it memorizes the training data.
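To see where those 2 million parameters come from, note that almost all of them sit in the embedding table (vocabulary rows × EMBEDDING_DIM = 100). A quick check with the numbers from the first summary posted above:

embedding_params = 2_018_600                # embedding_4 row in the first model.summary()
total_trainable = 2_055_658                 # "Trainable params" in the same summary
vocab_rows = embedding_params // 100        # EMBEDDING_DIM = 100 -> 20,186 vocabulary rows
print(vocab_rows)                           # 20186
print(embedding_params / total_trainable)   # ~0.98: the embedding table is ~98% of all trainable weights

That is also why the frozen pre-trained embedding in the question's edit leaves only 137,150 trainable parameters while 2,746,300 become non-trainable, as the second summary shows.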
How to fix it:
- start using pre-trained word embeddings in a Keras model (a minimal sketch is given after this list);
- use 90% of the data for training;
- as a rule of thumb, the number of trainable parameters should be at least 2-3 times smaller than the amount of training data.
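A minimal sketch of the first point, reusing the question's MAX_SEQUENCE_LENGTH and EMBEDDING_DIM and assuming an embedding matrix has already been built from the pre-trained gensim vectors (as in the question's edit); the function name create_small_model and the layer sizes are illustrative, not tuned:

from tensorflow.keras import models, layers

def create_small_model(embedding_matrix, seq_len, num_classes=4):
    vocab_size, embedding_dim = embedding_matrix.shape
    model = models.Sequential()
    # Frozen pre-trained embeddings: the ~2M embedding weights no longer have to be learned
    model.add(layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                               weights=[embedding_matrix], trainable=False,
                               input_length=seq_len, mask_zero=True))
    # One small recurrent layer keeps the trainable parameter count low
    model.add(layers.LSTM(32))
    model.add(layers.Dropout(0.4))
    model.add(layers.Dense(num_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

With the 100-dimensional frozen table, the trainable part of such a model is on the order of 17,000 weights (the LSTM plus the softmax layer), far closer to the few thousand available records than the original 2,055,658.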