Character LSTM keeps generating same character sequence
I am training a 2-layer character LSTM with Keras to generate sequences of characters similar to the corpus I am training on. However, when I train the LSTM, the output it generates is the same sequence over and over again.
I have seen suggestions for similar problems, including increasing the LSTM input sequence length, increasing the batch size, adding dropout layers, and increasing the dropout amount. I have tried all of these and none of them seem to have fixed the issue. The one thing that has had some success is adding a random noise vector to each vector the LSTM outputs during generation. This makes sense because the LSTM uses the output of the previous step to generate the next output. However, generally if I add enough noise to break the LSTM out of its repetitive generation, the quality of the output degrades a great deal.
My LSTM training code is as follows:
import sys
import numpy
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from sklearn.model_selection import train_test_split

# [load data from file]
raw_text = collected_statements.lower()

# create mapping of unique chars to integers
chars = sorted(list(set(raw_text + '\b')))
char_to_int = dict((c, i) for i, c in enumerate(chars))
n_chars = len(raw_text)
n_vocab = len(chars)

# prepare the dataset: each 100-character window predicts the following character
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)

# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]),
               return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# define the checkpoint
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1,
                             save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# fix random seed for reproducibility
seed = 8
numpy.random.seed(seed)

# split into 80% for train and 20% for test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=seed)

# train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=18,
          batch_size=256, callbacks=callbacks_list)
My generation code is as follows:
filename = "weights-improvement-18-1.5283.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')
int_to_char = dict((i, c) for i, c in enumerate(chars))

# pick a random seed pattern to start generation from
start = numpy.random.randint(0, len(dataX)-1)
pattern = unpadded_patterns[start]
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")

# generate characters
for i in range(1000):
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    # normalize and add a small amount of noise to try to break the repetition
    x = (x / float(n_vocab)) + (numpy.random.rand(1, len(pattern), 1) * 0.01)
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    sys.stdout.write(result)
    # slide the seed window forward by one character
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\nDone.")
When I run the generation code, I get the same sequence over and over:
we have the best economy in the history of our country." "we have the best
economy in the history of our country." "we have the best economy in the
history of our country." "we have the best economy in the history of our
country." "we have the best economy in the history of our country." "we
have the best economy in the history of our country." "we have the best
economy in the history of our country." "we have the best economy in the
history of our country." "we have the best economy in the history of our
country."
Is there anything else I can try that would help generate something other than the same sequence over and over?
In your character generation, I would suggest sampling from the probabilities your model outputs instead of taking the argmax directly. This is what the Keras example char-rnn does to get diversity.
Here is the code they use for sampling in their example:
import numpy as np

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
In your code you have index = numpy.argmax(prediction). I suggest replacing that with index = sample(prediction) and experimenting with a temperature of your choosing. Keep in mind that higher temperatures make your output more random, while lower temperatures make it less random.
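As a rough sketch of how this could be wired into the generation loop above: model.predict returns a 2-D array of shape (1, n_vocab), so it is the single row prediction[0] that gets passed to sample(); the temperature value below is only an illustrative starting point, not part of the original answer.

# inside the generation loop, replacing the argmax line
prediction = model.predict(x, verbose=0)
# prediction has shape (1, n_vocab); sample from its single row of probabilities
index = sample(prediction[0], temperature=0.5)
result = int_to_char[index]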
The output the model generates is the probability of the next character given the previous characters, but in your text-generation process you simply take the character with the maximum probability. Instead, sampling the next character based on the probability distribution the model generates might help inject some stochasticity (i.e. randomness) into the process. One easy way to do this is to use the np.random.choice function:
# get the probability distribution generated by the model
prediction = model.predict(x, verbose=0)
# sample the next character based on the predicted probabilities
idx = np.random.choice(y.shape[1], 1, p=prediction[0])[0]
# the rest is the same...
This way, the next selected character is not always the most probable one. Instead, all characters have a chance of being selected, guided by the probability distribution generated by your model. This stochasticity not only breaks the repetitive loop, but may also produce some interesting generated text.
Additionally, you can inject even more stochasticity by introducing a softmax temperature into the sampling process, which you can see in @Primusa's answer, which is based on the Keras char-rnn example. Basically, the idea is to reweight the probability distribution so that you can control how surprising (i.e. higher temperature/entropy) or how predictable (i.e. lower temperature/entropy) the next selected character will be.
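For illustration, here is a minimal sketch of how that temperature reweighting could be combined with np.random.choice; the helper name sample_with_temperature and the temperature value are illustrative assumptions, not part of either answer.

import numpy as np

def sample_with_temperature(probs, temperature=1.0):
    # reweight the distribution: low temperature sharpens it toward the argmax,
    # high temperature flattens it toward uniform
    probs = np.asarray(probs).astype('float64')
    reweighted = np.exp(np.log(probs + 1e-10) / temperature)
    reweighted /= np.sum(reweighted)
    # draw a single index from the reweighted distribution
    return np.random.choice(len(reweighted), p=reweighted)

# usage inside the generation loop
prediction = model.predict(x, verbose=0)
index = sample_with_temperature(prediction[0], temperature=0.8)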