Keras Tokenizer sequence to text changes word order

I am training a model on the DUC2004 and Giga word corpora, for which I am using Tokenizer() from keras as follows:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_of_words)
tokenizer.fit_on_texts(list(x_train))

# convert text sequences into integer sequences
train_seq = tokenizer.texts_to_sequences(x_train)
val_seq = tokenizer.texts_to_sequences(y_val)

# pad with zeros up to the maximum length
train_seq = pad_sequences(train_seq, maxlen=max_summary_len, padding='post')
val_seq = pad_sequences(val_seq, maxlen=max_summary_len, padding='post')

When I try to convert the sequences back to text, the word order changes and I get some strange output.

For example:

Actual sentence:

chechen police were searching wednesday for the bodies of four kidnapped foreigners who were beheaded during a botched attempt to free them

Sequence-to-text conversion:

police were wednesday for the bodies of four kidnapped foreigners who were during a to free them

I tried using the sequences_to_texts() method of Tokenizer() as well as mapping the words via word_index.

I cannot understand why this is happening or how to correct it.
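
For reference, a reverse mapping via word_index might look roughly like this (a minimal sketch, assuming the tokenizer and train_seq from above; the padding id 0 has no entry in word_index and is skipped):

# build an id-to-word lookup from the fitted tokenizer
index_word = {idx: word for word, idx in tokenizer.word_index.items()}

# decode each padded sequence back to a space-joined string, ignoring padding zeros
decoded = [' '.join(index_word[idx] for idx in seq if idx != 0) for seq in train_seq]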

Your x_train should be a list of raw texts, where each element of the list corresponds to one document (text). Try the code below:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

x_train = ['chechen police were searching wednesday for the bodies of four kidnapped foreigners who were beheaded during a botched attempt to free them',
           'I am training a model on DUC2004 and Giga word corpus, for which I am using Tokenizer() from keras as follows']

# fit the tokenizer on the raw text list, one document per element
tokenizer = Tokenizer(1000)
tokenizer.fit_on_texts(x_train)

# encode, pad, then decode back to text
train_seq = tokenizer.texts_to_sequences(x_train)
train_seq = pad_sequences(train_seq, maxlen=100, padding='post')

tokenizer.sequences_to_texts(train_seq)

Output:

['chechen police were searching wednesday for the bodies of four kidnapped foreigners who were beheaded during a botched attempt to free them',
 'i am training a model on duc2004 and giga word corpus for which i am using tokenizer from keras as follows']
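
The same fitted tokenizer can then be reused for new texts in exactly the same way; a short usage sketch (x_val here is a hypothetical list of validation texts, not from the original post):

# hypothetical validation texts, encoded and decoded with the tokenizer fitted above
x_val = ['four kidnapped foreigners were beheaded during a botched attempt to free them']

val_seq = tokenizer.texts_to_sequences(x_val)
val_seq = pad_sequences(val_seq, maxlen=100, padding='post')

print(tokenizer.sequences_to_texts(val_seq))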