Keras Tokenizer sequence to text changes word order

Question

I am training a model on the DUC2004 and Gigaword corpora, and I am using the Keras Tokenizer() as follows:

tokenizer = Tokenizer(num_of_words)
tokenizer.fit_on_texts(list(x_train))

# convert text sequences into integer sequences
train_seq = tokenizer.texts_to_sequences(x_train)
val_seq = tokenizer.texts_to_sequences(y_val)

# pad with zeros up to the maximum length
train_seq = pad_sequences(train_seq, maxlen=max_summary_len, padding='post')
val_seq = pad_sequences(val_seq, maxlen=max_summary_len, padding='post')

When I try to convert the sequences back to text, the word order seems to change and I get strange output.

For example:

Actual sentence:

chechen police were searching wednesday for the bodies of four kidnapped foreigners who were beheaded during a botched attempt to free them

After converting the sequence back to text:

police were wednesday for the bodies of four kidnapped foreigners who were during a to free them

I tried both the Tokenizer's sequences_to_texts() method and mapping the indices back to words manually with word_index.

I can't understand why this happens or how to fix it.

Answer 1

Your x_train should be a list of raw texts, where each element of the list corresponds to one document (text). Try the code below:

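# Import paths assume the Keras bundled with TensorFlow 2.x; adjust them for your installation.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# each list element is one raw document
x_train = ['chechen police were searching wednesday for the bodies of four kidnapped foreigners who were beheaded during a botched attempt to free them',
        'I am training a model on DUC2004 and Giga word corpus, for which I am using Tokenizer() from keras as follows']

tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(x_train)

train_seq = tokenizer.texts_to_sequences(x_train)
train_seq = pad_sequences(train_seq, maxlen=100, padding='post')

# padding zeros have no word and are skipped when converting back
print(tokenizer.sequences_to_texts(train_seq))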

Output:

['chechen police were searching wednesday for the bodies of four kidnapped foreigners who were beheaded during a botched attempt to free them',
 'i am training a model on duc2004 and giga word corpus for which i am using tokenizer from keras as follows']
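A note on why the output in the question may look "reordered": the words are in their original order, but some of them (chechen, searching, beheaded, botched, attempt) are missing. One plausible cause, assuming num_of_words is smaller than the vocabulary of the corpus, is the vocabulary cap: the Keras documentation states that only the most common num_words - 1 words are kept, and words outside that range are skipped during conversion unless an oov_token is set. The sketch below is plain Python that imitates this behavior on a toy corpus. It is not the Keras implementation, and the helper names fit_word_index, encode and decode are hypothetical.

```python
from collections import Counter


def fit_word_index(texts, oov_token=None):
    """Rank words by frequency; index 1 is the most frequent (or the OOV token)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    vocab = [w for w, _ in counts.most_common()]
    if oov_token is not None:
        vocab.insert(0, oov_token)  # reserve index 1 for out-of-vocabulary words
    return {w: i for i, w in enumerate(vocab, start=1)}


def encode(text, word_index, num_words, oov_token=None):
    """Keep only words whose index is below num_words; skip or mark the rest."""
    seq = []
    for w in text.lower().split():
        i = word_index.get(w)
        if i is not None and i < num_words:
            seq.append(i)
        elif oov_token is not None:
            seq.append(word_index[oov_token])
        # with no OOV token, the word is silently skipped
    return seq


def decode(seq, word_index):
    """Map indices back to words, ignoring the padding value 0."""
    index_word = {i: w for w, i in word_index.items()}
    return " ".join(index_word[i] for i in seq if i != 0)


# toy corpus (an assumption for illustration, not the DUC2004 data)
corpus = [
    "police were searching for the bodies",
    "police were there",
    "the police were called",
]
sentence = corpus[0]

# without an OOV token: rare words vanish, order is preserved
plain_index = fit_word_index(corpus)
dropped = decode(encode(sentence, plain_index, num_words=4), plain_index)

# with an OOV token: rare words are replaced by a visible placeholder
oov_index = fit_word_index(corpus, oov_token="<unk>")
marked = decode(encode(sentence, oov_index, num_words=5, oov_token="<unk>"), oov_index)

print(dropped)  # police were the
print(marked)   # police were <unk> <unk> the <unk>
```

If this is what is happening in your case, raising num_of_words or passing oov_token to Tokenizer() should make the dropped words visible again.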

python

nlp

tokenize

keras

tensorflow