Keras Tokenizer sequence to text changes word order
I am training a model on the DUC2004 and Gigaword corpora, for which I am using Tokenizer() from keras as follows:
tokenizer = Tokenizer(num_words=num_of_words)
tokenizer.fit_on_texts(list(x_train))
# convert text sequences into integer sequences
train_seq = tokenizer.texts_to_sequences(x_train)
val_seq = tokenizer.texts_to_sequences(y_val)
# pad with zeros up to the maximum length
train_seq = pad_sequences(train_seq, maxlen=max_summary_len, padding='post')
val_seq = pad_sequences(val_seq, maxlen=max_summary_len, padding='post')
When I try to convert the sequences back to text, the word order changes and I get some strange output.
For example:
Actual sentence:
chechen police were searching wednesday for the bodies of four
kidnapped foreigners who were beheaded during a botched attempt to
free them
Sequence-to-text conversion:
police were wednesday for the bodies of four kidnapped foreigners who
were during a to free them
I tried both the Tokenizer's sequences_to_texts() method and mapping the words back myself using word_index.
I can't understand why this happens or how to correct it.
Your x_train should be a list of raw texts, where each element of the list corresponds to one document (text). Try the code below:
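# import paths assume TensorFlow 2.x with the legacy Keras preprocessing API
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# each list element is one raw document
x_train = ['chechen police were searching wednesday for the bodies of four kidnapped foreigners who were beheaded during a botched attempt to free them',
           'I am training a model on DUC2004 and Giga word corpus, for which I am using Tokenizer() from keras as follows']
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(x_train)
train_seq = tokenizer.texts_to_sequences(x_train)
train_seq = pad_sequences(train_seq, maxlen=100, padding='post')
tokenizer.sequences_to_texts(train_seq)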
Output:
['chechen police were searching wednesday for the bodies of four kidnapped foreigners who were beheaded during a botched attempt to free them',
'i am training a model on duc2004 and giga word corpus for which i am using tokenizer from keras as follows']
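Note that the reconstructed sentence in the question keeps its word order; it is only missing the rarer words (chechen, searching, beheaded, botched, attempt). One possible cause, which is an assumption and not something the question confirms, is the vocabulary cap passed to Tokenizer: with num_words set, only the most frequent words are kept when converting text to sequences, and the rest are skipped unless an out-of-vocabulary token is configured. The library-free sketch below imitates that behaviour so it can be run without Keras. The helpers build_word_index, texts_to_sequences and sequences_to_texts here are hypothetical stand-ins written for illustration, not the Keras implementations.

```python
from collections import Counter


def build_word_index(texts, oov_token=None):
    """Rank words by frequency; index 0 is left free for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    # An OOV token, if given, takes index 1 and shifts the real words up.
    vocab = ([oov_token] if oov_token else []) + [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate(vocab, start=1)}


def texts_to_sequences(texts, word_index, num_words=None, oov_token=None):
    """Encode texts, keeping only indices below num_words."""
    oov_index = word_index.get(oov_token)  # None when no OOV token is used
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is not None and (num_words is None or idx < num_words):
                seq.append(idx)
            elif oov_index is not None:
                seq.append(oov_index)  # mark the dropped word
            # otherwise the word is silently skipped
        sequences.append(seq)
    return sequences


def sequences_to_texts(sequences, word_index):
    """Decode sequences; unknown indices such as padding zeros are skipped."""
    index_word = {i: w for w, i in word_index.items()}
    return [" ".join(index_word[i] for i in seq if i in index_word) for seq in sequences]


texts = ["the police were searching for the bodies", "the police were there"]

# Cap the vocabulary without an OOV token: rare words vanish.
capped_index = build_word_index(texts)
capped = texts_to_sequences(texts, capped_index, num_words=4)
capped_text = sequences_to_texts(capped, capped_index)
print(capped_text[0])  # the police were the

# Same cap with an OOV token: rare words are marked instead of dropped.
oov_word_index = build_word_index(texts, oov_token="<unk>")
marked = texts_to_sequences(texts, oov_word_index, num_words=5, oov_token="<unk>")
marked_text = sequences_to_texts(marked, oov_word_index)
print(marked_text[0])  # the police were <unk> <unk> the <unk>
```

If this is what is happening, the corresponding options on the real Tokenizer would be a larger num_words or the oov_token argument. It would also be consistent with the answer's example round-tripping cleanly, since its two sentences fit well within the cap of 1000.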