Why is the length of word_index greater than num_words?

Question

I have some code for text preprocessing in deep learning:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words = 10000)
tokenizer.fit_on_texts(X)
tokenizer.word_index

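But when I check the length of tokenizer.word_index, expecting to get 10000, I get 13233. The length of X is 11541 (it is a dataframe column with 11541 rows, in case that matters). So my question arises: which one is the vocabulary size, num_words or the length of word_index? It seems I am confused! Any help is appreciated.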

Answer 1

According to the official docs, the num_words argument is,

the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.

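word_index will contain every word that appears in texts, regardless of num_words. The difference only shows up when you call Tokenizer.texts_to_sequences. For example, consider a few sentences,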

import tensorflow as tf

texts = [
    'hello world' , 
    'hello python' , 
    'python' , 
    'hello java' ,
    'hello java' , 
    'hello python'
]
# Frequency of words, hello -> 5, python -> 3 , java -> 2 , world -> 1
tokenizer = tf.keras.preprocessing.text.Tokenizer( num_words=3 )
tokenizer.fit_on_texts( texts )
print( tokenizer.word_index )

The output of the above snippet will be,

{'hello': 1, 'python': 2, 'java': 3, 'world': 4}

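According to the docs, only the top num_words-1 words (ranked by frequency) are used when converting words to indices. In our case num_words=3, so we expect the tokenizer to use only 2 words for the conversion. The two most common words in texts are hello and python. Consider this example to inspect the output of texts_to_sequences,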

input_seq = [
    'hello' , 
    'hello java' , 
    'hello python' , 
    'hello python java'
]
print( tokenizer.texts_to_sequences( input_seq ) )

Output,

[[1], [1], [1, 2], [1, 2]]

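Note that in the first sentence, hello is encoded as expected. In the second sentence, the word java is not encoded because its index (3) falls outside the kept vocabulary. In the third sentence, both hello and python are encoded, which is the expected behavior. In the fourth sentence, the word java is again dropped from the output.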
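To make the mechanics concrete, here is a minimal pure-Python sketch of the behavior described above. It is not Keras's actual implementation: the helper names fit_word_index and to_sequences are made up for illustration, and the tokenization is simplified to lowercasing and splitting on whitespace. The point it shows is that the index is built from all words, while num_words is applied only at conversion time.

```python
from collections import Counter

def fit_word_index(texts):
    """Hypothetical helper: index every word, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Hypothetical helper: encode texts, skipping words whose index >= num_words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue  # word never seen while fitting
            if num_words is not None and idx >= num_words:
                continue  # word is indexed, but outside the kept vocabulary
            seq.append(idx)
        sequences.append(seq)
    return sequences

texts = ['hello world', 'hello python', 'python',
         'hello java', 'hello java', 'hello python']
word_index = fit_word_index(texts)
sequences = to_sequences(
    ['hello', 'hello java', 'hello python', 'hello python java'],
    word_index,
    num_words=3,
)
print(word_index)  # all four words are present, whatever num_words is
print(sequences)   # only indices 1 and 2 survive
```

Under these assumptions the sketch reproduces the two outputs shown in this answer: a four-entry word_index, and sequences that contain only the indices 1 and 2.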

So my question arises: which is vocabulary size? num_words or the length of word_index?

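As you may have understood by now, num_words determines the vocabulary size, because only the num_words-1 most frequent words are encoded in the output. The remaining words, java and world in our case, are simply omitted from the conversion, even though they still appear in word_index.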
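Applied to the numbers in the question, the same rule can be checked with a small sketch. The function encodable_words is a hypothetical helper, and big_index is a synthetic stand-in for the asker's word_index (13233 made-up words), not their real data.

```python
def encodable_words(word_index, num_words=None):
    """Hypothetical helper: the subset of word_index that can still be encoded."""
    if num_words is None:
        return dict(word_index)
    # Index 0 is never assigned to a word, so indices 1 .. num_words-1 remain.
    return {word: idx for word, idx in word_index.items() if idx < num_words}

# The toy index from this answer.
toy_index = {'hello': 1, 'python': 2, 'java': 3, 'world': 4}
toy_kept = encodable_words(toy_index, num_words=3)

# Synthetic stand-in for the question: 13233 distinct words, num_words=10000.
big_index = {f"w{i}": i for i in range(1, 13234)}
big_kept = encodable_words(big_index, num_words=10000)

print(len(toy_index), len(toy_kept))  # 4 2
print(len(big_index), len(big_kept))  # 13233 9999
```

So len(word_index) reports every distinct word seen during fitting, while num_words caps how many of them can ever appear in the encoded sequences.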

Tags: nlp, python-3.x, deep-learning, keras, tensorflow