How to find "num_words" or the vocabulary size of a Keras tokenizer when one is not assigned?

Question

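So, if I don't pass the num_words argument when initializing Tokenizer(), how do I find the vocabulary size after it has been used to tokenize the training dataset?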

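The reason is that I don't want to limit the tokenizer's vocabulary size, so that I can see how well my Keras model performs without the limit. But then I need to pass this vocabulary size as an argument in the definition of the model's first layer.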

Answer 1

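All the words and their indices are stored in a dictionary that you can access through tokenizer.word_index. The number of unique words is therefore the number of entries in this dictionary: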

num_words = len(tokenizer.word_index) + 1

The + 1 is there because index zero is reserved for padding.

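Note: This solution (obviously) applies when you have not set the num_words argument (i.e. you don't know or don't want to limit the number of words), because word_index contains all the words (not only the most frequent ones), whether or not you set num_words.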
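
To illustrate why the + 1 is needed, here is a minimal sketch in plain Python. The helper build_word_index is hypothetical: it only mimics the general shape of tokenizer.word_index (indices starting at 1, most frequent words first) and is not the Keras implementation. With a real tokenizer you would call fit_on_texts and read tokenizer.word_index instead.

```python
from collections import Counter


def build_word_index(texts):
    """Hypothetical stand-in for tokenizer.word_index (not Keras code).

    Lowercases, splits on whitespace, and numbers words from 1 upward,
    most frequent first. Index 0 is never assigned to a word.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


texts = ["the cat sat", "the dog sat down"]
word_index = build_word_index(texts)

# Largest index actually assigned to a word equals the number of unique words.
max_index = max(word_index.values())

# Valid indices run from 0 (padding) to max_index, so the size is one more.
vocab_size = len(word_index) + 1

print(word_index)
print("unique words:", len(word_index), "vocab_size:", vocab_size)
```

The resulting vocab_size is the kind of value you would pass as the input dimension of the model's first layer, for example input_dim of a Keras Embedding layer.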
