total_words must be provided alongside corpus_file argument

Question

I am training doc2vec from a corpus file, and the file is very large.

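import multiprocessing
from gensim.models.doc2vec import Doc2Vec

cores = multiprocessing.cpu_count()  # one worker thread per CPU core
model = Doc2Vec(dm=1, vector_size=200, workers=cores, comment='d2v_model_unigram_dbow_200_v1.0')
model.build_vocab(corpus_file=path)
model.train(corpus_file=path, total_examples=model.corpus_count, epochs=model.epochs)  # model.iter is the deprecated name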

I would like to know how to get the value for total_words.

Edit:

total_words=model.corpus_total_words

Is this correct?

Answer 1

According to the current (gensim 3.8.1, October 2019) Doc2Vec.train() documentation, you shouldn't need to supply both total_examples and total_words, only one or the other:

To support linear learning-rate decay from (initial) alpha to min_alpha, and accurate progress-percentage logging, either total_examples (count of documents) or total_words (count of raw words in documents) MUST be provided. If documents is the same corpus that was provided to build_vocab() earlier, you can simply use total_examples=self.corpus_count.

However, it turns out the new corpus_file option does require both, and the doc comment is wrong. I've filed a bug to fix this documentation oversight.

Yes, the model caches the number of words observed during the most recent build_vocab() in model.corpus_total_words, so total_words=model.corpus_total_words should do the right thing for you.
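As a minimal sketch of the full call sequence, assuming gensim 3.8.x and a model object that behaves like Doc2Vec: the helper name train_from_corpus_file is hypothetical (it is not part of gensim), while build_vocab(), train(), corpus_count, corpus_total_words and epochs are the attributes discussed above.

```python
def train_from_corpus_file(model, path):
    """Build the vocabulary from `path`, then train, passing both counts.

    `model` is assumed to be a gensim Doc2Vec-like object; this helper is
    illustrative only and is not a gensim function.
    """
    # Scanning the file populates model.corpus_count and model.corpus_total_words.
    model.build_vocab(corpus_file=path)
    # The corpus_file code path needs both totals, taken from the vocabulary scan.
    model.train(
        corpus_file=path,
        total_examples=model.corpus_count,
        total_words=model.corpus_total_words,
        epochs=model.epochs,
    )
    return model
```

With gensim installed, you would pass in a model such as Doc2Vec(dm=1, vector_size=200, workers=cores) together with the path to your corpus file.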

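When using the corpus_file space-delimited text input option, the numbers given by corpus_count and corpus_total_words should match the line and word counts you would also see by running wc your_file_path at a command line.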
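If wc is not available, a rough equivalent of that sanity check can be sketched in plain Python. The function name count_lines_and_words is hypothetical, and the sketch assumes a UTF-8 file with one document per line and whitespace-separated tokens; it may differ slightly from wc on files without a trailing newline or with unusual whitespace characters.

```python
from typing import Tuple


def count_lines_and_words(path: str, encoding: str = "utf-8") -> Tuple[int, int]:
    """Count lines and whitespace-delimited tokens, roughly like `wc -l` and `wc -w`."""
    lines = 0
    words = 0
    # Stream the file line by line so a very large corpus never sits in memory.
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            lines += 1
            words += len(line.split())
    return lines, words
```

The two returned numbers can then be compared by eye with model.corpus_count and model.corpus_total_words after build_vocab().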

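(If you were using the classic, plain Python iterable corpus option, which can't use threads as effectively, there would be no benefit to supplying both total_examples and total_words to train(); it would only use one or the other for estimating progress.)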

gensim