Why use TaggedBrownCorpus when training gensim doc2vec

Question

I am currently using a custom corpus that yields tagged documents:

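from gensim.models.doc2vec import TaggedDocument

class ClassifyCorpus(object):
    def __iter__(self):
        # train_data is the path to a file with one "id:text" record per line
        with open(train_data) as fp:
            for line in fp:
                # Split on the first colon only, so colons inside the text are kept
                doc_id, text = line.rstrip('\n').split(':', 1)
                yield TaggedDocument(text.split(), [doc_id])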

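Looking at the source code for the Brown corpus reader (TaggedBrownCorpus), I see that it just reads from a directory and handles the tagging of the documents for me.

I tested it, but saw no improvement in training speed.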

Answer 1

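You shouldn't use TaggedBrownCorpus. It is just a demo class for reading one particular small demo dataset that ships with gensim for its unit tests and intro tutorials.

It handles that dataset's on-disk format in a reasonable way, but any other efficient way of getting your data into a re-iterable sequence of TaggedDocument-like objects is just as good.

So feel free to use it as a model if that helps, but don't treat it as a requirement or a "best practice".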
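To make "re-iterable" concrete, here is a minimal sketch contrasting a class with `__iter__` (which starts a fresh pass every time) with a plain generator (which is exhausted after one pass). The names `TaggedDoc`, `RECORDS`, `ReiterableCorpus` and `one_shot` are hypothetical and exist only for this illustration; `TaggedDoc` is a stand-in with the same `words` and `tags` fields as gensim's `TaggedDocument`, so the snippet runs without gensim installed.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (assumption:
# used here only so the sketch runs without gensim installed).
TaggedDoc = namedtuple("TaggedDoc", "words tags")

# Hypothetical "id:text" records, in the same format as the question's file.
RECORDS = ["d1:the quick brown fox", "d2:jumps over the lazy dog"]


class ReiterableCorpus:
    """Re-iterable: every call to __iter__ starts a fresh pass."""

    def __init__(self, records):
        self.records = records

    def __iter__(self):
        for record in self.records:
            doc_id, text = record.split(":", 1)
            yield TaggedDoc(text.split(), [doc_id])


def one_shot(records):
    """A plain generator: exhausted after a single pass."""
    for record in records:
        doc_id, text = record.split(":", 1)
        yield TaggedDoc(text.split(), [doc_id])


corpus = ReiterableCorpus(RECORDS)
first_pass = list(corpus)
second_pass = list(corpus)  # the same documents again

gen = one_shot(RECORDS)
gen_first = list(gen)
gen_second = list(gen)      # empty: the generator is already consumed
```

This matters because training typically iterates over the corpus more than once (a vocabulary scan plus the training epochs), so a bare generator would leave later passes with no data. With gensim installed, you would presumably swap `TaggedDoc` for `TaggedDocument`; the ClassifyCorpus class in the question already follows the re-iterable pattern.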
