Can we build a word2vec model in a distributed way?

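Currently I have 1.2 TB of text data from which I build a gensim word2vec model. Training takes almost 15 to 20 days to complete.

I want to build a model for 5 TB of text data, which could take months. I need to minimise this execution time. Is there any way to use multiple big systems to create the model?

Please suggest any approach that can help me reduce the execution time.

FYI, all my data is in S3, and I use the smart_open module to stream it.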

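Training a model on a huge corpus will certainly take a very long time because of the large number of weights involved. Suppose your word vectors have 300 components and your vocabulary has 10,000 words. The weight matrix then has 300 × 10,000 = 3 million entries!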
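To see how that figure grows with the vocabulary, the arithmetic can be sketched in a few lines. This is only a rough estimate under stated assumptions: the helper name `word2vec_weight_stats` is made up for illustration, weights are assumed to be 32-bit floats, and two matrices (input embeddings and output weights) of the same shape are assumed.

```python
def word2vec_weight_stats(vocab_size, vector_size, bytes_per_weight=4, n_matrices=2):
    """Rough size estimate for a word2vec model (hypothetical helper).

    Assumes n_matrices weight matrices of shape vocab_size x vector_size,
    each weight stored in bytes_per_weight bytes (4 = float32).
    """
    per_matrix = vocab_size * vector_size
    total = per_matrix * n_matrices
    return per_matrix, total, total * bytes_per_weight


# The example from the answer: 300 components, 10,000-word vocabulary.
per_matrix, total, n_bytes = word2vec_weight_stats(10_000, 300)
print(per_matrix)     # 3000000 weights in one matrix
print(total)          # 6000000 weights in both matrices
print(n_bytes / 1e6)  # 24.0 (megabytes)

# The same estimate for a 3-million-word vocabulary.
_, _, big_bytes = word2vec_weight_stats(3_000_000, 300)
print(big_bytes / 1e9)  # 7.2 (gigabytes)
```

The weight count grows linearly with the vocabulary, which is why trimming rare words and reducing the number of training examples both matter at terabyte scale.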

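To build a model for a huge dataset, I would suggest preprocessing the dataset first. The following preprocessing steps can be applied:

Removing stop words.
Treating word pairs or phrases as single words, e.g. treating New York as new_york.
Subsampling frequent words to reduce the number of training examples.
Modifying the optimization objective with a technique called "negative sampling", which causes each training sample to update only a small fraction of the model's weights.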
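As a rough illustration of the first and third steps, here is a minimal standard-library sketch. It is not gensim's code: the stop-word list is a tiny made-up example, the function names are hypothetical, and the keep-probability formula is the one used in the original word2vec C implementation, with `sample` as the threshold. In gensim, the corresponding knobs are the `sample` and `negative` parameters of `Word2Vec`, and phrase detection is provided by the `Phrases` class.

```python
import math
import random
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "of", "and"}


def keep_probability(count, total, sample=1e-3):
    """Probability of keeping one occurrence of a word during subsampling.

    freq is the word's share of the corpus; words rarer than the threshold
    are always kept, frequent words are kept less and less often.
    """
    freq = count / total
    return min(1.0, (math.sqrt(freq / sample) + 1) * sample / freq)


def preprocess(sentences, sample=1e-3, seed=0):
    """Drop stop words, then randomly subsample frequent words."""
    rng = random.Random(seed)
    filtered = [[w for w in s if w not in STOP_WORDS] for s in sentences]
    counts = Counter(w for s in filtered for w in s)
    total = sum(counts.values())
    return [
        [w for w in s if rng.random() < keep_probability(counts[w], total, sample)]
        for s in filtered
    ]


# A word that makes up half of the corpus is kept only a few percent of the time.
print(keep_probability(count=500, total=1000))
# A word at the threshold frequency is always kept.
print(keep_probability(count=1, total=1000))
```

Both steps shrink the number of training examples before any weights are updated, so their effect on training time is roughly proportional to the share of tokens they remove.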

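The above tasks are also performed in the official word2vec implementation released by Google. Gensim provides a very nice high-level API for most of them. Also, have a look at this blog for further optimization techniques.

Another option is to use the already trained word2vec model released by Google instead of training your own. It is 1.5 GB in size and includes word vectors for a vocabulary of 3 million words and phrases, trained on roughly 100 billion words from a Google News dataset.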

You can use Apache Spark: https://javadoc.io/doc/org.apache.spark/spark-mllib_2.12/latest/org/apache/spark/mllib/feature/Word2Vec.html

Word2Vec creates vector representation of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.

To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.

Apache/Spark at e053c55
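The "train each partition separately, then merge" idea from the quoted documentation can be illustrated with a toy merge step. This is only a sketch, not Spark's code: the function name is hypothetical, models are plain dicts mapping a word to a list of floats, and simple per-word averaging is an assumed merge rule, since the quoted text does not say how the partition models are combined.

```python
from collections import defaultdict


def merge_partition_models(partition_models):
    """Average, per word, the vectors learned independently on each partition.

    partition_models: list of dicts {word: [float, ...]}, one per partition.
    Averaging is an assumption made for this sketch.
    """
    sums = {}
    counts = defaultdict(int)
    for model in partition_models:
        for word, vec in model.items():
            if word not in sums:
                sums[word] = list(vec)
            else:
                sums[word] = [a + b for a, b in zip(sums[word], vec)]
            counts[word] += 1
    return {w: [x / counts[w] for x in s] for w, s in sums.items()}


# Two partitions that saw overlapping vocabularies.
part_a = {"new_york": [1.0, 3.0], "city": [0.0, 2.0]}
part_b = {"new_york": [3.0, 5.0]}
merged = merge_partition_models([part_a, part_b])
print(merged)  # {'new_york': [2.0, 4.0], 'city': [0.0, 2.0]}
```

Because each partition trains without seeing the others' updates until the merge, more partitions give more parallelism but a coarser approximation, which is consistent with the documentation's remark that multiple iterations may be needed for accuracy.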

nlp

distributed-computing

gensim

word2vec

deep-learning