Gensim Word2Vec uses too much memory

Question

I want to train a word2vec model on a 400MB tokenized file. I have been trying to run this Python code:

import operator
import gensim, logging, os
from gensim.models import Word2Vec
from gensim.models import *

class Sentences(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        for line in open(self.filename):
            yield line.split()

def runTraining(input_file,output_file):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    sentences = Sentences(input_file)
    model = gensim.models.Word2Vec(sentences, size=200)
    model.save(output_file)

When I call this function on my file, I get this:

2017-10-23 17:57:00,211 : INFO : collecting all words and their counts
2017-10-23 17:57:04,071 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-10-23 17:57:16,116 : INFO : collected 4735816 word types from a corpus of 47054017 raw words and 1 sentences
2017-10-23 17:57:16,781 : INFO : Loading a fresh vocabulary
2017-10-23 17:57:18,873 : INFO : min_count=5 retains 290537 unique words (6% of original 4735816, drops 4445279)
2017-10-23 17:57:18,873 : INFO : min_count=5 leaves 42158450 word corpus (89% of original 47054017, drops 4895567)
2017-10-23 17:57:19,563 : INFO : deleting the raw counts dictionary of 4735816 items
2017-10-23 17:57:20,217 : INFO : sample=0.001 downsamples 34 most-common words
2017-10-23 17:57:20,217 : INFO : downsampling leaves estimated 35587188 word corpus (84.4% of prior 42158450)
2017-10-23 17:57:20,218 : INFO : estimated required memory for 290537 words and 200 dimensions: 610127700 bytes
2017-10-23 17:57:21,182 : INFO : resetting layer weights
2017-10-23 17:57:24,493 : INFO : training model with 3 workers on 290537 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-10-23 17:57:28,216 : INFO : PROGRESS: at 0.00% examples, 0 words/s, in_qsize 0, out_qsize 0
2017-10-23 17:57:32,107 : INFO : PROGRESS: at 20.00% examples, 1314 words/s, in_qsize 0, out_qsize 0
2017-10-23 17:57:36,071 : INFO : PROGRESS: at 40.00% examples, 1728 words/s, in_qsize 0, out_qsize 0
2017-10-23 17:57:41,059 : INFO : PROGRESS: at 60.00% examples, 1811 words/s, in_qsize 0, out_qsize 0
Killed

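I know that word2vec needs a lot of space, but I still think something is wrong here. As you can see, the estimated memory for this model is 600MB, while my computer has 16GB of RAM. Yet monitoring the process while the code runs shows that it takes up all of my memory and then gets killed.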

As suggested in other posts, I have tried increasing min_count and decreasing size. But even with ridiculous values (min_count=50, size=10), the process stops at 60%.

I also tried exempting python from the OOM killer so that the process would not be killed. When I do that, I get a MemoryError instead of the kill.

What is going on?

(I have a recent laptop with Ubuntu 17.04, 16GB of RAM and an Nvidia GTX 960M. I run Python 3.6 from Anaconda with gensim 3.0, but it does no better with gensim 2.3.)

Answer 1

Your file is a single line, as the log output shows:

2017-10-23 17:57:16,116 : INFO : collected 4735816 word types from a corpus of 47054017 raw words and 1 sentences

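I doubt this is what you want. In particular, the optimized cython code in gensim Word2Vec can only handle sentences of up to 10,000 words; it truncates anything longer and discards the rest. So most of your data would not be considered during training, even if training were to finish.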
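To put a number on that, here is a rough back-of-the-envelope sketch using the figures from the log above and the 10,000-word limit mentioned in this answer. It ignores min_count filtering and downsampling, so treat the result as an order-of-magnitude illustration only.

```python
# Figures taken from the log line quoted above.
raw_words = 47054017
sentences = 1

# Truncation limit as stated in this answer.
MAX_WORDS_PER_SENTENCE = 10000

# With a single sentence, at most this many words can reach training per pass.
used = min(raw_words, MAX_WORDS_PER_SENTENCE * sentences)
fraction = used / raw_words

print("words usable per pass: {}".format(used))
print("fraction of corpus: {:.4%}".format(fraction))
```

That is roughly 0.02% of the corpus per pass.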

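But the bigger problem is that a single 47-million-word line is read into memory as one gigantic string, and split() then turns it into a 47-million-entry list of strings. So your attempt to use a memory-efficient iterator isn't helping at all: the whole file is brought into memory, twice over, for a single 'iteration'.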
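As a rough illustration of how large that list can get, the sketch below extrapolates from a small sample. The helper name `estimate_split_bytes` and the sample text are made up for this example, and the per-object sizes are CPython implementation details, so the result is only a ballpark figure for the list of tokens, not for the whole process.

```python
import struct
import sys


def estimate_split_bytes(sample_text, total_tokens):
    """Extrapolate the memory held by the result of split() on a huge line.

    Hypothetical helper: averages the size of the str objects in a small
    sample and adds one pointer per list slot.
    """
    tokens = sample_text.split()
    avg_str_bytes = sum(sys.getsizeof(t) for t in tokens) / len(tokens)
    pointer_bytes = struct.calcsize("P")  # size of one list slot
    return int(total_tokens * (avg_str_bytes + pointer_bytes))


# Sample text is an assumption; real token lengths will differ.
sample = "the quick brown fox jumps over the lazy dog"
estimate = estimate_split_bytes(sample, 47054017)
print("roughly {:.1f} GB for the token list alone".format(estimate / 1e9))
```

The original string stays alive while split() runs, so peak usage is higher than this figure.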

I still don't see how that would use a full 16GB of RAM, but correcting it might resolve the issue, or make any remaining problems more evident.

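If your tokenized data has no natural line breaks at or below sentence lengths of 10,000 tokens, you could look at how the example corpus class LineSentence, included in gensim so that it can work with the text8 or text9 corpora (which also lack line breaks), limits each yielded sentence to 10,000 tokens: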

https://github.com/RaRe-Technologies/gensim/blob/58b30d71358964f1fc887477c5dc1881b634094a/gensim/models/word2vec.py#L1620
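A minimal sketch of the same idea, not gensim's own implementation: the class name `ChunkedSentences`, the `block_size` parameter and the UTF-8 encoding are assumptions made for this example. Unlike the class in the question, it reads the file in bounded blocks, so a single giant line is never held in memory at once, and it treats the whole file as one token stream (line breaks are not used as sentence boundaries).

```python
class ChunkedSentences(object):
    """Yield lists of at most max_len tokens, reading the file block by block."""

    def __init__(self, filename, max_len=10000, block_size=1 << 20):
        self.filename = filename
        self.max_len = max_len
        self.block_size = block_size  # characters read per call; assumption

    def __iter__(self):
        sentence = []
        leftover = ''  # possibly incomplete token at the end of the last block
        with open(self.filename, encoding='utf-8') as f:
            while True:
                block = f.read(self.block_size)
                if not block:
                    break
                text = leftover + block
                tokens = text.split()
                # If the block did not end on whitespace, the last token may
                # continue in the next block, so hold it back.
                if tokens and not text[-1].isspace():
                    leftover = tokens.pop()
                else:
                    leftover = ''
                for token in tokens:
                    sentence.append(token)
                    if len(sentence) >= self.max_len:
                        yield sentence
                        sentence = []
        if leftover:
            sentence.append(leftover)
        if sentence:
            yield sentence
```

An instance such as `ChunkedSentences(input_file)` could be passed where the question's code passes `Sentences(input_file)`, since it can be iterated more than once.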

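(It's probably not a contributing factor, but you may also want to use a with context manager to ensure that your open()ed file is closed promptly once the iterator is exhausted.)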
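For a file that does have natural line breaks, the change to the question's class could look like the sketch below. It only adds the with block; it does nothing about overly long lines.

```python
class Sentences(object):
    """Yield one list of tokens per line of the file."""

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        # The with block closes the file once iteration over it finishes.
        with open(self.filename) as f:
            for line in f:
                yield line.split()
```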

Tags: memory, python-3.x, gensim, word2vec