How to extract a word vector from the Google pre-trained model for word2vec?

Question

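I have the file GoogleNews-vectors-negative300.bin, which contains 3 million word vectors of 300 dimensions each. I think (but am not sure) that this file is loaded when I write the following line: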

from gensim.models.keyedvectors import KeyedVectors

I want to extract the vectors for words that I supply externally in a list named words. This is my code for doing so:

import math
import sys
import gensim
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from gensim.models.keyedvectors import KeyedVectors

words = ['access', 'aeroway', 'airport', 'amenity', 'area', 'atm', 'barrier', 'bay', 'bench', 'boundary', 'bridge', 'building', 'bus', 'cafe', 'car', 'coast', 'continue', 'created', 'defibrillator', 'drinking', 'ele', 'embankment', 'entrance', 'ferry', 'foot', 'fountain', 'fuel', 'gate', 'golf', 'gps', 'grave', 'highway', 'horse', 'hospital', 'house', 'landuse', 'layer', 'leisure', 'man', 'manmade', 'market', 'marketplace', 'maxheight', 'name', 'natural', 'noexit', 'oneway', 'park', 'parking', 'pgs', 'place', 'worship', 'playground', 'police', 'police station', '', 'post', 'post box or mail', 'power', 'powerstation', 'private', 'public', 'railway', 'ref', 'residential', 'restaurant', 'road', 'route', 'school', 'shelter', 'shop', 'source', 'sport', 'toilet', 'toilets', 'tourism', 'unknown', 'vehicle', 'vending', 'vending machine', 'village', 'wall', 'waste', 'water', 'waterway', 'worship'];

model = gensim.models.KeyedVectors.load_word2vec_format(words, binary=True)

M = len(words)
count = 0
for i in range(1,M):
    wi = id2word[words[i]]
    if wi in word2vec.vocab:
        vector[:,count] = model[:,i]
        count = count+1

f = open('word_vectors.csv', 'w')
print(vector, file=f)
f.close()

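But when I run the code, it just freezes my system. Is that because it loads the whole binary file before searching for the words in words? If so, how do I work around it? I suspected this when I got the following warning, which is why I used the warnings module to suppress it: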

c:\Python35\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

The error it gives is:

Traceback (most recent call last):
  File "word2vec.py", line 18, in <module>
    model = gensim.models.KeyedVectors.load_word2vec_format(topic, binary=True) 
  File "c:\Python35\lib\site-packages\gensim\models\keyedvectors.py", line 196, in load_word2vec_format
    with utils.smart_open(fname) as fin:
  File "c:\Python35\lib\site-packages\smart_open\smart_open_lib.py", line 208, in smart_open
    raise TypeError('don\'t know how to handle uri %s' % repr(uri))
TypeError: don't know how to handle uri [['access'], ['aeroway'], ['airport'], ['amenity'], ['area'], ['atm'], ['barrier'], ['bay'], ['bench'], ['boundary'], ['bridge'], ['building'], ['bus'], ['cafe'], ['car'], ['coast'], ['continue'], ['created'], ['defibrillator'], ['drinking'], ['ele'], ['embankment'], ['entrance'], ['ferry'], ['foot'], ['fountain'], ['fuel'], ['gate'], ['golf'], ['gps'], ['grave'], ['highway'], ['horse'], ['hospital'], ['house'], ['landuse'], ['layer'], ['leisure'], ['man'], ['manmade'], ['market'], ['marketplace'], ['maxheight'], ['name'], ['natural'], ['noexit'], ['oneway'], ['park'], ['parking'], ['pgs'], ['place'], ['worship'], ['playground'], ['police'], ['police station'], [''], ['post'], ['post box or mail'], ['power'], ['powerstation'], ['private'], ['public'], ['railway'], ['ref'], ['residential'], ['restaurant'], ['road'], ['route'], ['school'], ['shelter'], ['shop'], ['source'], ['sport'], ['toilet'], ['toilets'], ['tourism'], ['unknown'], ['vehicle'], ['vending'], ['vending machine'], ['village'], ['wall'], ['waste'], ['water'], ['waterway'], ['worship']]

I guess this means that the program is unable to search for the words in the binary file. So how do I solve this?

Answer 1

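Use the code below to extract word vectors from Google's pre-trained word2vec model. Note that load_word2vec_format expects the path to the model file as its first argument, not the list of words; passing the list is what raised the TypeError.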

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

# importing the class does not load the trained model
from gensim.models.keyedvectors import KeyedVectors

# path to the downloaded binary file (adjust to wherever you saved it)
path_to_model = 'GoogleNews-vectors-negative300.bin'

words = ['access', 'aeroway', 'airport']

# this is how you load the model: pass the file path, not the word list
model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)

# to extract a word vector, index the model with the word itself
print(model[words[0]])  # vector for 'access'

Resulting vector:

[ -8.74023438e-02  -1.86523438e-01 .. ]

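Your system is freezing because the model is too large to load comfortably into memory. Try a machine with more memory, or limit the size of the model being loaded.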
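A rough back-of-the-envelope estimate shows why the full load is heavy. The sketch below assumes the commonly cited figures for this file (3 million vectors, 300 dimensions, 32-bit floats) and ignores the extra memory taken by the vocabulary itself. The helper name is made up for illustration.

```python
def estimate_vector_memory_bytes(num_words, dimensions=300, bytes_per_value=4):
    """Approximate size of the raw vector matrix (vocabulary overhead not included)."""
    return num_words * dimensions * bytes_per_value

# Assumed sizes: 3 million entries in the full file, 20000 when loading with a limit.
full = estimate_vector_memory_bytes(3_000_000)
limited = estimate_vector_memory_bytes(20_000)

print(f"full model:  {full / 1e9:.1f} GB")
print(f"limit=20000: {limited / 1e6:.1f} MB")
```

Under these assumptions the vectors alone take several gigabytes, while a 20000-word slice fits in a few tens of megabytes.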

Limit the model size while loading (limit sets the maximum number of word vectors read from the file):

model = KeyedVectors.load_word2vec_format(path_to_model, binary=True, limit=20000)
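Once the model is loaded, the vectors for the whole list can be collected and written out as CSV. The following is a minimal sketch, not a definitive implementation: the helper names collect_vectors and write_vectors_csv are made up for illustration, and a small dictionary with invented 3-dimensional vectors stands in for the loaded model so the snippet runs without the large file. It relies only on `word in model` and `model[word]`, which KeyedVectors also supports, so the real model should be usable in place of the dictionary.

```python
import csv

def collect_vectors(model, words):
    """Return (found, missing): word -> vector for known words, plus the unknown words."""
    found, missing = {}, []
    for word in words:
        if word in model:
            # float() turns NumPy scalars into plain Python floats
            found[word] = [float(x) for x in model[word]]
        else:
            missing.append(word)
    return found, missing

def write_vectors_csv(found, path):
    """Write one row per word: the word followed by its vector components."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        for word, vector in found.items():
            writer.writerow([word] + vector)

# Stand-in for the loaded model (hypothetical toy vectors, not real data).
toy_model = {'access': [0.1, 0.2, 0.3], 'airport': [0.4, 0.5, 0.6]}

words = ['access', 'aeroway', 'airport', 'police station', '']
found, missing = collect_vectors(toy_model, words)
write_vectors_csv(found, 'word_vectors.csv')

print(sorted(found))  # words that had a vector
print(missing)        # words skipped because they are not in the vocabulary
```

Checking membership first means that any entry absent from the vocabulary, such as the empty string, ends up in missing instead of raising a KeyError.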

python

file-handling

gensim

word2vec