Which trained embedding vectors from Gensim (word2vec model) should be used for Tensorflow: unnormalised or normalised ones?
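I want to use vectors trained by Gensim (word2vec model) in a neural network (Tensorflow). There are two sets of weights I could use for this. The first is model.syn0, and the second is model.vectors_norm (available after calling model.init_sims(replace=True)). The second is the set of vectors used to compute similarities. Which one has the correct order (matching model.wv.index2word and model.wv.vocab[X].index) and is the right set of weights for the embedding layer of a neural network?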
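If you use Google's GoogleNews-vectors as the pretrained model, you can use model.syn0. If you use Facebook's fastText word embeddings, you can load the binary file directly.
Below are examples of loading both.
Loading the Google News pretrained embeddings: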
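import gensim
# Load the original word2vec binary the first time (slow); limit caps the number of words read.
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=500000)
# Save in gensim's native format (model_path is a path of your choice) so later loads are faster and can be memory-mapped.
model.save(model_path)
model = gensim.models.KeyedVectors.load(model_path, mmap='r')
# Reuse the raw vectors as the normalised ones to skip recomputation; similarities are only true cosines if the vectors are already unit-normalised.
model.syn0norm = model.syn0
index2word_set = set(model.index2word)
# model[word] gives the vector representation of the word, which can be used to find similarity.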
Loading the fastText pretrained embeddings:
import gensim
from gensim.models import FastText
# Load the Facebook fastText binary the first time.
model = FastText.load_fasttext_format('cc.en.300')
# Save only the word vectors (model.wv) in gensim's native format so that KeyedVectors.load returns a vectors object.
model.wv.save("fasttext_en_bin")
model = gensim.models.KeyedVectors.load("fasttext_en_bin", mmap="r")
index2word_set = set(model.index2word)
# model[word] gives the vector representation of the word, which can be used to find similarity.
General example:
if word in index2word_set:
    feature_vec = model[word]
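On the ordering part of the question: the normalised array is, as far as I know, just the raw array with each row divided by its L2 norm, so both share the same row order, and row i belongs to index2word[i]. Here is a minimal sketch of that relationship using NumPy only. The toy words and numbers are made up for illustration; index2word and weights are stand-ins for model.index2word and model.syn0, not real model output.

```python
import numpy as np

# Toy stand-ins (hypothetical data) for the gensim 3.x attributes:
#   index2word ~ model.index2word, weights ~ model.syn0
index2word = ["king", "queen", "apple"]
weights = np.array([[3.0, 4.0],
                    [6.0, 8.0],
                    [0.0, 2.0]], dtype=np.float32)

# Word -> row index, the same mapping model.vocab[word].index provides.
word2index = {word: i for i, word in enumerate(index2word)}

# Row-wise L2 normalisation; rows are rescaled, never reordered.
norms = np.linalg.norm(weights, axis=1, keepdims=True)
weights_norm = weights / norms

def embed(words, matrix):
    """Look up rows by integer id, as an embedding layer does."""
    ids = [word2index[w] for w in words]
    return matrix[ids]

raw = embed(["queen", "apple"], weights)
unit = embed(["queen", "apple"], weights_norm)
```

Either matrix can serve as the initial weights of an embedding layer, since the indices line up in both. Whether to use the raw or the unit-length rows is a modelling choice: normalising discards vector magnitude, which may or may not matter for your task.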