Word2Vec in Gensim using model.most_similar

Question

I am new to Gensim's Word2Vec. I want to build a Word2Vec model for a text (taken from Wikipedia: Machine learning) and find the words most similar to 'Machine Learning'.

My current code is as follows:

# import modules & set up logging
from gensim.models import Word2Vec

sentences = "Machine learning is the subfield of computer science that, according to Arthur Samuel, gives computers the ability to learn without being explicitly programmed.[1][2][verify] Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term machine learning in 1959 while at IBM. Evolved from the study of pattern recognition and computational learning theory in artificial intelligence,[3] machine learning explores the study and construction of algorithms that can learn from and make predictions on data[4] – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions,[5]:2 through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders or malicious insiders working towards a data breach,[6] optical character recognition (OCR),[7] learning to rank, and computer vision."
# train word2vec on the sentences
model = Word2Vec(sentences, min_count=1)
vocab = list(model.wv.vocab.keys())
print(vocab[:10])

However, for vocab I get single characters as output:

['M', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'l', 'r']

Please help me get the most similar words using model.most_similar.

Answer 1

The Word2Vec class expects its sentences corpus to be an iterable source of individual items, each of which is a list of word tokens.

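You are providing a single string. When Word2Vec iterates over it, it gets individual characters. When it then tries to interpret each of those characters as a list of tokens, it still gets just a single character, so the only 'words' it sees are single characters.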
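The character-level behavior can be reproduced in plain Python, without Gensim. This is a minimal sketch; the short `text` string is a made-up stand-in for the Wikipedia paragraph.

```python
text = "Machine learning is fun"

# Iterating over a plain string yields one character at a time,
# which is what Word2Vec sees when given a raw string as its corpus.
chars_seen = list(text)[:7]
print(chars_seen)

# Wrapping a token list in an outer list gives the expected shape:
# an iterable of sentences, each sentence a list of word tokens.
corpus = [text.split()]
print(corpus)
```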

At the very least, you would want your corpus to be constructed more like this:

sentences = [
    "Machine learning is the subfield of computer science that, according to Arthur Samuel, gives computers the ability to learn without being explicitly programmed.[1][2][verify] Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term machine learning in 1959 while at IBM. Evolved from the study of pattern recognition and computational learning theory in artificial intelligence,[3] machine learning explores the study and construction of algorithms that can learn from and make predictions on data[4] – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions,[5]:2 through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders or malicious insiders working towards a data breach,[6] optical character recognition (OCR),[7] learning to rank, and computer vision.".split(),
]

This is still just one 'sentence', but it will be split on whitespace into word tokens.
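Going one step beyond a bare `.split()`, the text could be broken into several sentences and cleaned up first. The following is only a rough sketch using the standard `re` module: the helper name `tokenize_text`, the citation-marker pattern, and the naive sentence splitting are all assumptions for illustration, not something the original answer prescribes.

```python
import re

def tokenize_text(text):
    """Split raw text into sentences, each a list of lowercase word tokens."""
    # Drop Wikipedia-style markers such as [1], [verify], or [5]:2.
    text = re.sub(r"\[[^\]]*\](:\d+)?", " ", text)
    # Naive sentence split: ., !, or ? followed by whitespace.
    raw_sentences = re.split(r"(?<=[.!?])\s+", text)
    # Keep runs of letters, digits, apostrophes, and hyphens as tokens.
    tokenized = [re.findall(r"[a-z0-9'-]+", s.lower()) for s in raw_sentences]
    # Discard any empty sentences left over after cleaning.
    return [tokens for tokens in tokenized if tokens]

sample = (
    "Machine learning gives computers the ability to learn.[1][2] "
    "Samuel coined the term in 1959."
)
sentences = tokenize_text(sample)
print(sentences)
```

Lowercasing means 'Machine' and 'machine' end up as the same token, which matters when looking up a word later.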

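Note also that useful word2vec results require large, varied text samples. Toy-sized examples usually will not show the kinds of word similarities or relative word arrangements that word2vec is famous for creating.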

python

gensim

word2vec