Efficiently calculate cosine similarity using scikit-learn

After preprocessing and transforming the data (BOW, TF-IDF), I need to compute its cosine similarity with every other element of the dataset. Currently, I do this:

cs_title = [cosine_similarity(a, b) for a in tr_title for b in tr_title]
cs_abstract = [cosine_similarity(a, b) for a in tr_abstract for b in tr_abstract]
cs_mesh = [cosine_similarity(a, b) for a in pre_mesh for b in pre_mesh]
cs_pt = [cosine_similarity(a, b) for a in pre_pt for b in pre_pt]

In this example, each input variable, e.g. tr_title, is a SciPy sparse matrix. However, this code runs very slowly. What can I do to optimize it so that it runs faster?

You can reduce the work of each calculation by more than half by taking advantage of two properties of the cosine similarity of two vectors:

  1. The cosine similarity of a vector with itself is 1.
  2. The cosine similarity of vector x with vector y is the same as the cosine similarity of vector y with vector x.

Therefore, compute only the elements above (or below) the diagonal.

EDIT: Here is how you can compute them. Note in particular that cs is just a dummy function that stands in for the real calculation of the similarity coefficient.

title1 = 'A four word title'
title2 = 'A five word title'
title3 = 'A six word title'
title4 = 'A seven word title'

titles = [title1, title2, title3, title4]
N = len(titles)

import numpy as np

similarity_matrix = np.zeros((N, N), dtype=float)  # N x N matrix of zeros

cs = lambda a, b: 10*a + b  # just a 'pretend' calculation of the coefficient

for m in range(N):
    similarity_matrix[m, m] = 1                     # a vector's similarity with itself is 1
    for n in range(m + 1, N):
        similarity_matrix[m, n] = cs(m, n)          # compute only the upper triangle...
        similarity_matrix[n, m] = similarity_matrix[m, n]  # ...and mirror it below the diagonal

print(similarity_matrix)

This is the result:

[[  1.   1.   2.   3.]
 [  1.   1.  12.  13.]
 [  2.  12.   1.  23.]
 [  3.  13.  23.   1.]]
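To connect this back to the question, the dummy cs could be swapped for scikit-learn's cosine_similarity applied to pairs of rows of one of the sparse matrices. This is only a sketch of the idea, assuming tr_title is the sparse matrix from the question (one row per title); it is not part of the original answer:

from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical replacement for the dummy cs above: compare row m with row n
# of the (assumed) sparse matrix tr_title and extract the scalar result.
cs = lambda m, n: cosine_similarity(tr_title[m], tr_title[n])[0, 0]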

To improve performance, you should replace the list comprehensions with vectorized code. This can be implemented easily with SciPy's pdist and squareform, as shown in the snippet below:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform

titles = [
    'A New Hope',
    'The Empire Strikes Back',
    'Return of the Jedi',
    'The Phantom Menace',
    'Attack of the Clones',
    'Revenge of the Sith',
    'The Force Awakens',
    'A Star Wars Story',
    'The Last Jedi',
    ]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)
cs_title = squareform(pdist(X.toarray(), 'cosine'))

Demo:

In [87]: X
Out[87]: 
<9x21 sparse matrix of type '<type 'numpy.int64'>'
    with 30 stored elements in Compressed Sparse Row format>

In [88]: X.toarray()          
Out[88]: 
array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=int64)

In [89]: vectorizer.get_feature_names()
Out[89]: 
[u'attack',
 u'awakens',
 u'back',
 u'clones',
 u'empire',
 u'force',
 u'hope',
 u'jedi',
 u'last',
 u'menace',
 u'new',
 u'of',
 u'phantom',
 u'return',
 u'revenge',
 u'sith',
 u'star',
 u'story',
 u'strikes',
 u'the',
 u'wars']

In [90]: np.set_printoptions(precision=2)

In [91]: print(cs_title)
[[ 0.    1.    1.    1.    1.    1.    1.    1.    1.  ]
 [ 1.    0.    0.75  0.71  0.75  0.75  0.71  1.    0.71]
 [ 1.    0.75  0.    0.71  0.5   0.5   0.71  1.    0.42]
 [ 1.    0.71  0.71  0.    0.71  0.71  0.67  1.    0.67]
 [ 1.    0.75  0.5   0.71  0.    0.5   0.71  1.    0.71]
 [ 1.    0.75  0.5   0.71  0.5   0.    0.71  1.    0.71]
 [ 1.    0.71  0.71  0.67  0.71  0.71  0.    1.    0.67]
 [ 1.    1.    1.    1.    1.    1.    1.    0.    1.  ]
 [ 1.    0.71  0.42  0.67  0.71  0.71  0.67  1.    0.  ]]

Note that X.toarray().shape yields (9L, 21L) because in the toy example above there are 9 titles and 21 different words, whereas cs_title is a 9 x 9 array.
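Two caveats, neither spelled out in the answer above: pdist's 'cosine' metric returns the cosine distance (1 minus the similarity), which is why the diagonal of cs_title is 0 rather than 1; and scikit-learn's cosine_similarity accepts a sparse matrix directly, so for the large sparse inputs in the question you can skip toarray() entirely. A minimal sketch of both, assuming X is the sparse matrix from the snippet above:

from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import cosine_similarity

# pdist's 'cosine' metric gives cosine *distance*; subtract from 1 for similarity.
cs_title_sim = 1 - squareform(pdist(X.toarray(), 'cosine'))

# Alternatively, compute the full pairwise similarity matrix in one call,
# directly on the sparse matrix (no toarray() needed); diagonal entries are 1.
cs_title_sim = cosine_similarity(X)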