Calculating cosine similarity: ValueError: Input must be 1- or 2-d

Question

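Hope everyone is doing well. I am trying to efficiently compute the cosine similarity of a (29805, 40) sparse matrix, created by running HashingVectorizer (sklearn) on my dataset, using the method below. The method comes from @Waylon Flinn's answer to this question.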

import numpy as np

def cosine_sim(A):

    similarity = np.dot(A, A.T)

    # squared magnitude of preference vectors (number of occurrences)
    square_mag = np.diag(similarity)

    # inverse squared magnitude
    inv_square_mag = 1 / square_mag

    # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
    inv_square_mag[np.isinf(inv_square_mag)] = 0

    # inverse of the magnitude
    inv_mag = np.sqrt(inv_square_mag)

    # cosine similarity (elementwise multiply by inverse magnitudes)
    cosine = similarity * inv_mag
    return cosine.T * inv_mag

When I try it with a dummy matrix, everything works fine.

A = np.random.randint(0, 2, (10000, 100)).astype(float)
cos_sim = cosine_sim(A)

But when I try it with my own matrix...

cos_sim = cosine_sim(sparse_matrix)

I get

ValueError: Input must be 1- or 2-d.

Now, calling .shape on my matrix returns (29805, 40). How is that not 2-d? Can someone tell me what I am doing wrong here? The error occurs here (from the Jupyter notebook traceback):

----> 6     square_mag = np.diag(similarity)

Thanks for reading! For context, evaluating sparse_matrix returns this:

<29805x40 sparse matrix of type '<class 'numpy.float64'>'
with 1091384 stored elements in Compressed Sparse Row format>

Answer 1

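OK, while typing up the question I tried converting the matrix to an ndarray object, and it worked. I am posting the question and answer anyway, since they might help someone else. Cheers!

Solution: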

cos_sim = cosine_sim(sparse_matrix.A)
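
On SciPy sparse matrices, .A is shorthand for .toarray(), so the fix above works by densifying the input before any NumPy function touches it. As an alternative, here is a minimal sketch that keeps the computation in scipy.sparse by scaling each row to unit length and then multiplying. The helper name cosine_sim_sparse and the tiny demo matrix are illustrative assumptions and are not part of the original post.

```python
import numpy as np
from scipy import sparse


def cosine_sim_sparse(A):
    """Row-wise cosine similarity for a SciPy sparse matrix (hypothetical helper)."""
    A = sparse.csr_matrix(A, dtype=float)

    # squared magnitude of each row, as a flat 1-d ndarray
    square_mag = np.asarray(A.multiply(A).sum(axis=1)).ravel()

    # inverse magnitude; all-zero rows get 0 instead of inf
    inv_mag = np.zeros_like(square_mag)
    nonzero = square_mag > 0
    inv_mag[nonzero] = 1.0 / np.sqrt(square_mag[nonzero])

    # scale every row to unit length, then take all pairwise dot products
    A_unit = sparse.diags(inv_mag) @ A
    return A_unit @ A_unit.T


# small demo matrix (made up for illustration); the last row is all zeros
demo = sparse.csr_matrix(np.array([[1.0, 0.0],
                                   [1.0, 1.0],
                                   [0.0, 0.0]]))
sim = cosine_sim_sparse(demo)
print(sim.toarray())
```

Either way, the result for 29805 rows is a 29805 x 29805 matrix. Stored densely as float64 that is roughly 7.1 GB, so memory is worth checking before converting the similarity matrix to an ndarray.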

Answer 2

np.diag starts with

 v = asanyarray(v)

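similarity = np.dot(A, A.T) works when A is sparse because it delegates the operation to sparse matrix multiplication. The result will be a sparse matrix; you can check this yourself.

Then try passing that result to np.asanyarray.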
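
A minimal sketch of that check, assuming a small made-up csr_matrix and using the @ operator for the sparse product instead of np.dot. It shows that np.asanyarray does not produce a 2-d numeric array from a sparse matrix but wraps it in a 0-d object array, which is what np.diag then rejects. The sparse matrix's own .diagonal() method is one way to get the diagonal without densifying.

```python
import numpy as np
from scipy import sparse

# tiny stand-in for the real data (values are made up)
A = sparse.csr_matrix(np.array([[1.0, 0.0],
                                [1.0, 1.0]]))

similarity = A @ A.T                  # sparse product, result stays sparse
print(sparse.issparse(similarity))

# the first thing np.diag does to its argument
wrapped = np.asanyarray(similarity)
print(wrapped.ndim, wrapped.dtype)    # not a 2-d float array

try:
    np.diag(similarity)
except ValueError as exc:
    print("np.diag failed:", exc)

# sparse-native way to read the diagonal as a 1-d ndarray
square_mag = similarity.diagonal()
print(square_mag)
```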

python

nlp

numpy

linear-algebra

cosine-similarity