How to calculate the cosine similarity of two string lists with sklearn?
I have two lists of strings like this:
a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']
I want to calculate the cosine similarity of these two lists, and I already know how to do it manually:
from collections import Counter
# count word occurrences
a_vals = Counter(a_file)
b_vals = Counter(b_file)
# convert to word-vectors
words = list(a_vals.keys() | b_vals.keys())
a_vect = [a_vals.get(word, 0) for word in words]
b_vect = [b_vals.get(word, 0) for word in words]
# find cosine
len_a = sum(av*av for av in a_vect) ** 0.5
len_b = sum(bv*bv for bv in b_vect) ** 0.5
dot = sum(av*bv for av,bv in zip(a_vect, b_vect))
cosine = dot / (len_a * len_b)
print(cosine)
However, when I try to use cosine_similarity from sklearn, it raises the error: could not convert string to float: 'a'
How can I fix it?
from sklearn.metrics.pairwise import cosine_similarity
a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']
print(cosine_similarity(a_file, b_file))
It seems cosine_similarity needs:
- word vectors (numbers, not strings),
- 2D data (a list containing many word vectors):
print(cosine_similarity( [a_vect], [b_vect] ))
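Alternatively (a sketch, not part of the original answer), you can let sklearn build the count vectors for you with CountVectorizer; passing a callable analyzer makes it accept the token lists directly instead of raw text strings:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']
# a callable analyzer receives each raw "document" (here a token list) unchanged
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
vectors = vectorizer.fit_transform([a_file, b_file])  # 2x6 sparse count matrix
print(cosine_similarity(vectors[0], vectors[1]))  # [[0.28867513]]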
Full working code:
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']
# count word occurrences
a_vals = Counter(a_file)
b_vals = Counter(b_file)
# convert to word-vectors
words = list(a_vals.keys() | b_vals.keys())
a_vect = [a_vals.get(word, 0) for word in words]
b_vect = [b_vals.get(word, 0) for word in words]
# find cosine
len_a = sum(av*av for av in a_vect) ** 0.5
len_b = sum(bv*bv for bv in b_vect) ** 0.5
dot = sum(av*bv for av,bv in zip(a_vect, b_vect))
cosine = dot / (len_a * len_b)
print(cosine)
print(cosine_similarity([a_vect], [b_vect]))
Result:
0.2886751345948129
[[0.28867513]]
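As a cross-check (a sketch assuming scipy is installed; scipy is not used above), scipy reports the cosine distance, which is one minus the cosine similarity:
from scipy.spatial.distance import cosine as cosine_distance
# scipy's cosine() returns the distance, so similarity = 1 - distance
print(1 - cosine_distance(a_vect, b_vect))  # ≈ 0.28867513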
EDIT:
You can also put all the data in a single list (so the second argument is None), and it will compare all pairs: (a,a), (a,b), (b,a), (b,b).
print(cosine_similarity( [a_vect, b_vect] ))
Result:
[[1. 0.28867513]
[0.28867513 1. ]]
You can use a longer list [a, b, c, ...] and it will check all possible pairs.
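For example (an illustrative sketch with a hypothetical third list c_file), three count vectors built on one shared vocabulary give a 3x3 matrix of all pairwise similarities:
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
a_vals = Counter(['a', 'b', 'c'])
b_vals = Counter(['b', 'x', 'y', 'z'])
c_vals = Counter(['a', 'z', 'z'])  # hypothetical third list, just for illustration
# build every vector on the union of all vocabularies so they share one length
words = list(a_vals.keys() | b_vals.keys() | c_vals.keys())
vects = [[vals.get(word, 0) for word in words] for vals in (a_vals, b_vals, c_vals)]
print(cosine_similarity(vects))  # 3x3 matrix with every pair, 1.0 on the diagonal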