Compare a list with the rows in pandas using Cosine similarity and get the rank
I have a Pandas dataframe and a user input. I need to compare the user input with every row in the dataframe and get a ranked list of the dataframe's rows based on cosine similarity.
Department Country Age Grade Score
Math India Young A 97
Math India Young B 86
Math India Young D 68
Science India Young A 92
Science India Young B 81
Science India Young C 76
Social India Young B 88
Social India Young D 62
Social India Young C 72
User input:
Country Age Grade Score
India Young B 84
India Young D 65
India Young A 98
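I would prefer to treat every row of the dataframe as a list, and the user input as a list as well.
Say User_list1 = ['India','Young','B','84']
I then want to compare it with each row of the dataframe using cosine similarity (treating both as lists) and get a ranked list of Department as output.
In my example, the output would be a ranked list of Department:
Out = ['Math','Science','Social']
The ranking should be based on the cosine similarity results.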
Considering the two dataframes above,
df
Department Country Age Grade Score
0 Math India Young A 97
1 Math India Young B 86
2 Math India Young D 68
3 Science India Young A 92
4 Science India Young B 81
5 Science India Young C 76
6 Social India Young B 88
7 Social India Young D 62
8 Social India Young C 72
input
Country Age Grade Score
0 India Young B 84
1 India Young D 65
2 India Young A 98
One possible solution is,
import numpy as np
from collections import OrderedDict
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
Use the scikit-learn package to convert the categorical features into numeric features,
df['Country'] = le.fit_transform(df['Country'])
df['Age'] = le.fit_transform(df['Age'])
df['Grade'] = le.fit_transform(df['Grade'])
df
Output:
Department Country Age Grade Score
0 Math 0 0 0 97
1 Math 0 0 1 86
2 Math 0 0 3 68
3 Science 0 0 0 92
4 Science 0 0 1 81
5 Science 0 0 2 76
6 Social 0 0 1 88
7 Social 0 0 3 62
8 Social 0 0 2 72
input['Country'] = le.fit_transform(input['Country'])
input['Age'] = le.fit_transform(input['Age'])
input['Grade'] = le.fit_transform(input['Grade'])
input
Output:
Country Age Grade Score
0 0 0 1 84
1 0 0 2 65
2 0 0 0 98
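One thing to watch in the encoding step: calling fit_transform separately on df and on input fits a fresh mapping each time, so the same label can receive different codes (in the tables above, grade D becomes 3 in df but 2 in input). A minimal sketch of one way around this, assuming every label in the input also occurs in df, is to fit one encoder per column on df and reuse it for the input. The names user_input, encoders, df_enc and input_enc are illustrative and not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Department": ["Math"] * 3 + ["Science"] * 3 + ["Social"] * 3,
    "Country": ["India"] * 9,
    "Age": ["Young"] * 9,
    "Grade": ["A", "B", "D", "A", "B", "C", "B", "D", "C"],
    "Score": [97, 86, 68, 92, 81, 76, 88, 62, 72],
})
# Named user_input here to avoid shadowing the built-in input().
user_input = pd.DataFrame({
    "Country": ["India"] * 3,
    "Age": ["Young"] * 3,
    "Grade": ["B", "D", "A"],
    "Score": [84, 65, 98],
})

encoders = {}
df_enc = df.copy()
input_enc = user_input.copy()
for col in ["Country", "Age", "Grade"]:
    # Fit on the reference frame only, then apply the same mapping to both.
    enc = LabelEncoder().fit(df[col])
    encoders[col] = enc
    df_enc[col] = enc.transform(df[col])
    # transform() raises ValueError for labels it did not see during fit.
    input_enc[col] = enc.transform(user_input[col])

print(input_enc)
```

With a shared mapping, grade D is encoded as 3 in both frames, so the similarity values (and possibly the resulting departments) can differ from the ones listed in this answer.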
Define a cosine-similarity function,
def cosine_similarity(a, b):
    # dot product divided by the product of the two vector norms
    nom = np.sum(np.multiply(a, b))
    denom = np.sqrt(np.sum(np.square(a))) * np.sqrt(np.sum(np.square(b)))
    sim = nom / denom
    return sim
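The same quantity can also be computed for all pairs at once instead of one pair per call. The following is only a sketch, assuming purely numeric rows with no all-zero vector (which would divide by zero); cosine_similarity_matrix is a hypothetical helper name, and the sample values are taken from the encoded tables above.

```python
import numpy as np

def cosine_similarity_matrix(queries, rows):
    """Return an (n_queries, n_rows) matrix of cosine similarities."""
    q = np.asarray(queries, dtype=float)
    r = np.asarray(rows, dtype=float)
    # Scale every vector to unit length; the dot product of two unit
    # vectors is their cosine similarity.
    q_unit = q / np.linalg.norm(q, axis=1, keepdims=True)
    r_unit = r / np.linalg.norm(r, axis=1, keepdims=True)
    return q_unit @ r_unit.T

queries = np.array([[0, 0, 1, 84], [0, 0, 0, 98]])
rows = np.array([[0, 0, 0, 97], [0, 0, 1, 86], [0, 0, 3, 68]])
sims = cosine_similarity_matrix(queries, rows)
print(sims.shape)  # (2, 3)
```

Entry sims[i, j] corresponds to cosine_similarity(queries[i], rows[j]) from the function above. scikit-learn also ships sklearn.metrics.pairwise.cosine_similarity, which returns the same kind of matrix.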
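dept = list(df['Department'].values)
dept = list(OrderedDict.fromkeys(dept).keys())  # unique departments, in order of appearance
results = []
for i in range(len(input)):
    # similarity of input row i to every row of df
    similarity = []
    for j in range(len(df)):
        a = input.iloc[i]
        b = df.iloc[j, 1:]  # skip the Department column
        c_sim = cosine_similarity(a, b)
        similarity.append(c_sim)
    # best similarity within each department (assumes 3 consecutive rows per department)
    max_similarity = []
    for k in range(0, len(df), 3):
        max_3 = max(similarity[k:k+3])
        max_similarity.append(max_3)
    # keep the department whose best row is closest to the input row
    max_idx = max_similarity.index(max(max_similarity))
    results.append(dept[max_idx])
results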
Output:
['Math', 'Social', 'Math']
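The loop above keeps only the single best department per input row, whereas the question asks for a ranked list. A possible extension, offered as a sketch rather than a definitive implementation, is to take the best similarity per department and sort the departments by it. It assumes the categorical columns were encoded with one shared mapping for both frames (A=0, B=1, C=2, D=3), and it does not rely on each department having exactly three consecutive rows. The helper rank_departments and the array names are hypothetical.

```python
import numpy as np

departments = ["Math"] * 3 + ["Science"] * 3 + ["Social"] * 3
# Encoded feature columns: Country, Age, Grade, Score (shared mapping assumed).
df_features = np.array([
    [0, 0, 0, 97], [0, 0, 1, 86], [0, 0, 3, 68],
    [0, 0, 0, 92], [0, 0, 1, 81], [0, 0, 2, 76],
    [0, 0, 1, 88], [0, 0, 3, 62], [0, 0, 2, 72],
], dtype=float)
input_features = np.array([
    [0, 0, 1, 84], [0, 0, 3, 65], [0, 0, 0, 98],
], dtype=float)

def rank_departments(query, rows, labels):
    """Rank labels by the best cosine similarity of any of their rows to query."""
    sims = rows @ query / (np.linalg.norm(rows, axis=1) * np.linalg.norm(query))
    best = {}
    for label, sim in zip(labels, sims):
        best[label] = max(sim, best.get(label, -np.inf))
    # sorted() is stable, so tied departments keep their order of appearance.
    return sorted(best, key=best.get, reverse=True)

rankings = [rank_departments(q, df_features, departments) for q in input_features]
for ranking in rankings:
    print(ranking)
```

For the first input row (India, Young, B, 84) this yields Math, Science, Social, which is the ordering the question expects. Because Score is far larger than the encoded categories, it dominates the similarity, so the differences between departments are very small.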