Map the most similar cosine ranking document back to each respective document in my original list

Question

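I can't figure out how to map the top-ranked (#1) most similar document in my list back to each document item in the original list.

I do some preprocessing: n-grams, lemmatization, and TF-IDF. Then I use scikit-learn's linear kernel. I tried extracting features, but I'm not sure how to do that with a CSR matrix...

I have tried various approaches (for example, "Using csr_matrix of items similarities to get most similar items to item X without having to transform csr_matrix to dense matrix").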

import re
import string

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

documents = ['the cat in the hat', 'the catty ate the hat', 'the cat wants the cats hat']

def ngrams(text, n=2):
    # Strip some punctuation, then split the text into character n-grams
    text = re.sub(r'[,\-./]|\sBD', r'', text)
    grams = zip(*[text[i:] for i in range(n)])
    return [''.join(gram) for gram in grams]

lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

# Note: when analyzer is a callable, scikit-learn ignores tokenizer and stop_words,
# so only ngrams() actually shapes the features here
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, analyzer=ngrams, stop_words='english')
tfidf_matrix = TfidfVec.fit_transform(documents)

# Similarity of the first document against every document (including itself)
cosine_similarities = linear_kernel(tfidf_matrix[0:1], tfidf_matrix).flatten()

# Indices of the most similar documents, best first
related_docs_indices = cosine_similarities.argsort()[:-5:-1]

cosine_similarities

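My current example only gives me the first row compared against all documents. How can I get output like the following into a dataframe? (Note that the original documents come from a dataframe.)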

original df col               most similar doc          similarity%
'the cat in the hat'          'the catty ate the hat'   80%
'the catty ate the hat'       'the cat in the hat'      80%
'the cat wants the cats hat'  'the catty ate the hat'   20%

Answer 1

import pandas as pd

df = pd.DataFrame(columns=["original df col", "most similar doc", "similarity%"])
for i in range(len(documents)):
    cosine_similarities = linear_kernel(tfidf_matrix[i:i+1], tfidf_matrix).flatten()
    # make pairs of (index, similarity)
    cosine_similarities = list(enumerate(cosine_similarities))
    # delete the cosine similarity with itself
    cosine_similarities.pop(i)
    # get the tuple with max similarity
    most_similar, similarity = max(cosine_similarities, key=lambda t:t[1])
    df.loc[len(df)] = [documents[i], documents[most_similar], similarity]

Result:

              original df col       most similar doc  similarity%
0          the cat in the hat  the catty ate the hat     0.664119
1       the catty ate the hat     the cat in the hat     0.664119
2  the cat wants the cats hat     the cat in the hat     0.577967
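
The loop above calls linear_kernel once per document. As a possible alternative, the whole pairwise matrix can be computed in one call and the best match per row picked with NumPy. This is only a minimal sketch, not the asker's exact pipeline: it assumes the built-in analyzer='char' with ngram_range=(2, 2) is an acceptable stand-in for the custom ngrams() function, and the names vectorizer, sim, best and result are made up for illustration. It also builds a dense n-by-n matrix, so it is suited to small or medium document sets.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

documents = ['the cat in the hat', 'the catty ate the hat', 'the cat wants the cats hat']

# Assumption: built-in character bigrams stand in for the custom ngrams() analyzer
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 2))
tfidf_matrix = vectorizer.fit_transform(documents)

# Full pairwise matrix: sim[i, j] is the similarity of documents i and j.
# TfidfVectorizer L2-normalizes rows by default, so the dot product is the cosine.
sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Rule out self-matches, then take the best remaining column in each row
np.fill_diagonal(sim, -np.inf)
best = sim.argmax(axis=1)

result = pd.DataFrame({
    'original df col': documents,
    'most similar doc': [documents[j] for j in best],
    'similarity%': sim[np.arange(len(documents)), best],
})
print(result)
```

If the documents already live in a dataframe column, the same idea should apply by passing that column to fit_transform and assigning the two new columns back to the dataframe.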

Tags: python, nlp, pandas, cosine-similarity, scikit-learn