Print texts that have a cosine similarity score less than 0.90

Question

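I want to build a deduplication process for my database. To do that, I want to use Python's scikit-learn library to measure the cosine similarity score between each new text and the texts already in the database.

I only want to add documents whose cosine similarity score is less than 0.90. Here is my code: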

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


list_of_texts_in_database = ["More now on the UK prime minister’s plan to impose sanctions against Russia, after it sent troops into eastern Ukraine.",
                             "UK ministers say sanctions could target companies and individuals linked to the Russian government.",
                             "Boris Johnson also says the UK could limit Russian firms ability to raise capital on London's markets.",
                             "He has suggested Western allies are looking at stopping Russian companies trading in pounds and dollars.",
                             "Other measures Western nations could impose include restricting exports to Russia, or excluding it from the Swift financial messaging service.",
                             "The rebels and Ukrainian military have been locked for years in a bitter stalemate, along a frontline called the line of control",
                             "A big question in the coming days, is going to be whether Russia also recognises as independent some of the Donetsk and Luhansk regions that are still under Ukrainian government control",
                             "That could lead to a major escalation in conflict."]


list_of_new_texts = ["This is a totaly new document that needs to be added into the database one way or another.",
                     "Boris Johnson also says the UK could limit Russian firm ability to raise capital on London's market.",
                     "Other measure Western nation can impose include restricting export to Russia, or excluding from the Swift financial messaging services.",
                     "UK minister say sanctions could target companies and individuals linked to the Russian government.",
                     "That could lead to a major escalation in conflict."]


vectorizer = TfidfVectorizer(lowercase=True, analyzer='word', stop_words = None, ngram_range=(1, 1))


list_of_texts_in_database_tfidf = vectorizer.fit_transform(list_of_texts_in_database)
list_of_new_texts_tfidf = vectorizer.transform(list_of_new_texts)

cosineSimilarities = cosine_similarity(list_of_new_texts_tfidf, list_of_texts_in_database_tfidf)
print(cosineSimilarities)

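This code works fine, but I don't know how to map the results back to the texts (that is, how to get the texts whose similarity score is less than 0.90).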

Answer 1

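My suggestion is as follows. You only add those texts whose score is less than (or equal to) 0.9 against every text in the database.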

import numpy as np

idx = np.where((cosineSimilarities <= 0.9).all(axis=1))

idx then holds the indices of the new texts in list_of_new_texts that have no counterpart with a score > 0.9 in the existing list list_of_texts_in_database.
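To see what the mask does, here is a minimal sketch on a hand-made similarity matrix. The numbers and the names toy_similarities, keep_mask and keep_idx are invented for illustration and are not taken from the texts above. Each row stands for a new text and each column for a database text; a row passes only if every entry in it is at most 0.9.

```python
import numpy as np

# Hypothetical similarity matrix: 3 new texts (rows) x 2 database texts (columns)
toy_similarities = np.array([
    [0.10, 0.20],   # unlike anything stored -> keep
    [0.95, 0.30],   # near-duplicate of database text 0 -> drop
    [0.50, 0.90],   # exactly at the threshold -> keep, because of <=
])

# True for the rows where every score is <= 0.9
keep_mask = (toy_similarities <= 0.9).all(axis=1)

# np.where on a 1-D mask returns a 1-tuple; element 0 holds the row indices
keep_idx = np.where(keep_mask)[0]
print(keep_idx)  # [0 2]
```

Using < instead of <= would also drop the third row, so the choice matters only for scores that sit exactly on the threshold.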

To combine them, you can do the following (although someone else may have a cleaner way to do this...):

print(
    list_of_texts_in_database + list(np.array(list_of_new_texts)[idx[0]])
)

Output:

['More now on the UK prime minister’s plan to impose sanctions against Russia, after it sent troops into eastern Ukraine.',
 'UK ministers say sanctions could target companies and individuals linked to the Russian government.',
 "Boris Johnson also says the UK could limit Russian firms ability to raise capital on London's markets.",
 'He has suggested Western allies are looking at stopping Russian companies trading in pounds and dollars.',
 'Other measures Western nations could impose include restricting exports to Russia, or excluding it from the Swift financial messaging service.',
 'The rebels and Ukrainian military have been locked for years in a bitter stalemate, along a frontline called the line of control',
 'A big question in the coming days, is going to be whether Russia also recognises as independent some of the Donetsk and Luhansk regions that are still under Ukrainian government control',
 'That could lead to a major escalation in conflict.',
 'This is a totaly new document that needs to be added into the database one way or another.',
 'Other measure Western nation can impose include restricting export to Russia, or excluding from the Swift financial messaging services.',
 'UK minister say sanctions could target companies and individuals linked to the Russian government.']
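The steps above could also be wrapped into one helper. This is only a sketch under some assumptions: the function name select_non_duplicates, its threshold parameter and the toy sentences are made up for illustration, and the vectorizer is refit on the database texts on every call, which may be too slow for a large database. TfidfVectorizer is used with its defaults, which match the settings in the question (lowercase, word analyzer, unigrams, no stop words).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def select_non_duplicates(database_texts, new_texts, threshold=0.9):
    """Return the new texts whose best match in the database scores <= threshold."""
    vectorizer = TfidfVectorizer()
    # Vocabulary and IDF weights come from the database texts only
    database_tfidf = vectorizer.fit_transform(database_texts)
    new_tfidf = vectorizer.transform(new_texts)

    # Shape: (number of new texts, number of database texts)
    similarities = cosine_similarity(new_tfidf, database_tfidf)

    # The maximum of a row is the score of that new text's closest database text
    best_scores = similarities.max(axis=1)
    return [text for text, score in zip(new_texts, best_scores) if score <= threshold]


# Hypothetical toy data
database = ["the cat sat on the mat", "dogs chase the ball"]
candidates = ["the cat sat on the mat", "stock markets fell sharply today"]

# The exact copy is dropped; the unrelated sentence is kept
print(select_non_duplicates(database, candidates))
```

Note that the new texts are only compared with the database, not with each other, so two near-identical new texts would both be kept.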

Answer 2

Why don't you work within a DataFrame?

import pandas as pd

# One row per new text; score is its highest similarity to any database text
df = pd.DataFrame({'new_text': list_of_new_texts,
                   'score': cosineSimilarities.max(axis=1)})

# Keep only the texts that are not near-duplicates (score below 0.9)
df = df.loc[df.score < 0.9]
df.head()
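A self-contained sketch of the DataFrame idea, assuming the score of interest is each new text's highest similarity against the whole database. The texts, the similarity values and the column names (new_text, score, closest_match) are invented for illustration; in practice the matrix would come from cosine_similarity.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs, invented for illustration
new_texts = ["brand new text", "near copy of a stored text"]
database_texts = ["stored text A", "stored text B"]
toy_similarities = np.array([
    [0.05, 0.12],   # scores of new text 0 against both database texts
    [0.97, 0.20],   # scores of new text 1 against both database texts
])

df = pd.DataFrame({
    "new_text": new_texts,
    # Highest score of each new text against any database text
    "score": toy_similarities.max(axis=1),
    # The database text that produced that highest score
    "closest_match": [database_texts[j] for j in toy_similarities.argmax(axis=1)],
})

# Texts that are safe to add: best score below 0.9
to_add = df.loc[df["score"] < 0.9, "new_text"].tolist()
print(to_add)  # ['brand new text']
```

Keeping closest_match next to the score makes it easy to inspect which stored text caused a new text to be rejected.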

python

scikit-learn