How to find a similar substring inside a large string with a similarity score in Python?

Question

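I am looking for more than a simple similarity score between two texts: I want a similarity score for a substring inside a larger string. Say:

text1 = 'cat is sleeping on the mat'

text2 = 'The cat is sleeping on the red mat in the living room'

In the example above, every word of text1 appears in text2, so the similarity should be 100%.

If some words of text1 are missing from text2, the score should be lower.

I am working with a large dataset of paragraphs of varying sizes, so finding a smaller paragraph inside a larger one with such a similarity score is crucial.

I have only found string-similarity measures that compare two whole strings, such as cosine similarity and difflib similarity, but nothing that scores a substring within another string.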

Answer 1

Based on your description, how about:

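>>> a = "cat is sleeping on the mat"
>>> b = "the cat is sleeping on the red mat in the living room"
>>> a = a.split(" ")
>>> b_words = b.split(" ")  # compare whole words; `word in b` on the raw string would also match substrings
>>> score = 0.0
>>> for word in a:  # for every word in the smaller string
...     if word in b_words:  # if it also occurs in the bigger string, increase the score
...         score += 1
...
>>> score / len(a)  # fraction of matched words out of the total word count
1.0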

And in case it contains a word that is missing from the bigger string, for example:

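>>> c = "the cat is not sleeping on the mat"
>>> c = c.split(" ")
>>> score = 0.0
>>> for w in c:
...     if w in b.split(" "):  # whole-word test against the bigger string
...         score += 1
...
>>> score / len(c)  # "not" is the only unmatched word: 7 / 8
0.875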

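In addition, as @roadrunner suggested, you can store the words of b in a set, for example set(b.split(" ")), to speed things up. Each membership test then takes O(1) time on average, which brings the whole algorithm down to O(n), where n is the number of words.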
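The set-based variant can be sketched as a small reusable function. The name word_overlap_score is hypothetical (it does not come from the answer), and the sketch assumes Python 3, whitespace tokenization, and case-sensitive matching:

```python
def word_overlap_score(small, big):
    """Return the fraction of whitespace-separated words of `small` found in `big`."""
    small_words = small.split()
    if not small_words:
        return 0.0  # assumption: an empty query scores 0
    big_words = set(big.split())  # set membership is O(1) on average
    matched = sum(1 for word in small_words if word in big_words)
    return matched / len(small_words)


b = "the cat is sleeping on the red mat in the living room"
print(word_overlap_score("cat is sleeping on the mat", b))          # 1.0
print(word_overlap_score("the cat is not sleeping on the mat", b))  # 0.875
```

Building the set once per large paragraph pays off when many small paragraphs are scored against it.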

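Edit: You say you have already tried some metrics such as cosine similarity. However, I suspect you may benefit from looking at Levenshtein distance, which could be of some use here as a complement to the solutions provided.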
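One way Levenshtein distance might complement the word-overlap score is to let near misses (typos, plurals) count as matches. The following is only a sketch: levenshtein and fuzzy_overlap_score are hypothetical helper names, the max_distance threshold of 1 is an arbitrary assumption, and comparing every word pair can be slow on large paragraphs.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance between strings s and t."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_overlap_score(small, big, max_distance=1):
    """Fraction of words in `small` within `max_distance` edits of some word in `big`."""
    small_words = small.split()
    if not small_words:
        return 0.0  # assumption: an empty query scores 0
    big_words = set(big.split())
    matched = sum(
        1 for word in small_words
        if any(levenshtein(word, other) <= max_distance for other in big_words)
    )
    return matched / len(small_words)


b = "the cat is sleeping on the red mat in the living room"
print(fuzzy_overlap_score("cats is sleping on the mat", b))  # 1.0
```

Note that a fixed threshold is generous for short words (for instance, "is" is within one edit of "in"), so a length-dependent threshold may be a better fit for real data.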

Answer 2

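Similar to DarkCygbus's answer, but the similarity is based on the total number of characters rather than the number of words. On the other hand, this script only checks for matches against complete words (text_2.split()).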

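from __future__ import division  # true division on Python 2; harmless on Python 3

text_1 = 'cat is sleeping on the mat'
text_2 = 'The cat is sleeping on the red mat in the living room'
words_2 = set(text_2.split())  # split once; matching is case-sensitive and whole-word
no_match = 0  # characters of text_1 in words missing from text_2
match = 0     # characters of text_1 in words found in text_2

for word in text_1.split():
    if word not in words_2:
        no_match += len(word)
    else:
        match += len(word)

similarity = match / (match + no_match)
print('{0:.0%}'.format(similarity))  # prints 100%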
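Note that the membership test in this script is case-sensitive: 'the' only matches because text_2 also contains a lowercase 'the' besides the leading 'The'. Below is a sketch of the same character-weighted score with lowercase normalization. The function name char_weighted_score is hypothetical, lowercasing is an assumption about the desired behavior, and Python 3 is assumed.

```python
def char_weighted_score(small, big):
    """Share of the characters of `small` that belong to words also present in `big`."""
    big_words = set(big.lower().split())
    matched = 0
    total = 0
    for word in small.lower().split():
        total += len(word)
        if word in big_words:
            matched += len(word)
    return matched / total if total else 0.0  # assumption: empty input scores 0


text_2 = 'The cat is sleeping on the red mat in the living room'
score = char_weighted_score('The cat is not sleeping on the mat', text_2)
print('{0:.0%}'.format(score))  # 89% (24 of 27 characters matched)
```

Weighting by characters means that a missing long word lowers the score more than a missing short one.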

Answer 3

You can also use a collections.defaultdict to store the counts of the words in word_a that exist in word_b, then sum() the counts and divide by the length of word_a at the end:

from collections import defaultdict

a = "the cat is not sleeping on the mat"
b = "the cat is sleeping on the red mat in the living room"

word_a = a.split()
word_b = set(b.split())

d = defaultdict(int)
for word in word_a:
    if word in word_b:
        d[word] += 1

print(sum(d.values()) / len(word_a))

Which outputs:

0.875

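Note: since we only care about whether the words in word_a exist in word_b, converting word_b to a set() allows O(1) lookups on average, whereas keeping it as a list would make each lookup O(n). This makes the overall time complexity of the code above O(n).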
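Since the dictionary is only used for counting, collections.Counter can express the same idea a little more compactly. This is a sketch rather than part of the original answer; matched_counts is a hypothetical name and Python 3 (true division) is assumed.

```python
from collections import Counter

a = "the cat is not sleeping on the mat"
b = "the cat is sleeping on the red mat in the living room"

word_a = a.split()
word_b = set(b.split())

# Count only the words of word_a that also occur in word_b.
matched_counts = Counter(word for word in word_a if word in word_b)

print(matched_counts["the"])                       # 2
print(sum(matched_counts.values()) / len(word_a))  # 0.875
```

Keeping per-word counts is only useful if you want to inspect which words matched; for the score alone, a plain running total is enough.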

Tags: python, string, nlp, distance, similarity