How can I measure string similarity between sentences?

I have the following task.

I am given a list of strings like this:

        var strings = [
            'Steve jobs created the iPod when he was at Apple',
            'I really like the new Macbook by Apple',
            'Jony Ive was concerned being fired by Steve Jobs after his return to Apple',
            'The new Macbook has just one USB-C type connector',
            'I like bananas',
            'The brezels I can buy in my local store are much better than the ones in the supermarket',
            'the',
            'foo',
            'Steve'
        ];

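I now want to compare every string with every other string, and for each comparison I want to find out how similar the two are on a scale of 0 to 1 (or 0% to 100%).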

So I googled around a bit and found this: Similarity String Comparison in Java

I followed the instructions there and ported the method similarity(String s1, String s2) to JavaScript:

        function similarity(s1, s2) {
            var longer = s1;
            var shorter = s2;
            if (s1.length < s2.length) {
                longer = s2;
                shorter = s1;
            }
            var longerLength = longer.length;
            if (longerLength == 0) {
                return 1.0;
            }
            return (longerLength - longer.LevenshteinDistance(shorter)) / longerLength;
        }

As the comparison algorithm I used Levenshtein:

        String.prototype.LevenshteinDistance = function (s2) {
            var array = new Array(this.length + 1);
            for (var i = 0; i < this.length + 1; i++)
                array[i] = new Array(s2.length + 1);

            for (var i = 0; i < this.length + 1; i++)
                array[i][0] = i;
            for (var j = 0; j < s2.length + 1; j++)
                array[0][j] = j;

            for (var i = 1; i < this.length + 1; i++) {
                for (var j = 1; j < s2.length + 1; j++) {
                    if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
                    else {
                        array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                        array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
                    }
                }
            }
            return array[this.length][s2.length];
        };

As a test, I run a full loop that compares each string with every other one and prints the result like this:

            for (var i in strings){
                var s = strings[i];
                print('Checking string: "' + s + '"');
                for (var j in strings){
                    print('-----');
                    var s2 = strings[j];
                    print('vs "' + s2 + '"');
                    var sim = similarity(s, s2);
                    print('Similarity: ' + Math.round(sim*100) + '%');
                }
                print('<br>////// NEXT /////////////////////////////////////////////////<br>');
            }

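OK, here is the result: https://jsfiddle.net/wxksfa4w/

Looking at the results, I get some good matches, but also some that make no sense at all, for example:

"Steve jobs created the iPod when he was at Apple" and "I like bananas" match 13%?

"Steve jobs created the iPod when he was at Apple" and "Steve" match only 10%, even though the exact same word "Steve" is used in the first sentence?

How can I get better semantic results? Is Levenshtein the wrong algorithm? As I understand it, Levenshtein counts the number of edits needed to change sentence 1 into sentence 2. So the length of the strings seems to have a major impact on the result, even when there is a semantic similarity.

Any suggestions?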

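You should probably treat words that occur in both sentences as a hint of high similarity. A simple approach is to treat each sentence as a bag of words and use tf-idf.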
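
As a rough sketch of the bag-of-words idea above, the following JavaScript builds one tf-idf vector per sentence and compares two vectors by cosine similarity. The function names (tokenize, tfidfVectors, cosineSimilarity), the naive regex tokenizer, the smoothed idf formula and the choice of cosine similarity are all assumptions made for illustration; the suggestion above only says to use tf-idf.

```javascript
// Split a sentence into lowercase word tokens (a deliberately naive tokenizer).
function tokenize(text) {
    return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Build one tf-idf vector (a Map from word to weight) per document.
function tfidfVectors(docs) {
    var tokenized = docs.map(tokenize);
    // Document frequency: in how many documents does each word occur?
    var df = new Map();
    tokenized.forEach(function (tokens) {
        new Set(tokens).forEach(function (word) {
            df.set(word, (df.get(word) || 0) + 1);
        });
    });
    return tokenized.map(function (tokens) {
        var vec = new Map();
        tokens.forEach(function (word) {
            vec.set(word, (vec.get(word) || 0) + 1); // raw term count
        });
        vec.forEach(function (tf, word) {
            // Smoothed idf (an assumed variant) so that every weight stays positive.
            var idf = Math.log((1 + docs.length) / (1 + df.get(word))) + 1;
            vec.set(word, tf * idf);
        });
        return vec;
    });
}

// Cosine similarity of two sparse vectors; 0 means no shared words.
function cosineSimilarity(a, b) {
    var dot = 0, normA = 0, normB = 0;
    a.forEach(function (weight, word) {
        normA += weight * weight;
        if (b.has(word)) dot += weight * b.get(word);
    });
    b.forEach(function (weight) { normB += weight * weight; });
    if (normA === 0 || normB === 0) return 0;
    return dot / Math.sqrt(normA * normB);
}

var docs = [
    'Steve jobs created the iPod when he was at Apple',
    'I like bananas',
    'Steve'
];
var vectors = tfidfVectors(docs);
console.log(cosineSimilarity(vectors[0], vectors[1])); // 0, the sentences share no words
console.log(cosineSimilarity(vectors[0], vectors[2])); // greater than 0, both contain "steve"
```

Because only whole words are compared, "I like bananas" no longer gets credit for incidental character overlap. Word order and meaning are still ignored, so this measures shared vocabulary rather than true semantic similarity.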

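You can use a normalized longest common subsequence (LCS) similarity: compute the length of the longest common subsequence, then divide it by the length of the shorter string.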

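By the way, the longest common subsequence should not be confused with the longest common substring. For the two strings "This is a long string" and "This is another string, really...":

A longest common subsequence is "This is an string" (17 characters: "This is a", the "n" that "long" and "another" share, then " string"). The more readable "This is a string" is also a common subsequence, but at 16 characters it is one short of the longest.
The longest common substring is "This is a" (9 characters).

The relative LCS similarity is therefore 17/21 ≈ 0.81, where 21 is the length of the shorter string.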

You can find a Java implementation of LCS similarity here: https://github.com/tdebatty/java-string-similarity

A JavaScript implementation is available on Wikibooks: https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_subsequence#JavaScript
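
For reference, here is a minimal self-contained sketch of the normalized LCS similarity described above. The names lcsLength and lcsSimilarity are mine, and returning 1.0 for two empty strings is an assumption that mirrors the similarity function in the question; this is not the code from either linked implementation.

```javascript
// Length of the longest common subsequence of s1 and s2,
// computed with two rolling rows of the usual dynamic-programming table.
function lcsLength(s1, s2) {
    var prev = new Array(s2.length + 1).fill(0);
    for (var i = 1; i <= s1.length; i++) {
        var curr = [0];
        for (var j = 1; j <= s2.length; j++) {
            if (s1[i - 1] === s2[j - 1]) {
                // Characters match: extend the subsequence found so far.
                curr[j] = prev[j - 1] + 1;
            } else {
                // No match: keep the better result of skipping one character.
                curr[j] = Math.max(prev[j], curr[j - 1]);
            }
        }
        prev = curr;
    }
    return prev[s2.length];
}

// LCS length divided by the length of the shorter string, in the range 0..1.
function lcsSimilarity(s1, s2) {
    var shorter = Math.min(s1.length, s2.length);
    if (shorter === 0) {
        // Assumption: two empty strings count as identical.
        return s1.length === s2.length ? 1.0 : 0.0;
    }
    return lcsLength(s1, s2) / shorter;
}

console.log(lcsSimilarity('Steve', 'Steve jobs created the iPod when he was at Apple')); // 1
```

With this measure "Steve" scores 1 against the first sentence from the question, because every character of "Steve" appears there in order. The flip side of dividing by the shorter length is that a very short string such as "the" also scores 1 against any sentence that merely contains its letters in order, and the comparison is case-sensitive unless you lowercase both inputs first.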

SimMetrics has Java code for the Smith-Waterman-Gotoh algorithm, which works well for comparing sentences. I've found Smith-Waterman-Gotoh to be the better algorithm for comparing longer strings such as sentences and article titles.
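
To give an idea of what that algorithm does, here is a small JavaScript sketch of Smith-Waterman local alignment with Gotoh's affine gap penalties. It is not the SimMetrics code: the function names, the scoring constants (MATCH, MISMATCH, GAP_OPEN, GAP_EXTEND) and the normalization by the shorter string's length are assumptions chosen for illustration.

```javascript
var MATCH = 1;         // score for two equal characters (assumed value)
var MISMATCH = -2;     // score for two different characters (assumed value)
var GAP_OPEN = 1;      // penalty for the first character of a gap (assumed value)
var GAP_EXTEND = 0.5;  // penalty for each further gap character (assumed value)

// Best local alignment score between any substring of s1 and any substring of s2.
function localAlignmentScore(s1, s2) {
    var n = s2.length;
    var prevH = new Array(n + 1).fill(0);          // best score ending at (i-1, j)
    var prevF = new Array(n + 1).fill(-Infinity);  // same, but ending in a vertical gap
    var best = 0;
    for (var i = 1; i <= s1.length; i++) {
        var currH = [0];
        var currF = [-Infinity];
        var e = -Infinity; // best score in this row ending in a horizontal gap
        for (var j = 1; j <= n; j++) {
            // Either extend an existing gap or open a new one.
            e = Math.max(e - GAP_EXTEND, currH[j - 1] - GAP_OPEN);
            currF[j] = Math.max(prevF[j] - GAP_EXTEND, prevH[j] - GAP_OPEN);
            var diag = prevH[j - 1] + (s1[i - 1] === s2[j - 1] ? MATCH : MISMATCH);
            // A local alignment may restart anywhere, hence the floor of 0.
            currH[j] = Math.max(0, diag, e, currF[j]);
            if (currH[j] > best) best = currH[j];
        }
        prevH = currH;
        prevF = currF;
    }
    return best;
}

// Score divided by the best score the shorter string could reach, in the range 0..1.
function swgSimilarity(s1, s2) {
    var shorter = Math.min(s1.length, s2.length);
    if (shorter === 0) {
        return s1.length === s2.length ? 1.0 : 0.0;
    }
    return localAlignmentScore(s1, s2) / (shorter * MATCH);
}

console.log(swgSimilarity('Steve', 'Steve jobs created the iPod when he was at Apple')); // 1
```

Because the alignment is local, a short string that occurs verbatim inside a long sentence scores 1 no matter how long the sentence is, which addresses the "Steve" case from the question. The comparison is still character-based, so it measures overlap of text rather than meaning.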