python 中具有间隙约束的非重叠模式匹配

Question

我想找到总数。 非重叠 序列中出现的模式匹配，间隙约束为 2。

例如。 2982f 2982l 2981l 是使用某种算法找到的模式。我必须找到出现在 2982f 2982f 2982l 2982l 2981l 3111m 3171f 2982f 2982l 2981l … 等序列中的此模式的总数，其中最大间隙约束为 2。

Gap constraint 2 表示在模式 2982f 2982l 2981l 之间，最多允许有 2 个其他单词。而且，最主要的是所有这些匹配都应该是非重叠的。

例如对于序列 2982f 2982f 2982l 2982l 2981l 中的模式 '2982f 2982l 2981l :

2982f 2982f 2982l 2982l 2981l 匹配
2982f 2982l 2982l 2981l 是另一场比赛

所以，这个模式出现了两次，但是我应该把它算作一个，因为这个匹配是重叠的。

到目前为止，我存储了所有索引，其中出现了模式中的单词。

pt = '2982f  2982l  2981l'

seq = '2982f  2982f  2982l  2982l  2981l  3111m 3171f  2982f  2982l  2981l  2752l 2982f  2771f  2771l  2982l  2981l  2981l 3211f 3342f 3341l 3411f 3441f 2982f  2731f  2742f  2982l  2822f  2981l 2811f 2982f  3001f 2992f 2992m  2982l  2981l'

pt_split = pt.split()
pt_dic = collections.OrderedDict()
for i in pt_split:
    pt_dic[i] = []

count_seq = 0
for i in seq.split():
    if i in pt_dic:
        pt_dic[i].append(count_seq)
    count_seq += 1

print pt_dic

输出：

OrderedDict([('2982f', [0, 1, 7, 11, 22, 29]), ('2982l', [2, 3, 8, 14, 25, 33]), ('2981l', [4, 9, 15, 16, 27, 34])])

现在我的想法是，我想以一种可以提取所有非重叠匹配项的方式减去索引，同时牢记间隙约束。但是，我无法理解如何从这一点开始。

有人可以帮忙解决这个问题，或者提供更好的解决方案吗？这真的很有帮助。谢谢。

Answer 1

我认为您已经完成了最困难的部分。现在只需遍历第一个单词的索引，寻找小于间隙的第二个单词的索引，依此类推。然后返回第一个单词的索引，如果您上次找到匹配项，则跳过属于该匹配项的所有索引。

例如，这里是pt中只有两个词的解决方案：

i=0
while i < len(pt_dic[pt_split[0]]):
    ii = pt_dic[pt_split[0]][i]
    #print "ii=" + str(ii)
    j=0
    while j < len(pt_dic[pt_split[1]]):
        jj = pt_dic[pt_split[1]][j]
        #print "jj=" + str(jj)
        if jj > ii and jj <= ii + 2:
            print "Match: (" + str(ii) + "," + str(jj) + ")"
            # Now that we've found a match, skip indices within that match.
            i = next(x[0] for x in enumerate(pt_dic[pt_split[0]]) if x[1] > jj)
            i -= 1 # counteract the increment at the end of the outer loop
            break
        j += 1
    i += 1
    #print "i=" + str(i)

在pt中加上三个字：

i=0
while i < len(pt_dic[pt_split[0]]):
    match=False
    ii = pt_dic[pt_split[0]][i]
    #print "ii=" + str(ii)

    # Start loop at next index after ii
    j = next(x[0] for x in enumerate(pt_dic[pt_split[1]]) if x[1] > ii)
    while j < len(pt_dic[pt_split[1]]) and not match:
        jj = pt_dic[pt_split[1]][j]
        #print "jj=" + str(jj)
        if jj > ii and jj <= ii + 2:

            # Start loop at next index after jj
            k = next(x[0] for x in enumerate(pt_dic[pt_split[2]]) if x[1] > jj)
            while k < len(pt_dic[pt_split[2]]) and not match:
                kk = pt_dic[pt_split[2]][k]
                #print "kk=" + str(kk)
                if kk > jj and kk <= jj + 2:
                    print "Match: (" + str(ii) + "," + str(jj) + "," + str(kk) + ")"
                    # Now that we've found a match, skip indices within that match.
                    i = next(x[0] for x in enumerate(pt_dic[pt_split[0]]) if x[1] > kk)
                    i -= 1 # counteract the increment at the end of the outer loop
                    match=True
                k += 1
        j += 1
    i += 1

    #print "i=" + str(i)

在pt中有四个字：

i=0
while i < len(pt_dic[pt_split[0]]):
    match=False
    ii = pt_dic[pt_split[0]][i]
    #print "ii=" + str(ii)

    # Start loop at next index after ii
    j = next(x[0] for x in enumerate(pt_dic[pt_split[1]]) if x[1] > ii)
    while j < len(pt_dic[pt_split[1]]) and not match:
        jj = pt_dic[pt_split[1]][j]
        #print "jj=" + str(jj)
        if jj > ii and jj <= ii + 2:

            # Start loop at next index after ii
            k = next(x[0] for x in enumerate(pt_dic[pt_split[2]]) if x[1] > jj)
            while k < len(pt_dic[pt_split[2]]) and not match:
                kk = pt_dic[pt_split[2]][k]
                #print "kk=" + str(kk)
                if kk > jj and kk <= jj + 2:

                    # Start loop at next index after kk
                    l = next(x[0] for x in enumerate(pt_dic[pt_split[3]]) if x[1] > kk)
                    while l < len(pt_dic[pt_split[2]]) and not match:
                        ll = pt_dic[pt_split[3]][l]
                        #print "ll=" + str(ll)
                        if ll > kk and ll <= kk + 2:
                            print "Match: (" + str(ii) + "," + str(jj) + "," + str(kk) + "," + str(ll) + ")"
                            # Now that we've found a match, skip indices within that match.
                            i = next(x[0] for x in enumerate(pt_dic[pt_split[0]]) if x[1] > ll)
                            i -= 1
                            match=True
                        l += 1
            k += 1
        j += 1
    i += 1

    #print "i=" + str(i)

我认为现在已经建立了模式，所以泛化到任意数量的单词作为 reader 的练习！

Answer 2

这可以用正则表达式优雅地解决。我们只需要将 pattern 转换为正则表达式，然后计算该正则表达式在输入序列中匹配的频率。

例如，给定输入 pattern = 'A B C' 和 max_gap = 2，我们想创建像

这样的正则表达式

A(arbitrary_word){,2}?B(arbitrary_word){,2}?C

可以用(?:\S+\s+)匹配以空格分隔的任意单词，所以我们得到：

import re

def count_matches(pattern, seq, max_gap):
    parts = map(re.escape, pattern.split())
    sep = r'\s+(?:\S+\s+){{,{}}}?'.format(max_gap)
    regex = r'\b{}\b'.format(sep.join(parts))
    return sum(1 for _ in re.finditer(regex, seq))

测试运行：

count_matches('2982f  2982l  2981l', '2982f  2982f  2982l  2982l  2981l', 2)
# result: 1

count_matches('A B C', 'A B D E C B A B A B C', 2)
# result: 2

python 中具有间隙约束的非重叠模式匹配

Non overlapping pattern matching with gap constraint in python

python

data-mining

pattern-matching