Find Repeating Substring In a List
I have a long list of substrings (close to 16000), and I want to find where the repeating cycle starts and stops. I came up with this code as a starting point:
strings = ['1100100100000010',
           '1001001000000110',
           '0010010000001100',
           '0100100000011011',
           '1001000000110110',
           '0010000001101101',
           '1100100100000010',
           '1001001000000110',
           '0010010000001100',
           '0100100000011011']
pat = ['1100100100000010',
       '1001001000000110',
       '0010010000001100']
for i in range(0, len(strings) - 1):
    for j in range(0, len(pat)):
        if strings[i] == pat[j]:
            continue
        if strings[i + 1] == pat[j]:
            print('match', strings[i])
            break
        break
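The problem with this approach is that you have to know what pat is before you can search for it. I would like to start with the first sublist of n items (3 in this case) and search for it; if there is no match, slide down by one substring to the next 3, and so on until the whole list has been traversed or a repeat is found. I believe that if the length is large enough (maybe 10), it will find the repeat without taking too much time.
Here is something that finds all sub-arrays of the string array that match part of a known pattern.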
strings = ['A', 'B', 'C', 'D', 'Z', 'B', 'B', 'C', 'A', 'B', 'C']
pat = ['A', 'B', 'C', 'D']
i = 0
while i < len(strings):
    if strings[i] not in pat:
        i += 1
        continue
    # Position in pat at which the current item occurs.
    start = pat.index(strings[i])
    matches = 0
    for j in range(start, len(pat)):
        if i + j - start >= len(strings):
            break
        if strings[i + j - start] == pat[j]:
            matches += 1
        else:
            break
    if matches:
        print('matched at index %d subsequence length: %d value %s' % (i, matches, strings[i]))
        i += matches
    else:
        i += 1
Output:
matched at index 0 subsequence length: 4 value A
matched at index 5 subsequence length: 1 value B
matched at index 6 subsequence length: 2 value B
matched at index 8 subsequence length: 3 value A
Here is a fairly simple way to find all matches of length >= 1:
def findall(xs):
    from itertools import combinations
    # x2i maps each member of xs to a list of all the
    # indices at which that member appears.
    x2i = {}
    for i, x in enumerate(xs):
        x2i.setdefault(x, []).append(i)
    n = len(xs)
    for ixs in x2i.values():
        if len(ixs) > 1:
            for i, j in combinations(ixs, 2):
                length = 1  # xs[i] == xs[j]
                while (i + length < n and
                       j + length < n and
                       xs[i + length] == xs[j + length]):
                    length += 1
                yield i, j, length
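Then, using the strings list from the question:
for i, j, n in findall(strings):
    print("match of length", n, "at indices", i, "and", j)
This displays (the order of the lines can differ between Python versions, because it follows the iteration order of the dict):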
match of length 4 at indices 0 and 6
match of length 1 at indices 3 and 9
match of length 3 at indices 1 and 7
match of length 2 at indices 2 and 8
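What you do and don't want hasn't been clearly specified, so this lists all matches. You probably don't really want some of them. For example, the match of length 3 at indices 1 and 7 is just the tail of the match of length 4 at indices 0 and 6.
So you'll need to change the code to compute what you really want. Maybe you only want one maximal match? All maximal matches? Only matches of a specific length? And so on.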
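For the "all maximal matches" case, one possible sketch is below. It is not part of the answer above: maximal_matches is a hypothetical name, and it assumes that "maximal" means "cannot be extended to the left or to the right". It uses the same index-map idea, but skips any pair whose preceding items are also equal, since that pair is only the tail of a longer match.

```python
from collections import defaultdict
from itertools import combinations


def maximal_matches(xs):
    """Yield (i, j, length) for repeated runs that are not tails of longer runs.

    Hypothetical helper, assuming the items of xs are hashable.
    """
    positions = defaultdict(list)  # item -> indices at which it appears
    for idx, x in enumerate(xs):
        positions[x].append(idx)

    n = len(xs)
    for indices in positions.values():
        # indices is in increasing order, so i < j for every pair.
        for i, j in combinations(indices, 2):
            # If the items just before i and j are equal too, this match
            # is the tail of a longer one that starts earlier: skip it.
            if i > 0 and xs[i - 1] == xs[j - 1]:
                continue
            length = 1  # xs[i] == xs[j] by construction
            # Extend to the right; j + length < n implies i + length < n.
            while j + length < n and xs[i + length] == xs[j + length]:
                length += 1
            yield i, j, length
```

On the list from the question this should leave only the match of length 4 at indices 0 and 6. Note that the two runs of a reported match may overlap each other when the list contains long stretches of identical items.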
strings = ['1100100100000010',
           '1001001000000110',
           '0010010000001100',
           '0100100000011011',
           '1001000000110110',
           '0010000001101101',
           '1100100100000010',
           '1001001000000110',
           '0010010000001100',
           '0100100000011011']
n = 3  # size of the pattern (window) to look for
patt_dict = {}  # joined window -> number of times it occurs
# The "+ 1" makes sure the last window, strings[-n:], is counted as well.
for i in range(len(strings) - n + 1):
    patt = ' '.join(strings[i:i + n])
    patt_dict[patt] = patt_dict.get(patt, 0) + 1
for key, count in patt_dict.items():
    if count > 1:
        print('Found ' + str(count) + ' repeating instances of ' + key + '.')
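Give this a try. It runs in time linear in the length of the list (for a fixed n). It uses a dictionary to count how many times each pattern of n consecutive items occurs in the list. If a count is greater than 1, then we have a repeating pattern :)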
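Since the question asks where the cycle starts, a small variation on the dictionary idea could record the index at which each window was first seen instead of a count. This is only a sketch under that assumption; first_repeat is a hypothetical name, and windows are stored as tuples rather than joined strings so the items need not be strings.

```python
def first_repeat(strings, n):
    """Return (first, second): the start indices of the first n-item window
    that is seen a second time while scanning left to right, or None if no
    window repeats.

    Hypothetical helper, assuming the items are hashable.
    """
    seen = {}  # window (as a tuple) -> index where it was first seen
    for i in range(len(strings) - n + 1):
        window = tuple(strings[i:i + n])
        if window in seen:
            return seen[window], i
        seen[window] = i
    return None


strings = ['1100100100000010', '1001001000000110', '0010010000001100',
           '0100100000011011', '1001000000110110', '0010000001101101',
           '1100100100000010', '1001001000000110', '0010010000001100',
           '0100100000011011']
print(first_repeat(strings, 3))  # → (0, 6)
```

The difference between the two indices (6 here) is a candidate cycle length; whether the rest of the list really keeps repeating with that period would still need to be checked separately.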