Determine if string has 3 or more duplicate sequential characters in Python
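I'm going through nearly 120 billion string combinations. I'm trying to find the most speed-optimized way to determine whether a given string has 3 (or more) sequentially repeating characters.
For example:
string = "blah"
The test should return False.
string = "blaaah"
This should return True.
I've managed to implement a basic for loop that walks over each string's characters and compares each character to the next to see if they match. This works, but given the number of strings I'm filtering, I'd really like to optimize it.
Any suggestions? Thanks!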
Use the re module.
>>> import re
>>> def consecutive(string):
...     # (.) captures one character; \1\1 requires two more copies of it
...     if re.search(r'(.)\1\1', string):
...         print('True')
...     else:
...         print('False')
...
>>> consecutive('blah')
False
>>> consecutive('blaah')
False
>>> consecutive('blaaah')
True
>>> consecutive('blaaaah')
True
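() is called a capturing group; it captures the characters matched by the pattern inside it. \1 is a back-reference to whatever group 1 captured. In the string blaaah, (.) captures the first a, and \1\1 then checks that two more a characters follow immediately. So aaa is matched.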
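The same back-reference idea extends to other run lengths. Below is a minimal sketch, assuming a hypothetical helper named has_run (not part of the answer above) that builds the pattern from a minimum run length n. It also passes re.DOTALL so that . matches newlines too, which is an assumption about the input rather than something the question states.

```python
import re

def has_run(s, n=3):
    """Return True if s contains n or more identical consecutive characters.

    has_run is a hypothetical helper name used for illustration only.
    """
    # (.) captures one character; \1{n-1} requires n-1 more copies right after it.
    pattern = r'(.)\1{%d}' % (n - 1)
    # re.DOTALL lets '.' match a newline as well (an assumption about the data).
    return re.search(pattern, s, re.DOTALL) is not None
```

With n left at its default, has_run('blah') and has_run('blaah') are both False, while has_run('blaaah') is True.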
You could use itertools.groupby() here. You still have to scan the string, but so does a regular expression:
from itertools import groupby
three_or_more = (char for char, group in groupby(input_string)
                 if sum(1 for _ in group) >= 3)
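This produces a generator; iterate over it to list every character that occurs 3 or more times in a row, or use any() to see whether there is at least one such group: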
if any(three_or_more):
    # found at least one group of consecutive characters that
    # consists of 3 or more.
    pass
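To illustrate the "iterate over it" option, here is a small sketch that collects the repeated characters into a list. The function name repeated_chars and its n parameter are made up for this example.

```python
from itertools import groupby

def repeated_chars(s, n=3):
    """List each character that starts a run of n or more identical characters.

    repeated_chars is a hypothetical name used for illustration only.
    """
    # groupby yields (char, iterator over that run); count the run's length.
    return [char for char, group in groupby(s)
            if sum(1 for _ in group) >= n]

print(repeated_chars('blaaahhhx'))  # ['a', 'h']
```

A character appears once per qualifying run, so a string with two separate runs of the same character lists it twice.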
Unfortunately, the re solution is more efficient here:
>>> from timeit import timeit
>>> import random
>>> from itertools import groupby
>>> import re
>>> import string
>>> def consecutive_groupby(string):
...     three_or_more = (char for char, group in groupby(string)
...                      if sum(1 for _ in group) >= 3)
...     return any(three_or_more)
...
>>> def consecutive_re(string):
...     return re.search(r'(.)\1\1', string) is not None
...
>>> # worst-case: random data with no consecutive strings
...
>>> test_string = ''.join([random.choice(string.ascii_letters) for _ in range(1000)])
>>> consecutive_re(test_string), consecutive_groupby(test_string)
(False, False)
>>> timeit('consecutive(s)', 'from __main__ import test_string as s, consecutive_re as consecutive', number=10000)
0.19730806350708008
>>> timeit('consecutive(s)', 'from __main__ import test_string as s, consecutive_groupby as consecutive', number=10000)
4.633949041366577
>>> # insert repeated characters
...
>>> test_string_with_repeat = test_string[:100] + 'aaa' + test_string[100:]
>>> consecutive_re(test_string_with_repeat), consecutive_groupby(test_string_with_repeat)
(True, True)
>>> timeit('consecutive(s)', 'from __main__ import test_string_with_repeat as s, consecutive_re as consecutive', number=10000)
0.03344106674194336
>>> timeit('consecutive(s)', 'from __main__ import test_string_with_repeat as s, consecutive_groupby as consecutive', number=10000)
0.4827418327331543
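The regular expression approach given by Avinash is the clear winner here, which goes to show that you should always measure the alternatives.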
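Since the question is about filtering a very large number of strings, one variation that may be worth timing is compiling the pattern once and reusing its bound search method. This is only a sketch: the names _TRIPLE, _search and without_triples are invented here, and because the re module already caches compiled patterns, any gain may be small and should be measured on real data.

```python
import re

# Compile once; _TRIPLE and _search are hypothetical names for this sketch.
_TRIPLE = re.compile(r'(.)\1\1')
_search = _TRIPLE.search

def without_triples(strings):
    """Yield only the strings that have no run of 3+ identical characters."""
    for s in strings:
        if _search(s) is None:
            yield s

print(list(without_triples(['blah', 'blaaah', 'aab', 'xxxx'])))  # ['blah', 'aab']
```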
You can define a named capture group and then search for repetitions of it:
import re
s = 'blaaah'
# (?P<g>.) names the captured character g; (?P=g){2} needs two more copies of it
p = r'(?P<g>.)(?P=g){2}'
m = re.search(p, s)
print(m.group(0))
Result:
aaa