Regex for combining length, inclusion and exclusion?

Question

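Searching SO for [regex] gave me 249'446 hits, and searching for [regex] inclusion exclusion gave me 47 hits, but I guess none of the latter (and maybe only some of the former?) fit my case.

I am also aware of, for example, this regex page https://www.regular-expressions.info/refquick.html, but I suspect there is a regex concept I am not yet familiar with. I would be very grateful for hints.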

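Here is a minimal example of what I am trying to do with a given list of strings.

Find all items which:

have a fixed, defined number of characters, i.e. length
must contain all characters from a certain list (no matter at which position or how many times)
must not contain any character from a certain list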

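Constructs like [ei^no]{4}, ((?![no])[ei]){4} and many other, more complicated attempts did not give the desired result.

Hence, I currently implement this as a 3-step process: checking the length, doing a search and doing a match. This looks cumbersome and inefficient to me.

Is there a more efficient way to do this?

Script: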

import re

items = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve']

count          = 4
mustContain    = 'ei'   # all of these characters at least once
mustNotContain = 'no'   # none of those chars

hits1 = []
for item in items:
    if len(item)==count:
        hits1.append(item)
print("Hits1:",hits1)

hits2 = []
for hit in hits1:
    # note: this class only checks for at least one of the characters, not all of them
    regex = '[{}]'.format(mustContain)
    if re.search(regex,hit):
        hits2.append(hit)
print("Hits2:", hits2)

hits3 = []
for hit in hits2:
    regex = '[{}]'.format(mustNotContain)
    # keep the item only if none of the excluded characters occurs anywhere
    if not re.search(regex,hit):
        hits3.append(hit)
print("Hits3:", hits3)

Result:

Hits1: ['four', 'five', 'nine']
Hits2: ['five', 'nine']
Hits3: ['five']

Answer 1

If you are interested in a regex approach, you can create a dynamic pattern that looks like this:

^(?=.{4}$)(?![^no\n]*[no])(?=[^e\n]*e)[^i\n]*i.*$

Explanation

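^ Start of string
(?=.{4}$) Assert exactly 4 characters
(?![^no\n]*[no]) Assert that no n or o occurs to the right, using a leading negated character class
(?=[^e\n]*e) Assert an e character to the right
[^i\n]*i Match any characters except i, then match i
.* Match the rest of the line
$ End of string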
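To see what each part contributes, the pattern can be split into its pieces and every piece tested on its own. This is only an illustrative sketch: the `checks` dictionary and the `report` helper are names made up for this example, not part of the answer's pattern.

```python
import re

# The individual pieces of the pattern above. re.match anchors each one
# at the start of the string, so the leading ^ is not needed here.
checks = {
    "length": r"(?=.{4}$)",          # exactly 4 characters
    "no n/o": r"(?![^no\n]*[no])",   # no 'n' or 'o' anywhere
    "has e":  r"(?=[^e\n]*e)",       # at least one 'e'
    "has i":  r"[^i\n]*i",           # at least one 'i'
}

def report(word):
    """Return which of the individual checks pass for a word."""
    return {name: bool(re.match(pattern, word)) for name, pattern in checks.items()}

for word in ["five", "nine", "four", "tree"]:
    print(word, report(word))
```

Only a word that passes all four checks, such as five, is matched by the combined pattern.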

See a regex demo and a Python demo.

Example

import re

items = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve', 'tree']
hits = [item for item in items if re.match(r"(?=.{4}$)(?![^no\n]*[no])(?=[^e\n]*e)[^i\n]*i.*$", item)]

print(hits)

Output

['five']

A variant using all and a list comprehension:

items = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve', 'tree']

count = 4
mustContain = ["e", "i"]  # all of these characters at least once
mustNotContain = ["n", "o"]  # none of those chars

hits = [
    item for item in items if
    len(item) == count and
    all([c in item for c in mustContain]) and
    all([c not in item for c in mustNotContain])
]
print(hits)

Output

['five']

See a Python demo.
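As a possible further variant (a sketch, not something the answer itself proposes), the same three conditions can be written with set operations instead of all. The names `must_contain` and `must_not_contain` are assumptions chosen for this example.

```python
items = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
         'nine', 'ten', 'eleven', 'twelve', 'tree']

count = 4
must_contain = set("ei")      # all of these characters at least once
must_not_contain = set("no")  # none of these characters

hits = [
    item for item in items
    if len(item) == count
    and must_contain <= set(item)          # every required character is present
    and must_not_contain.isdisjoint(item)  # no excluded character is present
]
print(hits)
```

With the list above this should again leave only five, since tree lacks an i and nine contains an n.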

Answer 2

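Apparently, the "trick" I was missing is the positive lookahead (?=regex). I think the regex in @Thefourthbird's solution can be shortened, unless I am overlooking something and somebody proves me wrong. The regex for the included characters can be generated dynamically.

The regex for the original minimal example of the question would be: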

^(?=.{4}$)(?!.*[no])(?=.*e)(?=.*i)
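For comparison, here is a small sketch of how the two constructs tried in the question behave next to the lookahead version. The test words eiei and e^no are made up for illustration only.

```python
import re

words = ["five", "nine", "eiei", "e^no"]

patterns = [
    # Inside a character class, ^ only negates when it comes first; here it
    # is a literal ^, so this accepts 4 characters taken from e, i, ^, n, o.
    r"[ei^no]{4}",
    # Four characters, each of which must itself be e or i.
    r"((?![no])[ei]){4}",
    # Lookaheads: every condition is checked independently from the start.
    r"^(?=.{4}$)(?!.*[no])(?=.*e)(?=.*i)",
]

results = {p: [w for w in words if re.match(p, w)] for p in patterns}
for pattern, matched in results.items():
    print(pattern, matched)
```

The first two patterns describe which characters each position may hold, whereas the lookaheads describe properties of the whole string, which is what the question asks for.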

Script: (dynamically generated regex)

import re

items = ['one', 'two', 'three', 'four', 'five', 'six', 
         'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve', 
         'tree', 'mean', 'mine', 'fine', 'dime', 'eire']

count          = 4
mustContain    = 'ei'   # all of these characters at least once
mustNotContain = 'no'   # none of those chars

hits = []
regex1 = '^(?=.{' + str(count) + '}$)'                                # limit number of chars
regex2 = '(?!.*[' + mustNotContain + '])' if mustNotContain else ''   # excluded chars
regex3 = ''.join(['(?=.*{})'.format(c) for c in mustContain])         # included chars
regex  = regex1 + regex2 + regex3

for item in items:
    if re.match(regex,item,re.IGNORECASE):
        hits.append(item)
print("Hits:", hits)

Result:

Hits: ['five', 'dime', 'eire']
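One caveat with building the pattern by string concatenation: characters such as . or ^ in mustContain or mustNotContain would be interpreted as regex syntax. A minimal sketch of a more defensive version, assuming Python 3.7 or later, could escape the characters with re.escape and compile the pattern once. The function name `build_pattern` is hypothetical.

```python
import re

def build_pattern(count, must_contain, must_not_contain):
    """Build a compiled lookahead-only pattern (hypothetical helper)."""
    parts = ['^(?=.{' + str(count) + '}$)']                  # exact length
    if must_not_contain:
        # re.escape keeps characters like ^, ] or - literal inside the class
        parts.append('(?!.*[' + re.escape(must_not_contain) + '])')
    # one positive lookahead per required character
    parts.extend('(?=.*' + re.escape(c) + ')' for c in must_contain)
    return re.compile(''.join(parts), re.IGNORECASE)

items = ['one', 'two', 'three', 'four', 'five', 'six',
         'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve',
         'tree', 'mean', 'mine', 'fine', 'dime', 'eire']

pattern = build_pattern(4, 'ei', 'no')
hits = [item for item in items if pattern.match(item)]
print("Hits:", hits)
```

Without the escaping, a required character of . would be satisfied by any character at all, so the check would silently pass for every string of the right length.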
