Python regex A|B|C matches C even though B should match

Question

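I have been stuck on this problem for hours and I am out of ideas. Essentially, I have a regex made of alternatives of the form A|B|C, and for whatever reason C matches instead of B, even though I expected the individual alternatives to be tested from left to right and the search to stop in a non-greedy fashion (i.e. once one alternative has matched, the others are not tested any more).

Here is my code: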

import re

text = 'Patients with end stage heart failure fall into stage D of the ABCD classification of the American College of Cardiology (ACC)/American Heart Association (AHA), and class III–IV of the New York Heart Association (NYHA) functional classification; they are characterised by advanced structural heart disease and pronounced symptoms of heart failure at rest or upon minimal physical exertion, despite maximal medical treatment according to current guidelines.'
expansion = "American Heart Association"
# Three alternatives: the exact string, the exact string after a non-word
# character, and a loose "first word ... last word" wildcard.
re_exp = re.compile(expansion + "|" + r"(?<=\W)" + expansion + "|"
                    + expansion.split()[0] + r"[-\s].*?\s*?" + expansion.split()[-1])

m = re_exp.search(text)
print(m.group(0))

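I want the regex to find the "expansion" string. In my dataset, the text sometimes slightly edits the expansion string, for example by adding articles or prepositions such as "for" or "the" between the main nouns. That is why I first try to match the string as is, then try to match it when it comes right after any non-word character (e.g. a parenthesis or, as in the example above, a whole bunch of stuff where the space was omitted), and finally fall back to a full wildcard that looks for the first and last words of the string with anything in between.

Either way, for the example above I would expect the following output: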

American Heart Association

But what I get is

American College of Cardiology (ACC)/American Heart Association

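which is the match of the last alternative.

If I remove the last alternative, or just call re.findall(r"(?<=\W)"+ expansion, text), I get the output I want, which means that this alternative does match correctly.

What gives?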

Answer 1

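So re.findall(r"(?<=\W)"+ expansion, text) works because the match is preceded by a non-word character (written \W), namely "/". Your full regex, however, also matches "American [whatever random stuff here] Heart Association". This means it matches "American College of Cardiology (ACC)/American Heart Association" before it ever reaches the inner string "American Heart Association". For example, if you removed the first "American" from the text, your regex would give you the match you are looking for.

You need to make the regex more restrictive to rule out cases like this.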
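Besides tightening the pattern, another way to get the priority order the question expects is to stop combining the alternatives into one regex and search with each pattern in turn. The following is only a minimal sketch under that assumption; `find_expansion` is a hypothetical helper name, not something from the question.

```python
import re

def find_expansion(text, expansion):
    """Return the match of the highest-priority pattern, or None.

    Hypothetical helper: each pattern is searched over the whole text
    before the next one is tried, so a loose match can never pre-empt
    an exact one that occurs later in the text.
    """
    words = expansion.split()
    patterns = [
        # 1. The expansion exactly as given.
        re.escape(expansion),
        # 2. Loose fallback: first word ... last word, the same shape as
        #    the third alternative in the question.
        re.escape(words[0]) + r"[-\s].*?\s*?" + re.escape(words[-1]),
    ]
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(0)
    return None

text = ("classification of the American College of Cardiology (ACC)/"
        "American Heart Association (AHA), and class III-IV")
print(find_expansion(text, "American Heart Association"))
```

The loose pattern is only reached when the exact string does not occur anywhere in the text, which is the behaviour the question assumed the `|` operator would provide.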

Answer 2

The regex looks like this:

American Heart Association|(?<=\W)American Heart Association|American[-\s].*?\s*?Association

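The first 2 alternatives match the same text; the only difference is that the second one is preceded by a positive lookbehind.

You can omit the second alternative: the first one, which has no assertion, already matches everything the second one can, and if the first does not match, the second will not match either.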
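To see why the second alternative is redundant, the small check below compares the match spans of the plain pattern with those of the lookbehind variant. The sample strings and the `spans` helper are made up for illustration.

```python
import re

expansion = "American Heart Association"
samples = [
    "American Heart Association guidelines",   # expansion at the very start
    "(ACC)/American Heart Association (AHA)",  # expansion right after "/"
]

def spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

for sample in samples:
    plain = spans(expansion, sample)
    guarded = spans(r"(?<=\W)" + expansion, sample)
    # Every lookbehind match is also a plain match.
    print(plain, guarded, set(guarded) <= set(plain))
```

In the first sample the lookbehind variant finds nothing, because no character precedes the expansion; in the second both patterns find the same span.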

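Since the pattern is matched from left to right through the text, the engine first encounters the first occurrence of American, where the first and second alternatives cannot match American College of Cardiology.

The third alternative, however, can match there, and thanks to .*? it can extend up to the first occurrence of Association.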
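The effect described above can be reproduced on a shortened version of the text. This is a small sketch; the variable names are illustrative only.

```python
import re

text = "the American College of Cardiology (ACC)/American Heart Association (AHA)"

exact = r"American Heart Association"
loose = r"American[-\s].*?\s*?Association"

exact_match = re.search(exact, text)
loose_match = re.search(loose, text)
combined_match = re.search(exact + "|" + loose, text)

# The loose alternative can start at the first "American", which comes
# earlier in the text than the only place where the exact one can start.
print(exact_match.start(), loose_match.start())

# The combined pattern therefore returns the loose match: the order of the
# alternatives only matters among matches that start at the same position.
print(combined_match.group(0))
```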

For example, you could use a negated character class to exclude the characters that must not be matched:

\bAmerican\b[^/,.]*\bAssociation\b
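As a rough sketch, the negated character class pattern can be built from the expansion string and run against an excerpt of the question's text. The excluded characters `/`, `,` and `.` are taken from the pattern above; whether they suit a given dataset is an assumption that needs checking.

```python
import re

text = ("stage D of the ABCD classification of the American College of "
        "Cardiology (ACC)/American Heart Association (AHA), and class III-IV "
        "of the New York Heart Association (NYHA) functional classification")
expansion = "American Heart Association"
first, last = expansion.split()[0], expansion.split()[-1]

# Between the first and the last word, allow anything except "/", "," and ".".
pattern = re.compile(
    r"\b" + re.escape(first) + r"\b[^/,.]*\b" + re.escape(last) + r"\b"
)

match = pattern.search(text)
print(match.group(0))  # American Heart Association
```

Note that `[^/,.]*` is greedy, so if several occurrences of the last word appear before the next excluded character, the match extends to the last of them.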


Or you could use a tempered greedy token to disallow specific words between the first and the last part:

\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b
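The second pattern can be tried the same way. In this sketch the first sample is a shortened version of the question's text and the second is a made-up variant with an extra word inserted into the expansion.

```python
import re

# Each "." is consumed only if the words "American" or "Association" do not
# start at that position, so a match cannot run across another "American".
pattern = re.compile(
    r"\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b"
)

samples = [
    "the American College of Cardiology (ACC)/American Heart Association (AHA)",
    "the American College of Cardiology and the American National Heart Association",
]

matches = [pattern.search(sample).group(0) for sample in samples]
for found in matches:
    print(found)
```

Both searches skip the first "American", because the text that follows it reaches a second "American" before any "Heart Association".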

