Python: Regular Expression not working properly

I'm using the following regular expression; it's supposed to find the string 'U.S.A.', but it only finds 'A.'. Does anyone know what's wrong?

#INPUT
import re

text = 'That U.S.A. poster-print costs $12.40...'

print(re.findall(r'([A-Z]\.)+', text))

#OUTPUT
['A.']

Expected output:

['U.S.A.']

I'm following the NLTK Book, chapter 3.7 here; it has a set of regular expressions, but it just doesn't work. I've tried it in both Python 2.7 and 3.4.

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Doesn't nltk.regexp_tokenize() work the same way as re.findall()? I guess my Python isn't recognizing the regular expressions as expected. The regex listed above outputs this:

[('', '', ''),
 ('A.', '', ''),
 ('', '-print', ''),
 ('', '', ''),
 ('', '', '.40'),
 ('', '', '')]
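Those tuples are simply how re.findall behaves when the pattern contains more than one capturing group: it returns one tuple of group captures per overall match, and a group that played no part in a given match comes back as an empty string. A minimal sketch of that behavior:

```python
import re

# With two capturing groups, findall returns (group1, group2) tuples;
# the group that did not participate in a match shows up as ''.
result = re.findall(r'(a)|(b)', 'ab')
print(result)  # [('a', ''), ('', 'b')]
```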

Remove the trailing +, or put it inside the group:

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> re.findall(r'([A-Z]\.)+', text)
['A.']              # wrong
>>> re.findall(r'([A-Z]\.)', text)
['U.', 'S.', 'A.']  # without '+'
>>> re.findall(r'((?:[A-Z]\.)+)', text)
['U.S.A.']          # with '+' inside the group

The first part of the text that the regex matches is "U.S.A." because ([A-Z]\.)+ matches the first group (the part inside the parentheses) three times. However, each group can only return one match, so Python picks that group's last match.
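This "last capture wins" behavior is easiest to see on a single match object, where the full match and the group's capture can be inspected side by side (a small sketch):

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

m = re.search(r'([A-Z]\.)+', text)
print(m.group(0))  # 'U.S.A.' -- the full match spans all three repetitions
print(m.group(1))  # 'A.'     -- group 1 keeps only its last capture
```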

If you instead change the regex to include the "+" inside the group, the group will only match once and the full match will be returned. For example (([A-Z]\.)+) or ((?:[A-Z]\.)+).

If you want three separate results instead, just get rid of the "+" in the regex, and it will match just one letter and one dot each time.
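If you want both the whole abbreviation and its pieces, one option (my suggestion, not from the answers above) is re.finditer with a non-capturing group, then splitting each match afterwards:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

for m in re.finditer(r'(?:[A-Z]\.)+', text):
    whole = m.group(0)                     # the full abbreviation, e.g. 'U.S.A.'
    parts = re.findall(r'[A-Z]\.', whole)  # its pieces, e.g. ['U.', 'S.', 'A.']
    print(whole, parts)
```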

This probably has to do with how regexes used to be compiled with nltk.internals.compile_regexp_to_noncapturing(), which was removed in v3.1 (see here):

>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x)               # set flag to allow verbose regexps
...               ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
...               | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
...               | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
...               | [+/\-@&*]        # special characters with meanings
...             '''
>>> 
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

But it doesn't work in NLTK v3.1:

>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x)               # set flag to allow verbose regexps
...               ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
...               | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
...               | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
...               | [+/\-@&*]        # special characters with meanings
...             '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]

With a slight modification to how you define your regex groups, you can get the same pattern to work in NLTK v3.1, using this regex:

pattern = r"""(?x)                   # set flag to allow verbose regexps
              (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
              |\d+(?:\.\d+)?%?       # numbers, incl. currency and percentages
              |\w+(?:[-']\w+)*       # words w/ optional internal hyphens/apostrophe
              |(?:[+/\-@&*])         # special characters with meanings
            """

In code:

>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x)                   # set flag to allow verbose regexps
... (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%?       # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)*       # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*])         # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

Without NLTK, using Python's re module, we can see that the old regex pattern is not supported natively:

>>> pattern1 = r"""(?x)              # set flag to allow verbose regexps
...               ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
...               |\$?\d+(\.\d+)?%?  # numbers, incl. currency and percentages
...               |\w+([-']\w+)*     # words w/ optional internal hyphens/apostrophe
...               |[+/\-@&*]         # special characters with meanings
...               |\S\w*             # any sequence of word characters
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
>>> pattern2 = r"""(?x)                   # set flag to allow verbose regexps
...                       (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
...                       |\d+(?:\.\d+)?%?       # numbers, incl. currency and percentages
...                       |\w+(?:[-']\w+)*       # words w/ optional internal hyphens/apostrophe
...                       |(?:[+/\-@&*])         # special characters with meanings
...                     """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
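A quick sanity check for any such pattern is re.compile(...).groups, which counts the capturing groups; zero means findall will return full matches rather than tuples. A sketch, using simplified versions of the two patterns above:

```python
import re

capturing    = r"([A-Z]\.)+"    # like pattern1: findall yields group captures
noncapturing = r"(?:[A-Z]\.)+"  # like pattern2: findall yields full matches

print(re.compile(capturing).groups)     # 1
print(re.compile(noncapturing).groups)  # 0
```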

Note: the change in how NLTK's RegexpTokenizer compiles regexes also makes the examples on NLTK's Regular Expression Tokenizer obsolete.

The problem is the "capturing group", i.e. the parentheses, which have an unexpected effect on the result of findall(): when a capturing group is used multiple times in a single match, the regexp engine loses track and strange things happen. Specifically: the regexp correctly matches the entire U.S.A., but findall drops it on the floor and only returns the group's last capture.

As this answer says, the re module doesn't support repeated capturing groups, but you could install the alternative regex module, which handles this correctly. (However, that won't help you if you want to pass your regexp to nltk.tokenize.regexp.)

In any case, to match U.S.A. properly, use this: r'(?:[A-Z]\.)+'.

>>> re.findall(r'(?:[A-Z]\.)+', text)
['U.S.A.']

You can apply the same fix to all the repeated patterns in the NLTK regexes, and everything will work correctly. As @alvas suggested, NLTK used to make this substitution behind the scenes, but that feature was recently dropped and replaced, in November, with a warning in the documentation of the tokenizer. The book is clearly out of date; @alvas filed a bug report, but it hasn't been acted on yet...
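Putting it all together: with every group in the book's pattern rewritten as non-capturing, plain re.findall produces the tokenisation the NLTK book promises. A sketch based on the pattern quoted in the question:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Every group is non-capturing (?:...), so findall returns the full matches.
pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [][.,;"'?():-_`]        # these are separate tokens; includes ], [
'''
print(re.findall(pattern, text))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```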