Python: Regular Expression not working properly
I'm using the following regular expression, which is supposed to find the string 'U.S.A.', but it only gets 'A.'. Can anyone tell me what's going wrong?
#INPUT
import re
text = 'That U.S.A. poster-print costs $12.40...'
print(re.findall(r'([A-Z]\.)+', text))
#OUTPUT
['A.']
Expected output:
['U.S.A.']
I'm following the NLTK Book, chapter 3.7 here, which has this set of regular expressions, but it just doesn't work. I've tried it in both Python 2.7 and 3.4.
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
nltk.regexp_tokenize() works the same way as re.findall(), so I guess my Python here isn't recognizing the regex as expected. The regex listed above outputs this:
[('', '', ''),
('A.', '', ''),
('', '-print', ''),
('', '', ''),
('', '', '.40'),
('', '', '')]
Remove the trailing +, or put it inside the group:
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> re.findall(r'([A-Z]\.)+', text)
['A.'] # wrong
>>> re.findall(r'([A-Z]\.)', text)
['U.', 'S.', 'A.'] # without '+'
>>> re.findall(r'((?:[A-Z]\.)+)', text)
['U.S.A.'] # with '+' inside the group
The first part of the text that the regexp matches is "U.S.A.", because ([A-Z]\.)+ matches the first group (the part within the parentheses) three times. However, only one match can be returned per group, so Python picks the last match for that group.
If you instead change the regular expression to include the "+" inside the group, then the group will only match once and the full match will be returned, e.g. (([A-Z]\.)+) or ((?:[A-Z]\.)+).
If you want three separate results instead, just get rid of the "+" sign in the regular expression, and it will match exactly one letter and one dot each time.
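To see concretely what "the full match will be returned" means, compare the nested variant, which keeps both captures (with more than one capturing group, findall returns one tuple per match, one slot per group):
>>> re.findall(r'(([A-Z]\.)+)', text)
[('U.S.A.', 'A.')]  # outer group: the full run; inner group: only its last repetition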
This probably has to do with how regexes were previously compiled using nltk.internals.compile_regexp_to_noncapturing(), which was abolished in v3.1 (see here):
>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-@&*] # special characters with meanings
... '''
>>>
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
But it doesn't work in NLTK v3.1:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-@&*] # special characters with meanings
... '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
With a slight modification of how you define your regex groups, you can get the same pattern to work in NLTK v3.1, using this regex:
pattern = r"""(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
|\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
|(?:[+/\-@&*]) # special characters with meanings
"""
In code:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*]) # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
Without NLTK, using Python's re module, we see that the old regex pattern is not supported natively:
>>> import re
>>> pattern1 = r"""(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... |\w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... |[+/\-@&*] # special characters with meanings
... |\S\w* # any sequence of word characters
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
>>> pattern2 = r"""(?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*]) # special characters with meanings
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
Note: the change in how NLTK's RegexpTokenizer compiles regexes also makes the examples on NLTK's Regular Expression Tokenizer obsolete.
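As a rough illustration of the kind of rewriting the removed compile_regexp_to_noncapturing() performed, you could convert the groups yourself. This is only a naive sketch (the helper name to_noncapturing is just for illustration, and it ignores corner cases such as escaped backslashes before a parenthesis or parentheses inside character classes), not NLTK's actual implementation:
import re

def to_noncapturing(pattern):
    # Illustrative helper, not NLTK's real code: rewrite every unescaped
    # '(' that doesn't already start an extension like '(?:' or '(?x)'
    # into a non-capturing '(?:'.
    return re.sub(r'(?<!\\)\((?!\?)', '(?:', pattern)

print(to_noncapturing(r'([A-Z]\.)+'))  # (?:[A-Z]\.)+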
问题是 "capturing group",也就是括号,它对 findall()
的结果有意想不到的影响:当一个捕获组在一场比赛中被多次使用时,正则表达式引擎会丢失跟踪和奇怪的事情发生。具体来说:正则表达式正确匹配整个 U.S.A.
,但 findall
将其丢弃在地板上并且仅 returns 最后一个组捕获。
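Both halves of that claim are easy to verify with re.search on the sample text:
>>> import re
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> m = re.search(r'([A-Z]\.)+', text)
>>> m.group(0)  # the full match is found correctly
'U.S.A.'
>>> m.group(1)  # the group capture keeps only the last repetition
'A.'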
As this answer says, the re module doesn't support repeated capturing groups, but you could install the alternative regex module, which handles this correctly. (However, that would be no help if you want to pass your regexp to nltk.tokenize.regexp.)
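For illustration, this is roughly what the third-party regex module gives you, assuming it is installed (e.g. pip install regex); its match objects expose every capture of a repeated group via captures():
>>> import regex  # third-party package, not the stdlib re
>>> m = regex.search(r'([A-Z]\.)+', text)
>>> m.captures(1)
['U.', 'S.', 'A.']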
Anyway, to match U.S.A. properly, use a non-capturing group: r'(?:[A-Z]\.)+'.
>>> re.findall(r'(?:[A-Z]\.)+', text)
['U.S.A.']
You can apply the same fix to all the repeated patterns in the NLTK regexp, and everything will work correctly. As @alvas suggested, NLTK used to make this substitution behind the scenes, but that feature was recently dropped and replaced with a warning in the documentation of the tokenizer. The book is clearly out of date; @alvas filed a bug report back in November, but it hasn't been acted on yet...
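For completeness, here is roughly how the book's tokenizer pattern looks with the fix applied throughout (every repeated capturing group turned non-capturing):
>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
...     (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)* # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
...   | \.\.\. # ellipsis
...   | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']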