Why is "\p{L}" not working in this regex?

Question

OS: Windows 7. Jython 2.7.0 "final release".

for token in sorted_cased.keys():
    freq = sorted_cased[ token ]
    if freq > 1:
        print( 'token |%s| unicode? %s' % ( token, isinstance( token, unicode ), ) )
        if re.search( ur'\p{L}+', token ):
            print( '  # cased token |%s| freq %d' % ( token, freq, ))

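sorted_cased is a dictionary mapping each token to its frequency of occurrence. Here I'm trying to pick out the words (made up of Unicode letters only) that occur with frequency > 1. (Note: I was originally using re.match rather than re.search, but search should detect even a single \p{L} anywhere in the token.)

Sample output: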

token |Management| unicode? True
token |n| unicode? True
token |identifiés| unicode? True
token |décrites| unicode? True
token |agissant| unicode? True
token |tout| unicode? True
token |sociétés| unicode? True

None of them is recognised as containing a single \p{L}. I have tried all sorts of permutations: double quotes, adding flags=re.UNICODE, etc.

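Later: I have been asked to explain why this can't be classed as a duplicate of "How to implement \p{L} in python regex". It can, but... the answers to the other question don't draw attention to the need to use the REGEX MODULE (old version? very new version? Note that they are different) as opposed to the RE MODULE. In the interest of saving the hair follicles and sanity of future people who run into this problem, I request that the present paragraph be allowed to remain, even though the question has been "duped".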

Also, my attempt to install the PyPI regex module FAILED UNDER JYTHON (using pip). It is probably better to use java.util.regex.

Answer 1

If you have access to Java's java.util.regex, the best option is to use its built-in \p{L} class.

Python's re module (including under Jython) does not support \p{L} or the other Unicode category classes, nor does it support POSIX character classes.
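A quick way to see this is to probe the pattern directly. The sketch below is an illustration, not part of the original question; probe is a hypothetical helper name. As far as I know, Python 2.7's re treats the unknown escape \p as a literal p, so the pattern silently looks for the text "p{L}", while Python 3.6 and later reject the pattern with re.error. Either way, an accented word is never reported as a match.

```python
import re

def probe(pattern, text):
    """Return True/False for match/no match, or None if re rejects the pattern."""
    try:
        return re.search(pattern, text, re.UNICODE) is not None
    except re.error:
        # Newer Python versions refuse unknown escapes such as \p outright.
        return None

# Never True for an accented word: re has no notion of \p{L}.
print(probe(r'\p{L}+', u'd\xe9crites'))

# A class that re does know about matches as expected.
print(probe(r'\w+', u'd\xe9crites'))
```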

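Another alternative is to restrict the \w class, as in (?![\d_])\w, and use the UNICODE flag. From the re documentation: "If UNICODE is set, this [\w] will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database." This alternative has one flaw: it cannot be used inside a character class.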
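A minimal sketch of the lookahead approach, assuming tokens shaped like the ones in the question. LETTER and has_letter are names made up for this illustration. Non-ASCII characters are written as escapes so the snippet does not depend on the source file encoding.

```python
import re

# A word character that the lookahead has vetted as neither a digit nor "_".
LETTER = re.compile(r'(?![\d_])\w', re.UNICODE)

def has_letter(token):
    """True if the token contains at least one letter-like character."""
    return LETTER.search(token) is not None

for token in (u'Management', u'identifi\xe9s', u'2024', u'__'):
    print(u'%s -> %s' % (token, has_letter(token)))
```

Because the negative lookahead is a separate construct placed in front of \w, it cannot be dropped into a bracket expression such as [...-'], which is the flaw mentioned above.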

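Another idea is to use [^\W\d_] (with the re.U flag), which matches any character that is not a non-word character (\W), not a digit (\d), and not an underscore. In effect, it matches any Unicode letter.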
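The character-class variant can be sketched as follows. LETTER_RUN and letter_runs are hypothetical names chosen for this example, and the input string is invented to show that digits, underscores, and spaces all break a run.

```python
import re

# "Not a non-word character, not a digit, not an underscore": roughly, a letter.
LETTER_RUN = re.compile(r'[^\W\d_]+', re.UNICODE)

def letter_runs(text):
    """Return every maximal run of letters found in text."""
    return LETTER_RUN.findall(text)

# The underscore, the digit, and the space all act as separators here.
print(letter_runs(u'soci\xe9t\xe9s_2 d\xe9crites'))
```

Since this is an ordinary character class, it can be quantified or combined with other pattern pieces directly. It should be read as an approximation of \p{L} rather than an exact equivalent: as far as I know, \w also admits a few numeric characters that \d does not cover.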

python

regex

unicode

jython