Regex for prohibiting surrogates in python

Question

I am writing a regex to match the following condition:

shall not specify a character whose short identifier is less than 00A0 other than 0024 ( $ ), 0040 ( @ ), or 0060 ( ` ), nor one in the range D800 through DFFF inclusive.

I wrote the following regex:

PATTERN = r'([\u0024\u0040\u0060]|(?![\u0000-\u00A0])|(?![\u8000-\udfff]))'

and I use it in a search like this:

import re

text = "some str"  # the string to check (renamed so it does not shadow the built-in str)
search = re.search(PATTERN, text, re.UNICODE)

What confuses me is that the range \u8000-\udfff contains the surrogates (D800 through DFFF).

DEMO.

Running this regex in my script nevertheless seems to work fine. Is using a regex the right way to filter out such characters?

Answer 1

After some digging, I found this answer:

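In short: characters in that range are simply not a thing in wide Unicode strings, at least in Python 3. A str there is a sequence of code points, so a character above U+FFFF is never stored as a surrogate pair. The regex executes because the input does not contain such characters in the first place; Python seems to accept the pointless range and move on. regex101, on the other hand, flags it as an error, even though the script runs fine.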
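A small check of this behaviour, offered as a sketch rather than a definitive account: on Python 3.3 and later, a character outside the BMP is a single item in a str, so a surrogate class never matches it. A lone surrogate can still be written into a str with an escape, and the class does match it then, but such a string cannot be encoded as UTF-8. The name SURROGATES is only illustrative.

```python
import re

# Illustrative name: matches any code point in the surrogate block.
SURROGATES = re.compile(r'[\ud800-\udfff]')

# A non-BMP character is one code point, not a surrogate pair.
astral = '\U0001F600'
astral_length = len(astral)
astral_has_surrogate = SURROGATES.search(astral) is not None

# A lone surrogate can exist in a str, and the class matches it.
lone = '\ud800'
lone_has_surrogate = SURROGATES.search(lone) is not None

# A lone surrogate cannot be encoded as UTF-8.
try:
    lone.encode('utf-8')
    lone_encodable = True
except UnicodeEncodeError:
    lone_encodable = False
```

So the surrogate range is not rejected by Python's re module; it just has nothing to match in well-formed text.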

To answer your question: yes, but also no. That part will simply not do anything. I suggest removing the \u8000-\udfff part.
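One further caveat about the pattern in the question: every lookahead alternative is zero-width, and at any position at least one alternative succeeds, so re.search returns a match for every input, which may be why the script appears to work. A minimal sketch of an alternative, assuming the goal is to reject strings that contain a prohibited character, is to search for the prohibited characters directly. FORBIDDEN and is_allowed are hypothetical names; the surrogate range is kept only to mirror the rule literally, and 00A0 itself is treated as allowed because the rule says "less than 00A0".

```python
import re

# The pattern from the question, written as a raw string.
ORIGINAL = r'([\u0024\u0040\u0060]|(?![\u0000-\u00A0])|(?![\u8000-\udfff]))'

# It matches every sample here, including the empty string and a NUL character.
always_matches = all(
    re.search(ORIGINAL, sample) is not None
    for sample in ['', 'abc', '\x00', '\u00e9']
)

# Hypothetical alternative: a class of exactly the prohibited code points,
# i.e. everything below 00A0 except $ (0024), @ (0040) and ` (0060),
# plus D800 through DFFF.
FORBIDDEN = re.compile(
    r'[\u0000-\u0023\u0025-\u003f\u0041-\u005f\u0061-\u009f\ud800-\udfff]'
)


def is_allowed(text):
    """Return True if text contains no prohibited character."""
    return FORBIDDEN.search(text) is None
```

With this sketch, is_allowed('$@`') and is_allowed('\u00e9') return True, while is_allowed('abc') returns False because plain ASCII letters are below 00A0 and are not among the three exceptions.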

python

regex

regex-lookarounds

python-3.7