Regex match "_" char only if it isn't in a username

Question

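Context and explanation

I'm building a Telegram bot, and I want to add the escape character "\" before every "_" character that is not part of a username (a word starting with "@", like "@username_"). This is to prevent Markdown errors: in Telegram, the "_" character is used to make a string italic.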

So, for example, given this string:

"hello i like this char _ write me lol_ @myusername_"

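I want to match only the first two "_" characters and not the third.

Question

What is the correct way to do this with a regex pattern?

Expected conditions and matches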

Condition	Match
`"_"` alone: (`"_"`)	YES
`"_"` in a word without `"@"`: (`"lol_"`)	YES
`"_"` in a word starting with `"@"`: (`"@username_"`)	NO
`"_"` in a word containing `"@"` after the `"@"`: (`"lol@username_"`)	NO
`"_"` in a word containing `"@"` before the `"@"`: (`"lol_@username"`)	YES
`"_"` in a word like: (`"lol_@username_"`)	first: YES second: NO

What I tried

This is what I have so far, but it doesn't work properly:

"(?=[^@]+)(?:\s[^\s]*(_)[^\s]*\s)"

Edit

I also want the first "_" character to be matched in this string: "lol_@username_"

Answer 1

I assume you only care about @ at the start of a word. You can use re.sub together with str.replace and the pattern (?:\s|^)[^@]\S*\b to match the words that fit your spec:

import re

s = "hello i like this char _ write me lol_ @myusername_ asd@_a @_asdf"
s = re.sub(r"(?:\s|^)[^@]\S*\b", lambda x: x.group().replace("_", r"\_"), s)
print(s) # => hello i like this char \_ write me lol\_ @myusername_ asd@\_a @_asdf

If you care about @ appearing anywhere in the word, try (?:\s|^)[^@\s]+\b:

s = "he_llo i like this char _ write me lol_ @myusername_ asd@_a @_asdf"
s = re.sub(r"(?:\s|^)[^@\s]+\b", lambda x: x.group().replace("_", r"\_"), s)
print(s) # => he\_llo i like this char \_ write me lol\_ @myusername_ asd@_a @_asdf

Based on the OP's comments, it sounds like the latest spec is to escape _ everywhere except after an @ within a word:

>>> s = "he_llo i lol_@username_ _ write me lol_ @myusername_ asd@_a @_asdf"
>>> re.sub(r"(?:\s|^)[^@]+@", lambda x: x.group().replace("_", r"\_"), s)
'he\\_llo i lol\\_@username_ \\_ write me lol\\_ @myusername_ asd@_a @_asdf'
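One thing to watch with the last pattern: it only matches text that is followed by an @ later in the string, so an underscore after the final @-word is left unescaped. Below is a minimal sketch of one way around that, assuming words are whitespace-delimited and that everything from the first @ in a word onward should stay untouched. The helper name escape_outside_mentions is made up for illustration and is not part of the original answer.

```python
import re

def escape_outside_mentions(text):
    # Hypothetical helper: handle each whitespace-delimited word on its own.
    def fix(match):
        # Split the word at its first "@"; only the part before it is escaped.
        head, at, tail = match.group().partition("@")
        return head.replace("_", r"\_") + at + tail
    return re.sub(r"\S+", fix, text)

s = "he_llo i lol_@username_ _ write me lol_ @myusername_ asd@_a @_asdf trailing_"
print(escape_outside_mentions(s))
```

Because whitespace is never part of a match, the original spacing is preserved.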

Answer 2

Extract with the PyPI regex library:

import regex
string = "hello i like this char _ write me lol_ @myusername_"
print(regex.findall(r'(?<!\S)@\w+(*SKIP)(*F)|_', string))
# ['_', '_']

See the Python proof.

Explanation

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  @                        '@'
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount  possible))
--------------------------------------------------------------------------------
  (*SKIP)(*F)              skip the match, search from the failure location
--------------------------------------------------------------------------------
  |                        or
--------------------------------------------------------------------------------
  _                        a '_' char
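If installing the third-party regex module is not an option, the skip-and-fail idea can be approximated with the standard re module: match the username in one branch, capture the underscore in the other, and keep only the matches where the capture group participated. This is a sketch of that common idiom, not something taken from the answer itself.

```python
import re

string = "hello i like this char _ write me lol_ @myusername_"

# Left branch swallows a whole @username; right branch captures a bare "_".
pattern = re.compile(r'(?<!\S)@\w+|(_)')

# Keep only the matches where group 1 (the underscore) actually matched.
hits = [m.start(1) for m in pattern.finditer(string) if m.group(1)]
print(hits)                       # index of each "_" outside a username
print([string[i] for i in hits])  # ['_', '_']
```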

Remove with re:

import re
string = "hello i like this char _ write me lol_ @myusername_"
print(re.sub(r'(?<!\S)(@\w+)|_', r'\1', string))  # \1 keeps the username; a bare "_" becomes "" (Python 3.5+)
# hello i like this char  write me lol @myusername_

See the Python proof.

Replace with re:

import re
string = "hello i like this char _ write me lol_ @myusername_"
print(re.sub(r'(?<!\S)(@\w+)|_', lambda x: x.group(1) or "-", string))
# hello i like this char - write me lol- @myusername_

See another Python proof.
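Applied to the question's actual goal (escaping rather than removing or replacing with "-"), the same callback approach might look like the sketch below. It reuses the answer's pattern unchanged; only the replacement value differs.

```python
import re

string = "hello i like this char _ write me lol_ @myusername_"

# Group 1 holds a whole @username when that branch matched; return it as-is.
# Otherwise the match is a bare "_", which gets a backslash in front.
escaped = re.sub(r'(?<!\S)(@\w+)|_', lambda m: m.group(1) or r'\_', string)
print(escaped)
# hello i like this char \_ write me lol\_ @myusername_
```

Note that because of the (?<!\S) lookbehind, this pattern treats an @ as a username only at the start of a word, so the underscore in "lol@username_" would be escaped, unlike the corresponding row in the question's table.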

Answer 3

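You can match an @ followed by all non-whitespace characters, and use an alternation to capture the _ in a group. In the re.sub callback, check whether group 1 exists.

If it does, return the escaped group 1 value (which is just the underscore); otherwise return the match unchanged.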

@\S+|(_)

Regex demo

import re

strings = [
    "_",
    "lol_",
    "@username_",
    "lol@username_",
    "lol_@username",
    "lol_@username_"
]

for s in strings:
    result = re.sub(
        r"@\S+|(_)",
        lambda x: x.group(1).replace("_", r"\_") if x.group(1) else x.group(),
        s
    )
    print(result)

Output

\_
lol\_
@username_
lol@username_
lol\_@username
lol\_@username_

Answer 4

Based on @OlvinRoght's comment, with a slight modification, this should do the trick:

Regex

((?:^|\s)(?:[^@\s]*?))(_)((?:[^@\s]*?))(?=@|\s|$)

Code example

import re

text = '_hi hello i like this char _ write me lol_ _word something_ @myusername_ something_@username_'

regex = r"((?:^|\s)(?:[^@\s]*?))(_)((?:[^@\s]*?))(?=@|\s|$)"

# Leave the first and last capturing group as-is and replace the underscore with '\_'
subst = r"\1\\_\3"

print( re.sub(regex, subst, text) )

Expected output:

\_hi hello i like this char \_ write me lol\_ \_word something\_ @myusername_ something\_@username_

Demo

See it live

Note:

Although this works, @TheFourthBird's answer is faster (and, I think, more elegant).
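For anyone who wants to check the speed claim on their own machine, a rough timeit comparison could look like this. The function names are made up for illustration, the run count is arbitrary, and timings will vary by machine and Python version, so no numbers are shown here.

```python
import re
import timeit

text = ('_hi hello i like this char _ write me lol_ _word something_ '
        '@myusername_ something_@username_')

# Patterns copied from @TheFourthBird's answer and from this answer.
fourth_bird = re.compile(r"@\S+|(_)")
this_answer = re.compile(r"((?:^|\s)(?:[^@\s]*?))(_)((?:[^@\s]*?))(?=@|\s|$)")

def escape_fourth_bird(s):
    # Escape the captured "_"; leave @... matches untouched.
    return fourth_bird.sub(lambda m: r"\_" if m.group(1) else m.group(), s)

def escape_this_answer(s):
    # Keep groups 1 and 3, put a literal backslash before the underscore.
    return this_answer.sub(r"\1\\_\3", s)

# Sanity check that both approaches agree on this sample before timing them.
same = escape_fourth_bird(text) == escape_this_answer(text)
print("same output:", same)

for fn in (escape_fourth_bird, escape_this_answer):
    seconds = timeit.timeit(lambda: fn(text), number=2000)
    print(fn.__name__, round(seconds, 4))
```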

python

regex

python-3.x

regex-lookarounds

python-re