Regex match "_" char only if it isn't in a username

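Context and explanation

I'm making a Telegram bot, and I want to add the escape character "\" before every "_" character that isn't part of a username (a word starting with "@", such as "@username_"), to prevent Markdown errors (in Telegram, the "_" character is used to make a string italic).

So, for example, given this string: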

"hello i like this char _ write me lol_ @myusername_"

I want to match only the first two "_" characters and not the third.


Question

What is the correct way to do this with a regex pattern?


Expected conditions and matches

Condition                                                          Match
"_" alone: ("_")                                                   YES
"_" in a word without "@": ("lol_")                                YES
"_" in a word starting with "@": ("@username_")                    NO
"_" in a word containing "@", after the "@": ("lol@username_")     NO
"_" in a word containing "@", before the "@": ("lol_@username")    YES
"_" in a word like: ("lol_@username_")                             first: YES, second: NO

What I tried

So far I've got this, but it doesn't work properly:

"(?=[^@]+)(?:\s[^\s]*(_)[^\s]*\s)"
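
A quick way to see where this attempt goes wrong is to print the offsets it captures. The test string below is made up for illustration and is not from the original post. The pattern requires whitespace on both sides of the word and consumes it, so it can skip a "_" that directly follows another match while still capturing the "_" inside a username.

```python
import re

# The attempted pattern from the question, unchanged.
pattern = re.compile(r"(?=[^@]+)(?:\s[^\s]*(_)[^\s]*\s)")

# Illustrative input (an assumption, chosen to expose the problem).
text = "hello _ _ @user_ x"

# Start offsets of every "_" captured by group 1.
positions = [m.start(1) for m in pattern.finditer(text)]
print(positions)  # [6, 15]
```

Here the standalone "_" at offset 8 is missed, while offset 15, inside "@user_", is captured.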

Edit

I also want the first "_" character to be matched in this string: "lol_@username_"

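I'm assuming you only care about @ at the start of a word. You can use re.sub with a callback that calls str.replace, and the pattern (?:\s|^)[^@]\S*\b to match the words that fit your spec: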

import re

s = "hello i like this char _ write me lol_ @myusername_ asd@_a @_asdf"
s = re.sub(r"(?:\s|^)[^@]\S*\b", lambda x: x.group().replace("_", r"\_"), s)
print(s) # => hello i like this char \_ write me lol\_ @myusername_ asd@\_a @_asdf

If you care about @ appearing anywhere in the word, try (?:\s|^)[^@\s]+\b:

s = "he_llo i like this char _ write me lol_ @myusername_ asd@_a @_asdf"
s = re.sub(r"(?:\s|^)[^@\s]+\b", lambda x: x.group().replace("_", r"\_"), s)
print(s) # => he\_llo i like this char \_ write me lol\_ @myusername_ asd@_a @_asdf

Per the OP's comments, it sounds like the latest spec is to escape _ everywhere except after an @ within a word:

>>> s = "he_llo i lol_@username_ _ write me lol_ @myusername_ asd@_a @_asdf"
>>> print(re.sub(r"(?:\s|^)[^@]+@", lambda x: x.group().replace("_", r"\_"), s))
he\_llo i lol\_@username_ \_ write me lol\_ @myusername_ asd@_a @_asdf
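
One thing to keep in mind with this last pattern: [^@]+ is not limited to a single word, so each match runs up to the next @, and any text after the last @ in the string is never matched. A small check of that edge case follows; the helper name and the input string are made up for illustration and are not from the original answer.

```python
import re

def escape_before_at(text):
    # Same pattern and callback as in the snippet above
    # (escape_before_at is a hypothetical helper name).
    return re.sub(r"(?:\s|^)[^@]+@", lambda m: m.group().replace("_", r"\_"), text)

# Illustrative input: "c_d" comes after the last "@" in the string.
print(escape_before_at("a_b @user_ c_d"))
# a\_b @user_ c_d
```

The trailing "c_d" is left unescaped because no @ follows it.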

Extracting with the PyPI regex library

import regex
string = "hello i like this char _ write me lol_ @myusername_"
print(regex.findall(r'(?<!\S)@\w+(*SKIP)(*F)|_', string))
# ['_', '_']

Explanation

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  @                        '@'
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount  possible))
--------------------------------------------------------------------------------
  (*SKIP)(*F)              skip the match, search from the failure location
--------------------------------------------------------------------------------
  |                        or
--------------------------------------------------------------------------------
  _                        a '_' char
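
The standard-library re module has no (*SKIP)(*F) verbs, but a similar effect can be approximated by matching the username in one branch, capturing "_" in the other, and keeping only the matches where the group took part. A minimal sketch of that idea, not part of the original answer:

```python
import re

string = "hello i like this char _ write me lol_ @myusername_"

# Branch 1 consumes a whole @username, so its "_" cannot be matched later;
# branch 2 captures a bare "_" in group 1.
pattern = re.compile(r"(?<!\S)@\w+|(_)")

# Keep only the matches in which group 1 participated.
underscores = [m.group(1) for m in pattern.finditer(string) if m.group(1)]
positions = [m.start(1) for m in pattern.finditer(string) if m.group(1)]

print(underscores)  # ['_', '_']
print(positions)    # [23, 37]
```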

Removing with re:

import re
string = "hello i like this char _ write me lol_ @myusername_"
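# \1 puts a matched username back; for a bare "_" the group is unset
# and expands to an empty string (Python 3.5+).
print(re.sub(r'(?<!\S)(@\w+)|_', r'\1', string))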
# hello i like this char  write me lol @myusername_

Replacing with re:

import re
string = "hello i like this char _ write me lol_ @myusername_"
print(re.sub(r'(?<!\S)(@\w+)|_', lambda x: x.group(1) or "-", string))
# hello i like this char - write me lol- @myusername_
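
For the original goal of escaping rather than replacing with "-", the same callback can return an escaped underscore instead. A minimal sketch; the helper name escape_underscores is made up for illustration.

```python
import re

# A username at the start of a word (kept as-is) or a bare "_" (escaped).
PATTERN = re.compile(r"(?<!\S)(@\w+)|_")

def escape_underscores(text):
    # group(1) is set only when the username branch matched;
    # escape_underscores is a hypothetical helper name.
    return PATTERN.sub(lambda m: m.group(1) or r"\_", text)

string = "hello i like this char _ write me lol_ @myusername_"
print(escape_underscores(string))
# hello i like this char \_ write me lol\_ @myusername_
```

Because of the (?<!\S) lookbehind, only usernames at the start of a word are protected: in "lol@username_" the trailing "_" would still be escaped.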

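You can match all non-whitespace characters following an @, and use an alternation to capture _ in a group. In the re.sub callback, check whether group 1 exists.

If it does, return the escaped underscore (that is, the escaped group 1 value, which is also an underscore); otherwise return the match unchanged.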

@\S+|(_)

import re

strings = [
    "_",
    "lol_",
    "@username_",
    "lol@username_",
    "lol_@username",
    "lol_@username_"
]

for s in strings:
    result = re.sub(
        r"@\S+|(_)",
        lambda x: x.group(1).replace("_", r"\_") if x.group(1) else x.group(),
        s
    )
    print(result)

Output

\_
lol\_
@username_
lol@username_
lol\_@username
lol\_@username_

Based on @OlvinRoght's comment, with a slight modification, this should do the trick:

Regex

((?:^|\s)(?:[^@\s]*?))(_)((?:[^@\s]*?))(?=@|\s|$)

Code example

import re

text = '_hi hello i like this char _ write me lol_ _word something_ @myusername_ something_@username_'

regex = r"((?:^|\s)(?:[^@\s]*?))(_)((?:[^@\s]*?))(?=@|\s|$)"

# Leave the first and last capturing group as-is and replace the underscore with '\_'
subst = r"\1\\_\3"

print( re.sub(regex, subst, text) )

Expected output:

\_hi hello i like this char \_ write me lol\_ \_word something\_ @myusername_ something\_@username_

Note:

While this works, @TheFourthBird's answer is faster (and, I think, more elegant).
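
The speed claim can be checked with a rough timeit comparison. This is only a sketch: it assumes the answer referred to is the @\S+|(_) one above, the function names are made up for illustration, and the timings will vary by machine and input.

```python
import re
import timeit

text = '_hi hello i like this char _ write me lol_ _word something_ @myusername_ something_@username_'

# Assumed to be the pattern from the answer being compared against.
ALTERNATION = re.compile(r"@\S+|(_)")
GROUPED = re.compile(r"((?:^|\s)(?:[^@\s]*?))(_)((?:[^@\s]*?))(?=@|\s|$)")

def escape_alternation(s):
    # Leave "@..." runs untouched; escape every other "_".
    return ALTERNATION.sub(lambda m: r"\_" if m.group(1) else m.group(), s)

def escape_grouped(s):
    # Re-emit groups 1 and 3 around an escaped "_".
    return GROUPED.sub(r"\1\\_\3", s)

# Both approaches agree on this sample text.
assert escape_alternation(text) == escape_grouped(text)

for fn in (escape_alternation, escape_grouped):
    seconds = timeit.timeit(lambda: fn(text), number=2_000)
    print(f"{fn.__name__}: {seconds:.4f}s")
```

The two are not equivalent in general: for a word with more than one "_", such as "a_b_c", the grouped pattern escapes only the first underscore, while the alternation escapes both.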