Regex string detection unless a specific pattern is detected using lookarounds in R

Question

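I want to highlight a pattern in a string with a regular expression, unless another pattern is also detected, for later string processing. Ultimately, I want to replace every hat or hats with a space unless what or no is found in the string, but for now I am experimenting with detecting the pattern.

Data: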

require(tidyverse)

trial.string<-c("Hat","coif","hatter","HATS","plushy","no hat","what","hat no","hats","HAT, what","what, hat, no, hats","A water hat")

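So far I have tried the following pattern in str_view_all to check whether it works. I wrapped the pattern in the regex function so that I could use the ignore_case = TRUE option.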

trial.string %>% 
  str_view_all(regex("(?<!w)hat(s)*(?!.*(what|no))"
                     ,ignore_case = TRUE))

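This results in:

The end result should be that the sixth string, no hat, and the eleventh string, what, hat, no, hats, are excluded from detection.

I am not sure whether I am using the lookarounds incorrectly, or whether my use of the regex function is wrong.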

Answer 1

You can use the following ICU regex, assuming there are never more than 1000 characters between the word no or what and the word hat:

stringr::str_replace_all(trial.string, 
      "(?i)(?<!\\b(?:no|what)\\b.{0,1000})\\bhats?(?!.*\\b(?:no|what)\\b)", " ")

See the regex demo.

The pattern breaks down as follows:

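(?i) - turns case-insensitive mode on
(?<!\b(?:no|what)\b.{0,1000}) - a negative lookbehind that fails the match if, immediately to the left of the current position, there is the whole word no or what followed by zero to 1000 characters other than line breaks
\bhat - a word boundary and the string hat
s? - an optional s
(?!.*\b(?:no|what)\b) - a negative lookahead that fails the match if, immediately to the right of the current position, there are zero or more characters other than line breaks, as many as possible (.*), followed by the whole word no or what.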
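To see why both lookarounds are needed, it can help to run each one on its own. The following is a minimal sketch, assuming stringr is installed; the names probe, ahead_only and behind_only are made up for this illustration and do not appear in the answer above.

```r
library(stringr)

# Two probe strings: the blocking word sits on opposite sides of "hat".
probe <- c("no hat", "hat no")

# Lookahead only: rejects "hat" when no/what appears somewhere to its right.
ahead_only <- "(?i)\\bhats?(?!.*\\b(?:no|what)\\b)"

# Lookbehind only: rejects "hat" when no/what appears within
# 1000 characters to its left.
behind_only <- "(?i)(?<!\\b(?:no|what)\\b.{0,1000})\\bhats?"

str_detect(probe, ahead_only)   # only "hat no" is blocked
str_detect(probe, behind_only)  # only "no hat" is blocked
```

Each half covers one direction only, which is why the attempt in the question, with a lookahead but no equivalent lookbehind, still detects hat in no hat.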

See an R demo online:

trial.string<-c("Hat","coif","hatter","HATS","plushy","no hat","what","hat no","hats","HAT, what","what, hat, no, hats","A water hat")
stringr::str_replace_all(trial.string, 
      "(?i)(?<!\\b(?:no|what)\\b.{0,1000})\\bhats?(?!.*\\b(?:no|what)\\b)", " ")

Output:

 [1] " "                   "coif"                " ter"               
 [4] " "                   "plushy"              "no hat"             
 [7] "what"                "hat no"              " "                  
[10] "HAT, what"           "what, hat, no, hats" "A water  "
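As a possible alternative to the lookarounds (a different technique, not the approach used above), the check and the replacement can be split into two steps: first flag the strings that contain no or what as whole words, then replace only in the remaining strings. This is a sketch under the assumption that each element is a single line; the names blocked and result are illustrative.

```r
library(stringr)

trial.string <- c("Hat", "coif", "hatter", "HATS", "plushy", "no hat", "what",
                  "hat no", "hats", "HAT, what", "what, hat, no, hats",
                  "A water hat")

# TRUE where the whole word "no" or "what" occurs anywhere in the string.
blocked <- str_detect(trial.string,
                      regex("\\b(?:no|what)\\b", ignore_case = TRUE))

# Replace hat/hats with a space only in the strings that are not blocked.
result <- ifelse(
  blocked,
  trial.string,
  str_replace_all(trial.string, regex("\\bhats?", ignore_case = TRUE), " ")
)
result
```

On this sample data the result should match the output above. The two approaches can diverge on multi-line strings, or when no or what sits more than 1000 characters to the left of hat, because this version always checks the whole string.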

Tags: regex, r, stringr, regex-lookarounds, tidyverse