Remove non-numeric characters within parentheses

Question

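I want to remove the non-numeric characters inside a specific pair of parentheses, and remove the other parentheses on that line. See the example below: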

text <- c("1110383 Project something 11/22/2019 (WSO) (89021-design)
John Doe (John.Doe@company22.com)",
          "1110383 Project something 11/22/2019 ASP (890212-wso)
John Doe (John.Doe@company22.com)
Other Stuff",
          "1110383 Project something SD (890212)
John Doe (John.Doe@company22.com)")

The expected output is:

cat(paste0(myoutxt, collapse = "\n"))
# 1110383 Project something 11/22/2019 WSO (89021)
# John Doe (John.Doe@company22.com)
# 1110383 Project something 11/22/2019 ASP (890212)
# John Doe (John.Doe@company22.com)
# 1110383 Project something SD (890212)
# John Doe (John.Doe@company22.com)

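I came up with a regex that matches my 5- or 6-digit number, but I am not sure what the replacement should be. I also think the pattern below needs to change, because it does not account for the other parentheses that may be present and need to be removed.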

^.*?\([^\d]*(\d{5,6})[^\d]*\).*$

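Logic:

Basically, I want to find lines that have a 5-6 digit number between parentheses (e.g. 89021 or 890212). Then, if there is anything else inside those parentheses, I want to remove it (e.g. -design or -wso). Finally, if there are other parentheses on that particular line (e.g. (WSO)), I want to remove the parentheses but keep the word.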

Answer 1

How about replacing

(?:\(([^)\d]+)\)(.*?))?\([^\d)]*(\d{5,6})[^\d)]*\)

with

\1\2(\3)

- (?:\(([^)\d]+)\)(.*?))? The first, optional part captures the contents of any parenthesized group before the number into group 1, and whatever lies between that group and the 5-6 digit part into group 2.
- \([^\d)]*(\d{5,6})[^\d)]*\) The second part captures the 5-6 digits into group 3.

See the demo at regex101: https://regex101.com/r/C1Dflp/4

In R, use gsub (backslashes must be doubled inside an R string literal):
```
# Backslashes are doubled because '\(' or '\d' is not a valid escape in an R string.
gsub(pattern='(?:\\(([^)\\d]+)\\)(.*?))?\\([^\\d)(]*(\\d{5,6})[^\\d)(]*\\)', 
         replacement='\\1\\2(\\3)', 
         x=text, 
         perl=TRUE, fixed = FALSE)
```
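
As a quick check, the call above can be run against the sample vector from the question. This is only a minimal sketch: pat and out are names introduced here for illustration, and text is written with explicit \n escapes instead of literal line breaks.

```r
# Sample input from the question, with line breaks written as \n
text <- c("1110383 Project something 11/22/2019 (WSO) (89021-design)\nJohn Doe (John.Doe@company22.com)",
          "1110383 Project something 11/22/2019 ASP (890212-wso)\nJohn Doe (John.Doe@company22.com)\nOther Stuff",
          "1110383 Project something SD (890212)\nJohn Doe (John.Doe@company22.com)")

# Groups 1 and 2: an optional preceding "(word)" and the text after it;
# group 3: the 5-6 digit number inside the target parentheses
pat <- '(?:\\(([^)\\d]+)\\)(.*?))?\\([^\\d)(]*(\\d{5,6})[^\\d)(]*\\)'
out <- gsub(pat, '\\1\\2(\\3)', text, perl = TRUE)

cat(out, sep = "\n")
# 1110383 Project something 11/22/2019 WSO (89021)
# John Doe (John.Doe@company22.com)
# 1110383 Project something 11/22/2019 ASP (890212)
# John Doe (John.Doe@company22.com)
# Other Stuff
# 1110383 Project something SD (890212)
# John Doe (John.Doe@company22.com)
```

The e-mail parentheses are left alone because they never contain a run of 5 or 6 digits, so neither part of the pattern can match there.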

Answer 2

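Is this what you want?

"\\(([^0-9@]*)\\)": removes any parentheses whose content has no digits and no @, keeping the content.
"\\((\\d{5,6}).*\\)": for parentheses that contain 5 to 6 digits plus anything else, keeps only the digits inside the parentheses.

I am assuming that the other set of parentheses always contains an e-mail address.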

library(stringr)

cat(
  paste0(
    str_replace(
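      # Step 1: drop the parentheses around content without digits or @
      str_replace(text, "\\(([^0-9@]*)\\)", "\\1"), 
      # Step 2: keep only the 5-6 digits, re-wrapped in parentheses
      "\\((\\d{5,6}).*\\)", 
      "(\\1)"), 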
    collapse = "\n"
  )
)

# 1110383 Project something 11/22/2019 WSO (89021)
# John Doe (John.Doe@company22.com)
# 1110383 Project something 11/22/2019 ASP (890212)
# John Doe (John.Doe@company22.com)
# Other Stuff
# 1110383 Project something SD (890212)
# John Doe (John.Doe@company22.com)
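
For readers who do not have stringr installed, the same two-step idea can be sketched in base R with sub(), which, like str_replace(), replaces only the first match in each string. This is an illustrative variant rather than part of the answer: step1 and step2 are names made up here, and perl = TRUE is assumed so that "." does not match the newline inside each element.

```r
text <- c("1110383 Project something 11/22/2019 (WSO) (89021-design)\nJohn Doe (John.Doe@company22.com)",
          "1110383 Project something 11/22/2019 ASP (890212-wso)\nJohn Doe (John.Doe@company22.com)\nOther Stuff",
          "1110383 Project something SD (890212)\nJohn Doe (John.Doe@company22.com)")

# Step 1: unwrap the first parenthesized group that has no digits and no @
step1 <- sub("\\(([^0-9@]*)\\)", "\\1", text, perl = TRUE)

# Step 2: inside the parentheses holding 5-6 digits, keep only the digits
step2 <- sub("\\((\\d{5,6}).*\\)", "(\\1)", step1, perl = TRUE)

cat(step2, sep = "\n")
```

Because both calls stop after the first match, a line with several digit-free groups would need gsub() instead, along with a stricter pattern such as [^0-9@()]* so that one match cannot span two groups.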

Answer 3

Here is an alternative approach that splits each line on the parentheses.

# Process one line: strip all parentheses, then re-wrap the 5-6 digit number.
fun_0 <- function(string) {
  vec <- strsplit(string, '\\(|\\)', perl = TRUE)[[1L]]
  # strsplit() returns "" as the first piece when the string starts with "(",
  # so the text inside brackets always sits at the even positions.
  s <- 2L
  e <- length(vec)
  if (s > e)
    return(vec)
  inside_brackets <- seq(s, e, 2L)
  # \\d{5,6} (not {4,5}) so that a 6-digit number is kept whole
  vec[inside_brackets] <- gsub('\\D*(\\d{5,6})\\D*', '(\\1)', vec[inside_brackets])
  paste(vec, collapse = '')
}
# Process the lines of one element; only lines with a run of 5+ digits are touched.
fun_1 <- function(string_vec) {
  to_process <- grepl('\\d{5,}', string_vec)
  string_vec[to_process] <- vapply(string_vec[to_process], fun_0, character(1))
  paste(string_vec, collapse = '\n')
}
# Split every element into lines, process them, and join them again.
fun_2 <- function(text) {
    string_list <- strsplit(text, '\n')
    vapply(string_list, fun_1, character(1))
}

Example

text <- c("1110383 Project something 11/22/2019 (WSO) (89021-design)\nJohn Doe (John.Doe@company22.com)",
          "1110383 Project something 11/22/2019 ASP (890212-wso)\nJohn Doe (John.Doe@company22.com)\nOther Stuff",
          "1110383 Project something SD (890212)\nJohn Doe (John.Doe@company22.com)")
fun_2(text)
# [1] "1110383 Project something 11/22/2019 WSO (89021)\nJohn Doe (John.Doe@company22.com)"
# [2] "1110383 Project something 11/22/2019 ASP (890212)\nJohn Doe (John.Doe@company22.com)\nOther Stuff"
# [3] "1110383 Project something SD (890212)\nJohn Doe (John.Doe@company22.com)"
