How to use a regex to find all curly braces inside curly braces?

I am using Zotero to create a BibTeX reference list from PDFs, and it wraps words whose capitalization must be preserved in { }.

title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil}},

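However, some people on my team use Mendeley, which does not seem to know this rule of the BibTeX format, so the { } still show up in their titles after they import the BibTeX file I send them.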

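So I want to write a small script (in R) that removes the { } nested inside the main { } of the title (and other fields), so that the line above becomes the line below in the modified file.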

title = {Novel breeding habitat, oviposition microhabitat, and parental care in Bokermannohyla caramaschii (Anura: Hylidae) in southeastern Brazil},

I have tried a lot of things, but nothing works. What is the regular expression that does this?

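If we can be sure that the strings "%%%" and "###" never appear in a title, then the following strategy works. First change the first "{" to "%%%" and the last "}" to "###". Then remove every remaining "{" and "}", and finally put the first "{" and the last "}" back.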

txt <- "title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil}},"
txt2 <- sub("(^[^{]+)(\\{)", "\\1%%%", txt) # placeholder for first "{"
txt3 <- sub("(\\})([^}]*$)", "###\\2", txt2) #  "    "     for last "}"
txt4 <- gsub("\\{|\\}", "", txt3) # remove the rest
txt5 <- sub("%%%", "{", txt4) # put the leading and trailing ones back
txt6 <- sub("###", "}", txt5)
txt6
[1] "title = {Novel breeding habitat, oviposition microhabitat, and parental care in Bokermannohyla caramaschii (Anura: Hylidae) in southeastern Brazil},"
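To apply the same steps to several lines at once, they could be wrapped in a small helper. This is only a sketch: the name `strip_inner_braces` and its `open_mark` / `close_mark` arguments are made up for illustration, and it assumes each field sits on a single line. It also checks the assumption stated above, that the placeholders do not already occur in the input.

```r
# Hypothetical helper wrapping the placeholder strategy; vectorized over lines.
strip_inner_braces <- function(x, open_mark = "%%%", close_mark = "###") {
  # The strategy is only safe if neither placeholder already occurs in the input.
  if (any(grepl(open_mark, x, fixed = TRUE)) ||
      any(grepl(close_mark, x, fixed = TRUE))) {
    stop("placeholder already present in input; choose different markers")
  }
  x <- sub("^([^{]*)\\{", paste0("\\1", open_mark), x)   # protect the first "{"
  x <- sub("\\}([^}]*)$", paste0(close_mark, "\\1"), x)  # protect the last "}"
  x <- gsub("[{}]", "", x)                               # drop the remaining braces
  x <- sub(open_mark, "{", x, fixed = TRUE)              # restore the outer pair
  sub(close_mark, "}", x, fixed = TRUE)
}

lines <- c(
  "@article{key,",
  "title = {Parental care in {Bokermannohyla} caramaschii ({Anura}: {Hylidae})},"
)
strip_inner_braces(lines)
```

Lines with no braces, or with only one brace (such as the entry header), pass through unchanged because each substitution simply finds nothing to replace.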

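Here is a parser that removes only the { and } that sit inside a complete outer { ... } pair. It does not pretend to be fast or efficient, but with strings of reasonable length you should not notice any delay.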

func <- function(S) {
  spl <- strsplit(S, "")[[1]]
  out <- character(0)
  inbrace <- 0L
  for (i in seq_along(spl)) {
    ch <- spl[i]
    if (ch == "{") {
      if (inbrace < 1L) out <- c(out, ch)
      inbrace <- inbrace + 1L
    } else if (ch == "}") {
      if (inbrace == 0L) {
        stop("unmatched close brace at: ", i)
      } else if (inbrace == 1L) {
        out <- c(out, ch)
      }
      inbrace <- max(0L, inbrace - 1L)
    } else out <- c(out, ch)
  }
  if (inbrace != 0L) stop("finished missing ", inbrace, " close-brace(s)")
  paste(out, collapse = "")
}

Demo:

func('title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil}},')
# [1] "title = {Novel breeding habitat, oviposition microhabitat, and parental care in Bokermannohyla caramaschii (Anura: Hylidae) in southeastern Brazil},"

It tries to be quite strict: it fails if it meets an unmatched } or if the input ends while a { is still unmatched.

func('title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil},')
# Error in func("title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil},") : 
#   finished missing 1 close-brace(s)

func('title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla}} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil}},')
# Error in func("title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla}} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil}},") : 
#   unmatched close brace at: 156
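The same depth-tracking idea can also be sketched without an explicit loop, by computing the nesting depth of every character with `cumsum`. The function name `strip_nested_braces` is an assumption for illustration and not part of the answer above; its error messages are also simpler than those of `func`.

```r
# Hypothetical vectorized variant of the depth-tracking parser.
strip_nested_braces <- function(s) {
  ch <- strsplit(s, "", fixed = TRUE)[[1]]
  is_open  <- ch == "{"
  is_close <- ch == "}"
  # Nesting depth *after* each character has been read.
  depth <- cumsum(is_open - is_close)
  if (any(depth < 0)) stop("unmatched close brace at: ", which(depth < 0)[1])
  if (length(depth) > 0 && depth[length(depth)] != 0) {
    stop("finished missing ", depth[length(depth)], " close-brace(s)")
  }
  # An outer "{" leaves depth 1 and an outer "}" leaves depth 0; drop all other braces.
  keep <- !(is_open & depth > 1) & !(is_close & depth > 0)
  paste(ch[keep], collapse = "")
}

strip_nested_braces("title = {Parental care in {Bokermannohyla} caramaschii ({Anura}: {Hylidae})},")
```

Because only the depth is inspected, a string with several top-level groups keeps the outer braces of each group.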

You can replace every match of the regular expression

(?<!^title = ){|}(?!,$)

with an empty string (using perl=TRUE).

The regular expression can be broken down as follows. (I show each space as a character class containing a space so that the reader can see it.)

(?<!            # begin a negative lookbehind
  ^             # match the start of the string 
  title[ ]=[ ]  # match 'title = '
)               # end negative lookbehind
{               # match '{'
|               # or
}               # match '}'
(?!             # begin a negative lookahead
  ,$            # match a comma at the end of the string
)               # end a negative lookahead
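
In R, the expression broken down above might be used as in the following sketch. The variable names are illustrative, and the braces are escaped here only to make the intent explicit. Note the assumptions built into the pattern: the line must start exactly with "title = " and end with a comma, so an indented line or a different field name would need an adjusted lookbehind.

```r
txt <- "title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil}},"

# Lookarounds require the PCRE engine, hence perl = TRUE.
pattern <- "(?<!^title = )\\{|\\}(?!,$)"

cleaned <- gsub(pattern, "", txt, perl = TRUE)
cleaned
```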