Regex replace parts/groups of a string in R

Question

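I am trying to post-process the LaTeX (pdf_book output) of a bookdown document to collapse biblatex citations, so that they can later be sorted chronologically using \usepackage[sortcites]{biblatex}. To do that, I need to find every }{ that follows \autocites and replace it with a comma. I am experimenting with gsub() but can't find the correct incantation.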

# example input
testcase <- "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}"

# desired output
"text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"

A simple approach is to replace every }{:

> gsub('\\}\\{', ',', testcase, perl=TRUE)
[1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep,separate}"

But this also collapses {keep}{separate}.

I then tried to replace }{ only inside a 'word' (a run of characters without whitespace) that starts with \autocites, using capture groups, but failed:

> gsub('(\\\\autocites)([^ \f\n\r\t\v}{}]+)((\\}\\{})+)', '\\1\\2\\3', testcase, perl=TRUE)
[1] "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}"

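Addendum: the actual document contains more lines/elements than the test case above. Not every element contains \autocites, and in rare cases a single element contains more than one \autocites. I did not originally think this was relevant. A more realistic test case: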

testcase2 <- c("some text",
"text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
"text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")

Answer 1

This is not the prettiest solution, but it works. It repeatedly replaces }{ with a comma, but only when the }{ follows autocites with no whitespace in between.

while(length(grep('(autocites\\S*)\\}\\{', testcase, perl=TRUE))) {
    testcase = sub('(autocites\\S*)\\}\\{', '\\1,', testcase, perl=TRUE)
}

testcase
[1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
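
As a hedged sketch (not part of the original answer), the same loop can be wrapped in a helper and applied to the vector test case from the question's addendum. The name collapse_autocites is hypothetical. Because sub() is vectorized and only touches the first match in each element, repeating it until nothing matches also handles elements with several \autocites.

```r
# Hypothetical helper wrapping the loop above; accepts a character vector.
collapse_autocites <- function(x) {
  # "autocites", any run of non-whitespace, then a literal "}{"
  pattern <- "(autocites\\S*)\\}\\{"
  # sub() replaces one "}{" per element per pass, so repeat until none is left
  while (any(grepl(pattern, x, perl = TRUE))) {
    x <- sub(pattern, "\\1,", x, perl = TRUE)
  }
  x
}

testcase2 <- c("some text",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")

result <- collapse_autocites(testcase2)
writeLines(result)  # writeLines shows the text with single backslashes
```

The number of passes equals the largest number of }{ pairs to collapse in any one element, which is fine for a handful of citations but is worth keeping in mind for very long inputs.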

Answer 2

I will make the input string a little larger to make the algorithm clearer.

str <- "
text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}
text \\autocites[cf.~][]{wattPattern1947}{foxMapping2000}{runkleGap1990} text {keep}{separate}
"

We first extract all citation blocks, replace "}{" with "," inside them, and then put them back into the string.

library(stringr)

# pattern for matching citation blocks
pattern <- "\\\\autocites(\\[[^\\[\\]]*\\])*(\\{[[:alnum:]]*\\})+"
cit <- str_extract_all(str, pattern)[[1]]
cit

#> [1] "\\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990}"
#> [2] "\\autocites[cf.~][]{wattPattern1947}{foxMapping2000}{runkleGap1990}"

Replace within the citation blocks:

newcit <- str_replace_all(cit, "\\}\\{", ",")
newcit
#> [1] "\\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990}"
#> [2] "\\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990}"

Split the original string at the places where the citation blocks were found:

strspl <- str_split(str, pattern)[[1]]
strspl
#> [1] "\ntext "  " text {keep}{separate}\ntext "  " text {keep}{separate}\n"

Insert the modified citation blocks:

combined <- character(length(strspl) + length(newcit))
combined[c(TRUE, FALSE)] <- strspl
combined[c(FALSE, TRUE)] <- newcit
combined
#> [1] "\ntext "
#> [2] "\\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990}"
#> [3] " text {keep}{separate}\ntext "
#> [4] "\\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990}"
#> [5] " text {keep}{separate}\n"

Paste everything together to finish:

newstr <- paste(combined, collapse = "")
newstr
#> [1] "\ntext \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}\ntext \\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990} text {keep}{separate}\n"

I suspect there may be a more elegant all-regex solution based on the same idea, but I could not find one.
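
The extract, modify, and put-back idea can also be sketched in base R with gregexpr() and the replacement form of regmatches(), which avoids the manual split and interleave steps. This is an illustration under assumptions, not the answer's own code: the pattern below simply treats everything non-whitespace after \autocites as the citation block, and the input vector x is made up for the example.

```r
# Base R only; the input vector is a made-up example.
x <- c("some text",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947} text {keep}{separate}",
  "\\autocites{a}{b} and \\autocites[see][]{c}{d}{e}")

# Assumed pattern: "\autocites" plus the non-whitespace run that follows it
m <- gregexpr("\\\\autocites\\S+", x, perl = TRUE)

# Extract each citation block, collapse "}{" to "," inside it, write it back.
# Elements without a match are left unchanged.
regmatches(x, m) <- lapply(regmatches(x, m), function(block) {
  gsub("}{", ",", block, fixed = TRUE)
})

writeLines(x)
```

Since the inner replacement uses fixed = TRUE, the braces need no escaping there; only the outer pattern is a regular expression.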

Answer 3

I found an incantation that works. It is not pretty:

gsub("\\\\autocites[^ ]*",
  gsub("\\}\\{", ",",
    gsub(".*(\\\\autocites[^ ]*).*", "\\\\\\1", testcase) # all those extra backslashes are there because R is ridiculous.
    ),
  testcase)

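I split it across several lines to hopefully make it easier to follow. Basically, the innermost gsub extracts only the autocites part (everything from \autocites up to the first space), the middle gsub then replaces each }{ with a comma, and the outermost gsub replaces the text matched by the same pattern with the result of the middle call.

Of course, this only works for a single autocites in the string.

Also, fortune(365).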
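
To make the "extra backslashes" less mysterious, here is a small sketch of the three escaping layers involved. It is an added illustration, not part of the original answer, and the variable names are arbitrary.

```r
x <- "\\autocites"

# Layer 1, string literal: "\\" in R source code is ONE backslash character
n_chars <- nchar(x)          # 10, not 11
writeLines(x)                # prints \autocites

# Layer 2, regex pattern: the regex needs \\ to match one literal backslash,
# which takes four backslashes in the R source
found <- grepl("\\\\autocites", x)

# Layer 3, replacement string: "\\\\" inserts one literal backslash and
# "\\1" refers to the first capture group
doubled <- gsub("(\\\\autocites)", "\\\\\\1", x)
n_doubled <- nchar(doubled)  # 11: one extra backslash was prepended
```

The last call mirrors the innermost gsub above: it deliberately doubles the backslash so that the result survives being used as a replacement string by the outermost gsub.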

Answer 4

A single gsub call is enough:

gsub("(?:\\G(?!^)|\\\\autocites)\\S*?\\K}{", ",", testcase, perl=TRUE)
## => [1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"

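See the regex demo. Here, (?:\G(?!^)|\\autocites) matches either the end of the previous match or the string \autocites. Then \S*? matches zero or more non-whitespace characters, as few as possible. Then \K discards the text matched so far from the current match buffer, and }{ is consumed, so only that substring ends up being replaced with a comma.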
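
As a hedged check (an added sketch, not from the original answer), the same pattern can be run against the vector test case from the question's addendum, which includes an element without \autocites and an element with two of them. The variable names pattern and collapsed are arbitrary.

```r
testcase2 <- c("some text",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")

# \G(?!^) continues from the previous match, \\autocites starts a new chain,
# \S*? stays inside the same non-whitespace run, \K drops the prefix
pattern <- "(?:\\G(?!^)|\\\\autocites)\\S*?\\K}{"

collapsed <- gsub(pattern, ",", testcase2, perl = TRUE)
writeLines(collapsed)
```

Because \S*? cannot cross whitespace, the chain of matches ends at the first space after a citation, which is why {keep}{separate} is left alone.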

There is also a very readable solution that combines one regex with a fixed-text replacement, using stringr::str_replace_all:

library(stringr)
str_replace_all(testcase, "\\\\autocites\\S+", function(x) gsub("}{", ",", x, fixed=TRUE))
# => [1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"

Here, \\autocites\S+ matches \autocites followed by one or more non-whitespace characters, and gsub("}{", ",", x, fixed=TRUE) replaces (very quickly) each }{ with , inside the matched text.
