Match substring with regex for one or more capitalised words between quotation marks

Question

I have the following string:

example_string <- "In this document, Defined Terms are quotation marks, followed by definition. \"Third Party Software\" is software owned by third parties. \"USA\" the United States of America. \"Breach of Contract\" is in accordance with the Services Description."

I would like to extract every substring that is at least partly capitalised and enclosed in quotation marks. So the output should be:

"Third Party Software"  "USA"  "Breach of Contract"

I got this far with the following regex:

str_extract_all(example_string, "(?:\")\\w(\\s*\\w+)*")

[[1]]
[1] "\"Third Party Software" "\"USA"                  "\"Breach of Contract"

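I can't figure out how to avoid matching the opening escaped quote \". I know I could add a gsub line to strip it after extracting the defined terms, but I think there must be a way to do it all in a single regex call.

Any suggestions are much appreciated!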

Answer 1

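In your expression (?:")\w(\s*\w+)*, the non-capturing group (?:") matches and consumes the " character, so the quote ends up in the match value. A non-capturing group only avoids creating a capture; it does not exclude text from the overall match.

You might want to use

"(?<=\")\\w(\\s*\\w+)*"

where (?<=") is a positive lookbehind that matches a location immediately preceded by a " character, without consuming it. (The backslashes are doubled because the pattern is written as an R string literal.)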
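The lookbehind variant can be tried with a minimal sketch like the one below. It assumes the stringr package is installed and uses a shortened version of the question's string for brevity; the variable name terms is only illustrative.

```r
library(stringr)

# Shortened version of the question's string (an assumption for brevity).
example_string <- "\"Third Party Software\" is software. \"USA\" the United States. \"Breach of Contract\" is in accordance."

# (?<=\") asserts that a double quote precedes the match without consuming it.
# Backslashes are doubled because the pattern is an R string literal.
terms <- str_extract_all(example_string, "(?<=\")\\w(\\s*\\w+)*")[[1]]
print(terms)
```

Note that this pattern does not itself require an uppercase letter, and it would also match a word that directly follows a closing quote with no space in between, which is one reason the capturing approach may be more robust.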

However, when the left and right delimiters are the same single character, I'd rather use a capturing approach.

You can use stringr::str_match_all with

"(\p{Lu}[^"]*)"

Or you can keep your own pattern, slightly modified:

"(\p{Lu}\w*(?:\s+\w+)*)"

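See the regex demo, or this demo. Details:

" - a " character
(\p{Lu}[^"]*) - capturing group 1:
- \p{Lu} - any Unicode uppercase letter
- [^"]* - zero or more characters other than "
\w*(?:\s+\w+)* - (in the second pattern) zero or more letters, digits, or underscores, then zero or more occurrences of one or more whitespace characters followed by one or more letters, digits, or underscores
" - a " character.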

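To make the difference between the two capturing patterns concrete, here is a small sketch. The input string and the helper name capture_terms are hypothetical, chosen only for illustration; stringr is assumed to be installed.

```r
library(stringr)

# Hypothetical input: one quoted term contains a hyphen, one starts lowercase.
x <- "\"Third-Party Software\" and \"USA\" but not \"lowercase term\"."

# Hypothetical helper: return only capture group 1 of every match.
capture_terms <- function(string, pattern) {
  str_match_all(string, pattern)[[1]][, 2]
}

# Broad pattern: uppercase letter, then anything except a double quote.
broad <- capture_terms(x, "\"(\\p{Lu}[^\"]*)\"")

# Strict pattern: uppercase letter, then only word characters and whitespace.
strict <- capture_terms(x, "\"(\\p{Lu}\\w*(?:\\s+\\w+)*)\"")

print(broad)
print(strict)
```

On this input the broad pattern should keep the hyphenated term, while the strict one skips it because a hyphen is not a word character. Both skip the quoted term that starts with a lowercase letter.
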
See the R demo online:

library(stringr)
example_string <- "In this document, Defined Terms are quotation marks, followed by definition. \"Third Party Software\" is software owned by third parties. \"USA\" the United States of America. \"Breach of Contract\" is in accordance with the Services Description."
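# Backslashes are doubled inside the R string literal; group 1 holds the term without the quotes
res <- str_match_all(example_string, '"(\\p{Lu}[^"]*)"')
# Drop column 1 (the full match) and keep only the captured group
unlist(lapply(res, function(x) x[,-1]))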
## => [1] "Third Party Software" "USA"                  "Breach of Contract"
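
For comparison, the two-step route mentioned in the question (match with the quotes, then strip them with gsub) can be sketched in base R without stringr. This is only an illustration, not part of the answer's recommended approach; it assumes R's PCRE engine (perl = TRUE) was built with Unicode property support so that \p{Lu} is available, and the names m, quoted and terms are illustrative.

```r
example_string <- "In this document, Defined Terms are quotation marks, followed by definition. \"Third Party Software\" is software owned by third parties. \"USA\" the United States of America. \"Breach of Contract\" is in accordance with the Services Description."

# Step 1: match each quoted term including both quotes (PCRE via perl = TRUE).
m <- gregexpr("\"\\p{Lu}[^\"]*\"", example_string, perl = TRUE)
quoted <- regmatches(example_string, m)[[1]]

# Step 2: strip the leading and trailing quote from every match.
terms <- gsub("^\"|\"$", "", quoted)
print(terms)
```

Consuming both quotes in step 1 keeps a closing quote from being mistaken for the opening quote of the next term.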

regex

r

stringr