Extracting matches from strings with lookaround in R

Question

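I have text data (storytelling), and my goal is to extract certain words that are defined by a co-occurrence pattern: they occur immediately before an overlap, which is marked by square brackets. The data looks like this: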

who <- c("Sue:", NA, "Carl:", "Sue:", NA, NA, NA, "Carl:", "Sue:","Carl:", "Sue:","Carl:")
story <- c("That’s like your grand:ma. did that with::=erm          ",
       "with Ju:ne (.) once or [ twice.]                        ",
       "                       [ Yeah. ]                        ",
       "And June wanted to go out and yo- your granny said (0.8)",
       "“make sure you're ba(hh)ck before midni(hh)ght.”        ",
       "[Mm.]                                                   ",
       "[There] she was (.) a ma(h)rried woman with a(h)-       ",
       "She’s a right wally.                                    ",
       "mm [kids  as well ]                                     ",
       "   [They    assume] an awful lot man¿                   ",
       "°°ye:ah,°°                                              ",
      "°°the elderly do.°°                                      ")
CAt <- data.frame(who, story)

Now, defining a pattern:

pattern <- "\\w.*\\s\\[[^]].*]"

and using grep():

grep(pattern, CAt$story, value = T)
[1] "with Ju:ne (.) once or [ twice.]                        "
[2] "mm [kids  as well ]                                     "

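I get the two strings that contain the target matches, but what I really want is just the target words, in this case the words "or" and "mm". This seems to me to call for a positive lookahead. So I redefined the pattern:

pattern <- "\\w.*(?=\\s\\[[^]].*])"

This is meant to say: "match the word iff you see a space followed by square brackets with some content on the right of that word". To extract only the exact matches I normally use the code below, which works fine as long as no lookaround is involved, but here it throws an error: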

unlist(regmatches(CAt$story, gregexpr(pattern, CAt$story)))
Error in gregexpr(pattern, CAt$story) : 
invalid regular expression, reason 'Invalid regexp'

Why is this? And how can the exact matches be extracted?

Answer 1

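In your code, you can add perl=TRUE to gregexpr. R's default regex engine (TRE) does not support lookaround, which is why the pattern is rejected as invalid; perl=TRUE switches to PCRE, which does support it.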
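
As a minimal sketch of the engine difference, the snippet below checks whether gregexpr() accepts the lookahead pattern with and without perl = TRUE. The vector `x` is a shortened, illustrative stand-in for CAt$story, and `compiles` is a hypothetical helper introduced only for this demonstration.

```r
# Shortened stand-in for CAt$story (illustrative data, not the full example)
x <- c("with Ju:ne (.) once or [ twice.]",
       "mm [kids  as well ]",
       "She is a right wally.")

pattern <- "\\w.*(?=\\s\\[[^]].*])"

# Hypothetical helper: TRUE if gregexpr() accepts the pattern, FALSE if it errors
compiles <- function(...) {
  tryCatch({
    suppressWarnings(gregexpr(pattern, x, ...))
    TRUE
  }, error = function(e) FALSE)
}

default_ok <- compiles()             # default engine: the lookahead is rejected
perl_ok    <- compiles(perl = TRUE)  # PCRE: the lookahead is accepted
print(c(default = default_ok, perl = perl_ok))
```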

In your pattern, \w.* matches a single word character and then any character 0+ times, so the match is not limited to one word.

This part, \[[^]].*], matches [, then exactly 1 character that is not ], then .* matches any character 0+ times, and finally ].
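
To illustrate the effect of the greedy .*, here is a small sketch that runs the original lookahead pattern with perl = TRUE on two of the example lines (the vector `x` is an illustrative subset of the data). The first match starts at the first word character of the line and runs up to the bracket, rather than being limited to the single word before it.

```r
# Two lines from the example data (illustrative subset)
x <- c("with Ju:ne (.) once or [ twice.]",
       "mm [kids  as well ]")

# Original pattern from the question, with backslashes doubled for R
pattern <- "\\w.*(?=\\s\\[[^]].*])"

# perl = TRUE is needed for the lookahead to be accepted
greedy <- unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
print(greedy)
# [1] "with Ju:ne (.) once or" "mm"
```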

You can update your pattern to repeat the word character and the negated character class itself instead:

\w+(?=\s\[[^]]*])

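Explanation

\w+ Match 1+ word characters
(?= Positive lookahead, assert that what is directly to the right is
- \s A single whitespace character
- \[[^]]*] An opening [, then 0+ characters other than ] (negated character class), then the closing ]
) Close the positive lookahead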


In an R string, use doubled backslashes:

\\w+(?=\\s\\[[^]]*])
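
Putting the pieces together, a minimal sketch of the extraction could look like this. The vector `x` is an illustrative subset of CAt$story; with the full data frame, CAt$story would be passed instead.

```r
# Illustrative subset of the example data: two lines with a word before the
# bracket, two lines where the bracket is preceded only by whitespace
x <- c("with Ju:ne (.) once or [ twice.]",
       "                       [ Yeah. ]",
       "mm [kids  as well ]",
       "   [They    assume] an awful lot man")

# Word characters that are directly followed by whitespace and [...]
pattern <- "\\w+(?=\\s\\[[^]]*])"

# perl = TRUE switches to PCRE so that the lookahead is supported
words <- unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
print(words)
# [1] "or" "mm"
```

Lines without a match contribute an empty character vector, which unlist() drops.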

As an alternative, you can use a capturing group instead of a lookahead:

(\w+)\s\[[^]]*]
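
One possible way to use the capturing group in R is regexec() together with regmatches(); this is a sketch, not the only option. The vector `x` is an illustrative subset of the data. Note that regexec() reports only the first match per string, which is enough for this subset.

```r
# Illustrative subset of the example data
x <- c("with Ju:ne (.) once or [ twice.]",
       "                       [ Yeah. ]",
       "mm [kids  as well ]")

# Group 1 captures the word; no lookaround is involved
pattern <- "(\\w+)\\s\\[[^]]*]"

# Each list element is character(0) (no match) or c(full match, group 1)
m <- regmatches(x, regexec(pattern, x))

# Keep only the strings that matched and pull out group 1
words <- vapply(Filter(length, m), function(hit) hit[2], character(1))
print(words)
# [1] "or" "mm"
```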

