Extracting matches from strings with lookaround in R
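I have text data (storytelling), and my goal is to extract certain words defined by a co-occurrence pattern: they occur immediately before an overlap, which is marked by square brackets. The data look like this: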
who <- c("Sue:", NA, "Carl:", "Sue:", NA, NA, NA, "Carl:", "Sue:","Carl:", "Sue:","Carl:")
story <- c("That’s like your grand:ma. did that with::=erm ",
"with Ju:ne (.) once or [ twice.] ",
" [ Yeah. ] ",
"And June wanted to go out and yo- your granny said (0.8)",
"“make sure you're ba(hh)ck before midni(hh)ght.” ",
"[Mm.] ",
"[There] she was (.) a ma(h)rried woman with a(h)- ",
"She’s a right wally. ",
"mm [kids as well ] ",
" [They assume] an awful lot man¿ ",
"°°ye:ah,°° ",
"°°the elderly do.°° ")
CAt <- data.frame(who, story)
Now, defining a pattern:
pattern <- "\\w.*\\s\\[[^]].*]"
and using grep():
grep(pattern, CAt$story, value = TRUE)
[1] "with Ju:ne (.) once or [ twice.] "
[2] "mm [kids as well ] "
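I get the two strings that contain the target matches, but what I really want is the target words only, in this case the words "or" and "mm". This seems to me to call for a positive lookahead. So I redefined the pattern:
pattern <- "\\w.*(?=\\s\\[[^]].*])"
This is meant to say: "match the word iff you see a space followed by square brackets with some content on the right of that word". To extract only the exact matches, I normally use the code below, which works fine as long as no lookaround is involved, but here it throws an error: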
unlist(regmatches(CAt$story, gregexpr(pattern, CAt$story)))
Error in gregexpr(pattern, CAt$story) :
invalid regular expression, reason 'Invalid regexp'
Why is that? And how can I extract the exact matches?
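In your code, you can add perl=TRUE to gregexpr. R's default regex engine (TRE) does not support lookarounds, which is why the pattern is rejected as invalid; perl=TRUE switches to PCRE, which does.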
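As a minimal sketch of that fix, the snippet below runs the question's own lookahead pattern with perl = TRUE. It assumes a three-line, ASCII-only subset of the story vector (chosen here for brevity, not part of the original post). The error goes away, but the greedy .* still captures more than the target word in the first line.

```r
# A three-line subset of the story vector from the question (illustrative only)
story <- c("with Ju:ne (.) once or [ twice.] ",
           " [ Yeah. ] ",
           "mm [kids as well ] ")

# The question's lookahead pattern, unchanged apart from escaped backslashes
pattern <- "\\w.*(?=\\s\\[[^]].*])"

# perl = TRUE selects the PCRE engine, which understands (?= ... )
m <- gregexpr(pattern, story, perl = TRUE)
res <- unlist(regmatches(story, m))
print(res)
# "with Ju:ne (.) once or" "mm"
```

The call no longer fails, which suggests the remaining problem is the pattern itself rather than the extraction code.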
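In your pattern, \w.* matches a single word character and then any character 0+ times, so the match can extend far beyond one word.
The part \[[^]].*] matches [, then one character that is not ], then .* matches any character 0+ times, followed by ].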
You can update the pattern to repeat the word character and the negated character class themselves instead:
\w+(?=\s\[[^]]*])
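Explanation
\w+ Match 1+ word characters
(?= Positive lookahead, assert that directly to the right is
\s Match a single whitespace character
\[[^]]*] Match from the opening [ to the closing ] using a negated character class
) Close the positive lookahead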
In an R string literal, use double backslashes:
"\\w+(?=\\s\\[[^]]*])"
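Put together, the revised pattern could be used as sketched below. The story vector is an ASCII-only subset of the question's data, picked here purely for illustration; the extraction call is the one from the question with perl = TRUE added.

```r
# Illustrative subset of the question's data (ASCII-only lines)
story <- c("with Ju:ne (.) once or [ twice.] ",
           " [ Yeah. ] ",
           "[Mm.] ",
           "[There] she was (.) a ma(h)rried woman with a(h)- ",
           "mm [kids as well ] ")

# Word characters, only if followed by whitespace and a bracketed span
pattern <- "\\w+(?=\\s\\[[^]]*])"

# Lookarounds need the PCRE engine, hence perl = TRUE
words <- unlist(regmatches(story, gregexpr(pattern, story, perl = TRUE)))
print(words)
# "or" "mm"
```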
As an alternative, you can use a capture group instead of a lookahead:
(\w+)\s\[[^]]*]
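One possible way to pull out the capture group in base R is regexec() with regmatches(), sketched here on an illustrative ASCII-only subset of the question's data. Note that regexec() reports only the first match in each string, which is an assumption that happens to hold for these lines; the variable names are hypothetical.

```r
# Illustrative subset of the question's data (ASCII-only lines)
story <- c("with Ju:ne (.) once or [ twice.] ",
           " [ Yeah. ] ",
           "mm [kids as well ] ")

# Group 1 captures the word; the bracketed span is matched but not kept
pattern <- "(\\w+)\\s\\[[^]]*]"

# Each list element holds the full match followed by the capture groups,
# or character(0) when the string does not match
m <- regmatches(story, regexec(pattern, story, perl = TRUE))

# Keep only group 1; NULL entries are dropped by unlist()
captured <- unlist(lapply(m, function(x) if (length(x) >= 2) x[2] else NULL))
print(captured)
# "or" "mm"
```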