Regex in R: match collocates of node word

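I want to find the collocates of a word in text strings. The collocates of a word are the words that co-occur with it, either before or after it. Here is a made-up example: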

GO <- c("This little sentence went on and on.", 
        "It was going on for quite a while.", 
        "In fact it has been going on for ages.", 
        "It still goes on.", 
        "It would go on even if it didn't.")

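Say I'm interested in the words that collocate with the lemma GO, including all the forms the verb 'go' can take, i.e. 'go', 'went', 'gone', 'goes' and 'going'. I want to extract GO's collocates to the left and to the right using str_extract from the stringr package and assemble them in a data frame. As far as single-word collocates are concerned, this all works fine. I can do it like this: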

library(stringr)

collocates <- data.frame(
  Left = str_extract(GO, "\\w+\\b\\s(?=(go(es|ing|ne)?|went))"),
  Node = str_extract(GO, "go(es|ing|ne)?|went"),
  Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)\\s\\w+\\b"))

This is the result:

collocates
       Left  Node Right
1 sentence   went    on
2      was  going    on
3     been  going    on
4    still   goes    on
5    would     go    on

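But I'm interested not just in the one word before and after GO, but in up to three words before and after GO. Using a quantifier gets me closer to the desired result, but not quite there: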

collocates <- data.frame(
  Left = str_extract(GO, "(\\w+\\b\\s){0,3}(?=(go(es|ing|ne)?|went))"),
  Node = str_extract(GO, "go(es|ing|ne)?|went"),
  Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){0,3}"))

This is the result now:

collocates
                   Left  Node       Right
1 This little sentence   went   on and on
2               It was  going            
3          it has been  going            
4             It still   goes            
5             It would     go  on even if

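While the collocates on the left are all as desired, the collocates on the right are partly missing. Why is that? And how can the code be changed to match all collocates correctly?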

Expected output:

                   Left  Node         Right
1 This little sentence   went     on and on
2               It was  going  on for quite
3          it has been  going   on for ages
4             It still   goes            on
5             It would     go    on even if

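The quantifier {0,3} (match between 0 and 3 of the preceding token) lets the group match zero times, so an empty match counts as a success. The lookbehind is already satisfied right after the bare 'go' inside 'going' and 'goes', because the suffix (es|ing|ne) is optional. At that position the next character is a letter rather than whitespace, so the group matches zero times and str_extract returns an empty string instead of moving on to the words after the full verb form.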

library(stringr)

r <- data.frame(
  Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){0,3}"))
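
To see why the right-hand side comes back empty, it can help to test the lookbehind on its own. The following is a minimal sketch that assumes stringr is installed; the names x, mid_word and right are illustrative only and not part of the original code.

```r
library(stringr)

# One sentence from the example data, isolated for inspection.
x <- "It was going on for quite a while."

# Lookbehind followed by the literal "ing": TRUE means the lookbehind is
# already satisfied between "go" and "ing", i.e. in the middle of the word.
mid_word <- str_detect(x, "(?<=go(es|ing|ne)?|went)ing")

# Same lookbehind with the {0,3} group: at that mid-word position the group
# can only match zero times, so the first (and returned) match is empty.
right <- str_extract(x, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){0,3}")

print(mid_word)
print(nchar(right))
```

If mid_word is TRUE and nchar(right) is 0, the empty result comes from the match position, not from missing words in the sentence.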

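With a minimum of 1, an empty match is no longer allowed. The attempt right after the bare 'go' fails, the engine moves on to the position after the full verb form, and if there is at least one word to the right of the node it is captured, along with any further words up to the specified maximum.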

r <- data.frame(
  Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))
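
Putting the {1,3} quantifier back into the full data frame from the question might look like the sketch below. The calls to str_trim and the stringsAsFactors = FALSE argument are additions assumed here for tidier output; they are not part of the original code.

```r
library(stringr)

GO <- c("This little sentence went on and on.",
        "It was going on for quite a while.",
        "In fact it has been going on for ages.",
        "It still goes on.",
        "It would go on even if it didn't.")

collocates <- data.frame(
  # Up to three words before the node; the lookahead keeps the node itself
  # out of the match. str_trim() drops the trailing space.
  Left = str_trim(str_extract(GO, "(\\w+\\b\\s){0,3}(?=(go(es|ing|ne)?|went))")),
  Node = str_extract(GO, "go(es|ing|ne)?|went"),
  # At least one and at most three words after the node; the minimum of 1
  # rules out the empty match inside "going" and "goes".
  Right = str_trim(str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){1,3}")),
  stringsAsFactors = FALSE)

print(collocates)
```

Note that with a minimum of 1, a sentence in which the node is the last word would yield NA in the Right column rather than an empty string.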

This can be demonstrated further by playing with the quantifier values and observing the following:

r <- data.frame(
  Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){2,2}"))

print(r)

     Right
1   on and
2   on for
3   on for
4     <NA>
5  on even

In the example above we chose {2,2} (a minimum of 2 and a maximum of 2). Since row 4 does not have enough words after the node to capture exactly 2, we get <NA>.
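
As a possible alternative, not taken from the answer above, the lookbehind can be avoided altogether by anchoring the node on word boundaries and capturing the following words in a group with str_match. This is only a sketch under that assumption; the names m and right are illustrative.

```r
library(stringr)

GO <- c("This little sentence went on and on.",
        "It was going on for quite a while.",
        "In fact it has been going on for ages.",
        "It still goes on.",
        "It would go on even if it didn't.")

# \\b after the node forces the whole verb form to be consumed, so {0,3}
# can no longer succeed with an empty match in the middle of "going".
# str_match() returns a matrix: column 1 is the whole match,
# column 2 is the first capture group (the words after the node).
m <- str_match(GO, "\\b(?:go(?:es|ing|ne)?|went)\\b((?:\\s\\w+){0,3})")
right <- str_trim(m[, 2])

print(right)
```

Because the node is matched rather than looked behind, a minimum of 0 is safe here, and a sentence ending in the node would give an empty string instead of NA.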