Regex in R: match collocates of node word
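I want to find the collocates of a word in text strings. The collocates of a word are the words that co-occur with it, either preceding or following it. Here's a made-up example: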
GO <- c("This little sentence went on and on.",
"It was going on for quite a while.",
"In fact it has been going on for ages.",
"It still goes on.",
"It would go on even if it didn't.")
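Let's say I'm interested in the words collocating with the lemma GO, including all the forms the verb 'go' can take, i.e., 'go', 'went', 'gone', 'goes', and 'going'. I want to extract the collocates on both the left and the right of GO using str_extract from the stringr package and assemble them in a data frame. This works fine as far as single-word collocates are concerned. I can do it like this:
library(stringr)
collocates <- data.frame(
  # one word plus trailing whitespace, immediately before a form of GO
  Left = str_extract(GO, "\\w+\\b\\s(?=(go(es|ing|ne)?|went))"),
  # the node word itself
  Node = str_extract(GO, "go(es|ing|ne)?|went"),
  # whitespace plus one word, immediately after a form of GO
  Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)\\s\\w+\\b"))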
Here's the result:
collocates
       Left  Node Right
1 sentence   went    on
2      was  going    on
3     been  going    on
4    still   goes    on
5    would     go    on
But I'm interested in not just the one word before and after GO but in, say, up to three words before and after GO. Using quantifiers gets me closer to the desired result, but not quite there:
collocates <- data.frame(
  Left = str_extract(GO, "(\\w+\\b\\s){0,3}(?=(go(es|ing|ne)?|went))"),
  Node = str_extract(GO, "go(es|ing|ne)?|went"),
  Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){0,3}"))
This is the result now:
collocates
                  Left  Node      Right
1 This little sentence   went  on and on
2               It was  going
3          it has been  going
4             It still   goes
5             It would     go on even if
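While the collocates on the left are all as desired, the collocates on the right are partly missing. Why is that? And how can the code be changed to match all collocates correctly?
Expected output:
                  Left  Node        Right
1 This little sentence   went    on and on
2               It was  going on for quite
3          it has been  going  on for ages
4             It still   goes           on
5             It would     go   on even if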
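The quantifier {0,3} (match the preceding token between 0 and 3 times) also accepts zero repetitions, so an empty match is valid. In 'going' and 'goes' the lookbehind is already satisfied right after the bare 'go'; no whitespace follows at that point, so the group matches zero times and str_extract returns this first, empty match. In 'went' and 'go' the node is followed directly by whitespace, which is why rows 1 and 5 come out right.
r <- data.frame(
  Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){0,3}"))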
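To make the empty match visible, here is a minimal sketch that flags which sentences return an empty string instead of a collocate. It assumes stringr is installed; the names right_pattern, first_right and empty_match are introduced here for illustration only.

```r
library(stringr)

GO <- c("This little sentence went on and on.",
        "It was going on for quite a while.",
        "In fact it has been going on for ages.",
        "It still goes on.",
        "It would go on even if it didn't.")

# The right-hand pattern from the question, with a minimum of zero repetitions
right_pattern <- "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){0,3}"

# str_extract() returns only the first match in each string
first_right <- str_extract(GO, right_pattern)

# "" (as opposed to NA) means the pattern matched, but with zero words
empty_match <- !is.na(first_right) & first_right == ""

print(data.frame(Sentence = GO, Right = first_right, Empty = empty_match))
```

The sentences flagged as empty are exactly those in which the node form ('going', 'goes') is longer than the bare 'go'.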
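With a minimum of 1, the empty match after the bare 'go' is no longer allowed, so the engine moves on to the end of the full word form. If at least one word follows the node, it is captured, together with up to two more; if no word follows, the result is NA.
r <- data.frame(
  Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))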
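Putting the pieces together, the following sketch rebuilds the full data frame with {1,3} on the right-hand side. The str_trim calls and the node variable are additions made here for readability (they strip the whitespace captured by the patterns and avoid repeating the alternation); they are not part of the original code.

```r
library(stringr)

GO <- c("This little sentence went on and on.",
        "It was going on for quite a while.",
        "In fact it has been going on for ages.",
        "It still goes on.",
        "It would go on even if it didn't.")

# All forms of the lemma GO, reused in the three patterns below
node <- "go(es|ing|ne)?|went"

collocates <- data.frame(
  # up to three words before the node (zero is fine on this side)
  Left  = str_trim(str_extract(GO, paste0("(\\w+\\b\\s){0,3}(?=(", node, "))"))),
  Node  = str_extract(GO, node),
  # one to three words after the node; the minimum of 1 rules out the empty match
  Right = str_trim(str_extract(GO, paste0("(?<=", node, ")(\\s\\w+\\b){1,3}"))),
  stringsAsFactors = FALSE
)

print(collocates)
```

On the example sentences this gives the collocates listed in the expected output of the question.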
This can be demonstrated further by varying the quantifier values and observing the result:
r <- data.frame(
  Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){2,2}"))
print(r)
     Right
1  on and
2  on for
3  on for
4    <NA>
5 on even
In the example above we chose {2,2} (minimum 2, maximum 2); since row 4 does not have two words after the node, we get <NA>.
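For experimenting with other bounds, the quantifier can be passed in as parameters. This is only a sketch: right_collocates is a hypothetical helper written for this example, not a stringr function, and it assumes whole-number bounds.

```r
library(stringr)

GO <- c("This little sentence went on and on.",
        "It was going on for quite a while.",
        "In fact it has been going on for ages.",
        "It still goes on.",
        "It would go on even if it didn't.")

# Hypothetical helper: extract between min_n and max_n words after the node
right_collocates <- function(x, min_n, max_n) {
  pattern <- sprintf("(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){%d,%d}",
                     as.integer(min_n), as.integer(max_n))
  str_extract(x, pattern)
}

exactly_two <- right_collocates(GO, 2, 2)  # NA where fewer than two words follow
one_to_three <- right_collocates(GO, 1, 3) # NA only where no word follows

print(data.frame(ExactlyTwo = exactly_two, OneToThree = one_to_three))
```

Any minimum of 1 or more avoids the empty match; the minimum then decides how many following words a sentence needs before it yields a result rather than NA.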