Stem the word by using regex

Question

I am trying to use regex to stem words in a text.

c <- "Foo is down. No one wants Foos after this. Before, people liked Fooy a lot."

Desired output:

"Foo is down. No one wants Foo after this. Before, people liked Foo a lot."

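I need to keep the word Foo but remove any characters that directly follow it within the same word, leaving the rest of the string intact.

I have managed to split the suffix from the base of the word, and I can remove everything after a variant of the word "Foo". I have also tried word boundaries, but I can't figure out how to get the desired output.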

Answer 1

One possible regex for this replaces "Foo with one or more lowercase letters after it" with "Foo":

> x = "Foo is down. No one wants Foos after this. Before, people liked Fooy a lot."
> stringr::str_replace_all(x, "Foo[a-z]+", "Foo")
[1] "Foo is down. No one wants Foo after this. Before, people liked Foo a lot."
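
If adding a stringr dependency is not desirable, the same idea can be sketched in base R. This variant is an assumption on top of the answer above, not part of it: it adds a leading word boundary so that "Foo" is only matched at the start of a word, and uses [[:alpha:]] so that uppercase suffix letters are covered as well.

```r
x <- "Foo is down. No one wants Foos after this. Before, people liked Fooy a lot."

# \\b requires "Foo" to start a word; [[:alpha:]]+ matches one or more
# letters of either case, so a bare "Foo" is left untouched.
stemmed <- gsub("\\bFoo[[:alpha:]]+", "Foo", x, perl = TRUE)
print(stemmed)
```

Because the suffix part requires at least one letter, punctuation that follows the word (such as a full stop) is not consumed.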

Answer 2

We can try using gsub, replacing the pattern (?<=Foo)\S+ with an empty string:

x <- "Foo is down. No one wants Foos after this. Before, people liked Fooy a lot."
output <- gsub("(?<=Foo)\\S+", "", x, perl=TRUE)
output

[1] "Foo is down. No one wants Foo after this. Before, people liked Foo a lot."
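
One thing to watch with \S+ is that it matches any non-whitespace character, so punctuation attached to the word is removed along with the suffix. The sketch below uses a made-up input (not from the question) to show the difference when \w+ is used instead, which stops at the first non-word character.

```r
y <- "I like Foos. Do you like Foo?"

# \\S+ consumes every non-whitespace character after "Foo",
# including the trailing "." and "?".
with_S <- gsub("(?<=Foo)\\S+", "", y, perl = TRUE)

# \\w+ consumes only letters, digits and underscores,
# so the punctuation is preserved.
with_w <- gsub("(?<=Foo)\\w+", "", y, perl = TRUE)

print(with_S)
print(with_w)
```

Which behaviour is preferable depends on the data; for the sentence in the question both patterns give the same result, since no "Foo" variant there is directly followed by punctuation.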
