Match and replace misspelled words in a string in R

Question

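I have a list of phrases in which I want to replace certain words with their correctly spelled counterparts whenever they are misspelled.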

library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")

vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus

str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"

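This works fine until a pattern also matches part of a similar string, which then gets replaced with another word.

Suppose I have a pattern like the following:

"plea", whose correct form is "please". When I run the replacement, the word is removed and replaced with "pleased".

What I am looking for is that a string which is already correct is left unmodified, even when a similar pattern is found inside it.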

Answer 1

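Maybe you need to perform a progressive replacement. That is, you should have multiple sets of badwords and goodwords. Replace the badwords with more letters first, so that they can no longer be matched by a shorter pattern, and then move on to the shorter ones.

From the list you provided, I would create 2 sets:

goodwords1 <- c("three", "teasing")
badwords1 <- c("thre", "teeasing")

goodwords2 <- c("tree", "testing")
badwords2 <- c("tre", "tesing")

Replace with set 1 first, then with set 2. You can create as many such sets as you need.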
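
The two passes described above could be chained roughly as follows. This is only a sketch: the `sets` list, the `text` input, and the use of `Reduce` are assumptions made for illustration, not part of the original answer.

```r
library(stringr)

# Hypothetical container for the replacement sets, ordered from the
# longer misspellings to the shorter ones. Each set is a named vector:
# the names are the patterns, the values are the replacements.
sets <- list(
  c(thre = "three", teeasing = "teasing"),
  c(tre = "tree", tesing = "testing")
)

# Hypothetical input that contains only misspelled words
text <- "thre tre teeasing tesing"

# Apply the sets one after another, feeding each result into the next pass
result <- Reduce(function(s, set) str_replace_all(s, set), sets, text)
print(result)
# [1] "three tree teasing testing"
```

Note that these patterns still match inside longer words, so an already-correct "tree" in the input would still be hit by the "tre" pattern in the second pass.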

Answer 2

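str_replace_all takes regular expressions as patterns, so you can paste0 word boundaries (\b) around each of the badwords, so that a replacement is made only when the pattern matches a whole word: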

library(stringr)
string <- c("tre", "tree", "teeasing", "tesing") 
goodwords <- c("tree", "three", "teasing", "testing") 
badwords <- c("tre", "thre", "teeasing", "tesing") 

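# Paste word boundaries around badwords
# (the backslash must be doubled inside an R string literal)
badwords <- paste0("\\b", badwords, "\\b")

vect.corpus <- goodwords
names(vect.corpus) <- badwords

str_replace_all(string, vect.corpus)
# [1] "tree"    "tree"    "teasing" "testing"

The advantage of this approach is that you do not have to keep track of which strings are the longer ones.

This is what badwords looks like after pasting:

> badwords
[1] "\\btre\\b"      "\\bthre\\b"     "\\bteeasing\\b" "\\btesing\\b"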
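
Applied to the example from the question, the same idea might look like the sketch below. The input sentence is a hypothetical variation of the asker's `a4` that contains both the misspelled "plea" and an already-correct "please", and `setNames` is used here only as a shorthand for assigning the names.

```r
library(stringr)

# Hypothetical input: contains the misspelling "plea" and a correct "please"
a4 <- "I would like a cheseburger and friees plea, please"

badwords.corpus <- c("cheseburger", "friees", "plea")
goodwords.corpus <- c("cheeseburger", "fries", "please")

# Wrap every misspelling in word boundaries so only whole words match
vect.corpus <- setNames(goodwords.corpus,
                        paste0("\\b", badwords.corpus, "\\b"))

result <- str_replace_all(a4, vect.corpus)
print(result)
# [1] "I would like a cheeseburger and fries please, please"
```

The standalone "plea" is corrected, while the existing "please" is left alone because the "plea" inside it is followed by a word character rather than a word boundary.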

