How can I find compound words, remove the spaces between them, and replace them in my corpus?

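I have many compound terms, such as hello World, good Morning, good Night... I want to find them in my corpus and replace them with helloWorld, goodMorning, goodNight, so that each one is kept as a single concept. I could do this one by one, but that is very tedious because there are so many compound terms. I need to do this in R.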

If all of your compound words are separated only by spaces, you can use gsub:

> x = c("hello World", "good Morning", "good Night")
> y = gsub(pattern = " ", replacement = "", x = x)
> print(y)
[1] "helloWorld"  "goodMorning" "goodNight"  

You can always add more patterns to the pattern argument. Read more about regular expressions in R here and here.
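As an illustration of adding more patterns, here is a minimal sketch that also removes tabs and hyphens between the two parts. The input vector is made up for the example, and the character class `[-[:space:]]+` is just one possible choice of pattern:

```r
# Hypothetical input: parts separated by a space, a hyphen, or a tab.
x <- c("hello World", "good-Morning", "good \t Night")

# "[-[:space:]]+" matches one or more hyphens or whitespace characters.
y <- gsub(pattern = "[-[:space:]]+", replacement = "", x = x)
print(y)
```

Note that this still joins every match in every string, so it only suits a vector that contains nothing but compound terms.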

Edit

@user4241750: True, but I only want to do this for particular compound terms (there are many), not for all the terms in the corpus, since the corpus contains many other terms.

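If you know all the specific compound words you want to change, you can list them and apply the change to each docs[[j]]. Suppose the only terms you want to change are "simple parts" and "good morning":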

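terms.to.change <- c("simple parts", "good morning")
# Loop over every document; each docs[[j]] is assumed to be a vector of terms.
for (j in seq_along(docs)) {
  # Positions whose element exactly equals one of the listed terms
  positions.to.change <- which(docs[[j]] %in% terms.to.change)
  # Remove the spaces only at those positions
  docs[[j]][positions.to.change] <- gsub(" ", "", docs[[j]][positions.to.change])
}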
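The loop above only matches elements that are exactly equal to one of the listed terms. If each document is instead stored as one long string, a variant along the following lines may be closer to what is needed. This is a sketch under that assumption: the sample `docs` vector is made up, and `joined` is a helper name introduced only for illustration.

```r
# Hypothetical corpus: each element is one whole document as a single string.
docs <- c("these simple parts say good morning",
          "good night to all other parts")
terms.to.change <- c("simple parts", "good morning")

for (term in terms.to.change) {
  # Joined form of the term, e.g. "simple parts" -> "simpleparts".
  joined <- gsub(" ", "", term, fixed = TRUE)
  # fixed = TRUE treats the term as literal text rather than as a regex.
  docs <- gsub(term, joined, docs, fixed = TRUE)
}
print(docs)
```

Only the listed terms are joined; "good night" in the second document is left alone. Because the match is literal, a term is also replaced when it appears inside a longer word, so a regex with word boundaries might be preferable for short terms.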