Removing gibberish from sentences

While cleaning text, is it possible to detect and remove junk like this from sentences:

x <- c("Thisisaverylongexample and I was to removeitnow", "thisisjustjunk but I do I remove it")

Currently I am doing something like this:

str_detect(x, pattern = 'Thisisaverylongexample')

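But the more I look through my data frame, the more sentences I find that contain this kind of junk. How can I use something like a regular expression to detect and remove the rows with such junk?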

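If the 'junk' can be detected by its unusual length, you can define a rule accordingly. For example, if you want to get rid of words of 10 or more characters, this will extract them: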

library(stringr)
str_extract_all(x, "\\b\\w{10,}\\b")
[[1]]
[1] "Thisisaverylongexample" "removeitnow"           

[[2]]
[1] "thisisjustjunk"

This will get rid of them:

trimws(gsub("\\b\\w{10,}\\b", "", x))
[1] "and I was to"         "but I do I remove it"
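
Since the question also asks about rows in a data frame, the same length rule can be used to flag or drop whole rows. The following is only a sketch: the data frame `df`, its column `text`, and the third, clean sentence are made up for illustration, and the 10-character threshold will also catch legitimate long words such as "information", so it may need tuning for real data.

```r
# Hypothetical data frame; the column name `text` is an assumption
df <- data.frame(
  text = c("Thisisaverylongexample and I was to removeitnow",
           "thisisjustjunk but I do I remove it",
           "a clean short sentence"),
  stringsAsFactors = FALSE
)

# Same rule as above: any run of 10 or more word characters
junk_pattern <- "\\b\\w{10,}\\b"

# TRUE for rows that contain at least one such word
has_junk <- grepl(junk_pattern, df$text, perl = TRUE)

# Option 1: drop the flagged rows entirely
df_clean_rows <- df[!has_junk, , drop = FALSE]

# Option 2: keep every row, strip the junk words,
# then collapse leftover runs of spaces and trim the ends
stripped <- gsub(junk_pattern, "", df$text, perl = TRUE)
df$text_clean <- trimws(gsub("\\s+", " ", stripped))
```

Option 1 keeps only the rows with no long words, while Option 2 keeps all rows and cleans them; the extra `gsub("\\s+", " ", ...)` step matters when a junk word sits in the middle of a sentence and would otherwise leave a double space behind.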

Data:

x <- c("Thisisaverylongexample and I was to removeitnow", "thisisjustjunk but I do I remove it")