Removing gibberish from sentences
During text cleaning, is it possible to detect and remove junk like this from sentences:
x <- c("Thisisaverylongexample and I was to removeitnow", "thisisjustjunk but I do I remove it")
Currently I am doing something like this:
str_detect(x, pattern = 'Thisisaverylongexample')
But the more I look through the data frame, the more sentences with this kind of junk I find. How can I use something like a regular expression to detect and remove rows containing such junk?
If the 'junk' is detectable by its unusual length, you can define a rule accordingly. For example, if you want to remove words of 10 or more characters, this will extract them:
library(stringr)
str_extract_all(x, "\\b\\w{10,}\\b")
[[1]]
[1] "Thisisaverylongexample" "removeitnow"
[[2]]
[1] "thisisjustjunk"
And this will get rid of them:
trimws(gsub("\\b\\w{10,}\\b", "", x))
[1] "and I was to" "but I do I remove it"
Data:
x <- c("Thisisaverylongexample and I was to removeitnow", "thisisjustjunk but I do I remove it")
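Since the question asks about dropping whole rows rather than individual words, here is a minimal sketch of that variant, assuming the same "10 or more characters means junk" rule marks an entire element for removal (the extra clean sentence is added for illustration only):

```r
library(stringr)

x <- c("Thisisaverylongexample and I was to removeitnow",
       "this sentence is fine",
       "thisisjustjunk but I do I remove it")

# Keep only elements that contain no word of 10+ characters
clean <- x[!str_detect(x, "\\b\\w{10,}\\b")]
clean
#> [1] "this sentence is fine"
```

Note the double backslashes: in an R string literal, `"\\b"` is needed to pass the regex word boundary `\b` to the engine; a single `\b` is interpreted as a backspace escape.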