Removing gibberish from sentences
During text cleaning, is it possible to detect and remove junk like this from sentences:
x <- c("Thisisaverylongexample and I was to removeitnow", "thisisjustjunk but I do I remove it")
Currently I am doing something like this:
str_detect(x, pattern = 'Thisisaverylongexample')
But the more I look through the data frame, the more sentences with this kind of junk I find. How can I use something like a regular expression to detect and remove rows containing such junk?
If the 'junk' is detectable by its unusual length, you can define a rule accordingly. For example, if you want to remove words of 10 or more characters, this will extract them:
library(stringr)
str_extract_all(x, "\\b\\w{10,}\\b")
[[1]]
[1] "Thisisaverylongexample" "removeitnow"
[[2]]
[1] "thisisjustjunk"
And this will get rid of them:
trimws(gsub("\\b\\w{10,}\\b", "", x))
[1] "and I was to" "but I do I remove it"
Data:
x <- c("Thisisaverylongexample and I was to removeitnow", "thisisjustjunk but I do I remove it")
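Since the question asks about dropping whole rows rather than individual words, here is a minimal sketch of that variant, assuming the same "10 or more characters means junk" rule marks an entire element for removal (the extra clean sentence is added for illustration only):

```r
library(stringr)

x <- c("Thisisaverylongexample and I was to removeitnow",
       "this sentence is fine",
       "thisisjustjunk but I do I remove it")

# Keep only elements that contain no word of 10+ characters
clean <- x[!str_detect(x, "\\b\\w{10,}\\b")]
clean
#> [1] "this sentence is fine"
```

Note the double backslashes: in an R string literal, `"\\b"` is needed to pass the regex word boundary `\b` to the engine; a single `\b` is interpreted as a backspace escape.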