R: Cleaning a string using a list of wanted substrings
I have a data frame with strings:
Clause <- c('Big Truck Pb Anomaly Low', 'Red Truck Fe Anomaly High', 'Blue Truck Pb Anomaly High & Old Truck Fe Anomaly Low')
Input <- data.frame(Clause)
I want to clean each string by keeping only the substrings found in a cleaning list:
Keepers <- c('Anomaly', 'Low', 'High', '&', 'Pb', 'Fe', ' ', 'SomethingNotPresent')
The desired result is as follows:
Wanted <- c('Pb Anomaly Low', 'Fe Anomaly High', 'Pb Anomaly High & Fe Anomaly Low')
Result <- data.frame(Wanted)
Note: the 'Keepers' list will also contain items such as 'SomethingNotPresent'.
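You can build a regex alternation of the whitelisted terms to keep, then use a negative lookahead pattern to identify all terms/whitespace that should be removed: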
alternation <- paste(Keepers, collapse="|")
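# Backslashes must be doubled inside R string literals
regex <- paste0("\\s*(?!(?:", alternation, "))(?<!\\S)\\S+(?!\\S)\\s*")
# Remove unwanted terms, then trim and collapse the leftover whitespace
df$clause <- gsub("\\s+", " ", trimws(gsub(regex, " ", df$clause, perl=TRUE)))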
df
clause
1 Pb Anomaly Low
2 Fe Anomaly High
3 Pb Anomaly High & Fe Anomaly Low
Data:
inp <- c('Big Truck Pb Anomaly Low', 'Red Truck Fe Anomaly High',
'Blue Truck Pb Anomaly High & Old Truck Fe Anomaly Low')
df <- data.frame(clause=inp, stringsAsFactors=FALSE)
Keepers <- c('Anomaly', 'Low', 'High', '&', 'Pb', 'Fe', ' ', 'SomethingNotPresent')
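One caveat with the pattern above: the lookahead only checks that a word starts with a keeper, and the keepers are pasted into the regex unescaped. The following is a minimal sketch of a stricter variant, not part of the original answer. It assumes PCRE's `\Q...\E` literal quoting (available with `perl=TRUE`) and adds a third input row, invented here only to illustrate the prefix case.

```r
# Sample data from the answer, plus one extra row (an assumption, added
# only to show a word that merely starts with a keeper: 'Lowest').
inp <- c('Big Truck Pb Anomaly Low',
         'Blue Truck Pb Anomaly High & Old Truck Fe Anomaly Low',
         'Big Truck Pb Anomaly Lowest')
Keepers <- c('Anomaly', 'Low', 'High', '&', 'Pb', 'Fe', ' ', 'SomethingNotPresent')

# Quote each keeper with \Q...\E so PCRE treats it as literal text,
# even if it contains regex metacharacters.
alternation <- paste0("\\Q", Keepers, "\\E", collapse = "|")

# Same pattern as the answer, except the lookahead also requires the keeper
# to be followed by whitespace or end of string, so 'Lowest' is not kept
# just because it begins with 'Low'.
regex <- paste0("\\s*(?!(?:", alternation, ")(?!\\S))(?<!\\S)\\S+(?!\\S)\\s*")

# Remove unwanted terms, then trim and collapse the leftover whitespace.
cleaned <- gsub("\\s+", " ", trimws(gsub(regex, " ", inp, perl = TRUE)))
print(cleaned)
```

With the original pattern the third row would keep 'Lowest'; with the extra `(?!\S)` it should be dropped, leaving "Pb Anomaly".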
You can split each string into words and keep only the words found in Keepers for each row.
# as.character() guards against Clause being a factor (the default before R 4.0)
sapply(strsplit(as.character(Input$Clause), '\\s+'), function(x)
  paste0(x[x %in% Keepers], collapse = ' '))
#[1] "Pb Anomaly Low" "Fe Anomaly High" "Pb Anomaly High & Fe Anomaly Low"
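To store the cleaned text back in the data frame and check it against the desired result, the split-and-filter idea can be wrapped in a small function. This is only a sketch: `keep_tokens` and the `Cleaned` column are hypothetical names, not from the original answer.

```r
Clause <- c('Big Truck Pb Anomaly Low', 'Red Truck Fe Anomaly High',
            'Blue Truck Pb Anomaly High & Old Truck Fe Anomaly Low')
Input <- data.frame(Clause, stringsAsFactors = FALSE)
Keepers <- c('Anomaly', 'Low', 'High', '&', 'Pb', 'Fe', ' ', 'SomethingNotPresent')

# Hypothetical helper: split one string on runs of whitespace, keep only
# the tokens that appear in `keepers`, and rejoin them with single spaces.
keep_tokens <- function(s, keepers) {
  tokens <- strsplit(s, "\\s+")[[1]]
  paste(tokens[tokens %in% keepers], collapse = " ")
}

# vapply guarantees one character value per row.
Input$Cleaned <- vapply(Input$Clause, keep_tokens, character(1),
                        keepers = Keepers, USE.NAMES = FALSE)

Wanted <- c('Pb Anomaly Low', 'Fe Anomaly High', 'Pb Anomaly High & Fe Anomaly Low')
print(identical(Input$Cleaned, Wanted))
```

Because this compares whole tokens with `%in%`, keepers are matched exactly and need no regex escaping; the trade-off is that a keeper containing a space could never match a single token.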