Extract text from the innermost nested parentheses of a string
I am trying to extract a specific subset of each of the text strings below.
string <- c("(Intercept)", "scale(AspectCos_30)", "scale(CanCov_500)",
"scale(DST50_30)", "scale(Ele_30)", "scale(NDVI_Tin_250)", "scale(Slope_500)",
"I(scale(Slope_500)^2)", "scale(SlopeVar_30)", "scale(CanCov_1000)",
"scale(NDVI_Tin_1000)", "scale(Slope_1000)", "I(scale(Slope_1000)^2)",
"scale(log(SlopeVar_30 + 0.001))", "scale(CanCov_30)", "scale(Slope_30)",
"I(scale(Slope_30)^2)")
A good result would return the central text without any special characters, as shown below.
Good <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "Slope",
"SlopeVar", "CanCov", "NDVI", "Slope", "Slope", "SlopeVar", "CanCov", "Slope", "Slope")
Ideally, however, the resulting strings would also reflect the ^2 and log associated with 'Slope' and 'SlopeVar', respectively. Specifically, every string containing ^2 would become 'SlopeSq', and every string containing log would become 'SlopeVarPs', as shown below.
Best <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "SlopeSq",
"SlopeVar", "CanCov", "NDVI", "Slope", "SlopeSq", "SlopeVarPs", "CanCov", "Slope", "SlopeSq")
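I have a long, ugly, and inefficient sequence of code that gets me roughly halfway to the Good result, and I would appreciate any suggestions.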
As a not-so-efficient coder, I like to chain several regular expressions to get the result (what each regex does is noted in the comment on its line):
library(stringr)
library(dplyr)
string %>%
  str_replace_all(".*log\\((.*?)(_.+?)?\\).*", "\\1Ps") %>% # deal with the "log" entry
  str_replace_all(".*\\((.*?\\))", "\\1") %>%               # delete everything before the last "("
  str_replace_all("(_\\d+)?\\)\\^2", "Sq") %>%              # take care of ^2
  str_replace_all("(_.+)?\\)?", "") -> outcome              # remove trailing extras (e.g. "_30" and ")")
Best <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "SlopeSq",
"SlopeVar", "CanCov", "NDVI", "Slope", "SlopeSq", "SlopeVarPs", "CanCov","Slope", "SlopeSq")
all(outcome == Best)
## TRUE
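To make the chain easier to follow, here is a minimal sketch that traces a single input from the question through the four replacements. The intermediate names step1 to step4 are introduced only for illustration and are not part of the answer above.

```r
library(stringr)

x <- "I(scale(Slope_500)^2)"  # one input taken from the question

# Each step applies one regex from the chain; the comment shows the value it returns.
step1 <- str_replace_all(x, ".*log\\((.*?)(_.+?)?\\).*", "\\1Ps")  # no "log(", so unchanged
step2 <- str_replace_all(step1, ".*\\((.*?\\))", "\\1")             # "Slope_500)^2)"
step3 <- str_replace_all(step2, "(_\\d+)?\\)\\^2", "Sq")            # "SlopeSq)"
step4 <- str_replace_all(step3, "(_.+)?\\)?", "")                   # "SlopeSq"

print(c(step1, step2, step3, step4))
```

The order matters: the "log" rule runs first because the later steps would otherwise strip the parentheses it relies on.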
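I think this can be done with the stringr package.
First, you want the "central text" inside the innermost parentheses, so the regex below only matches a parenthesised group that contains no further parentheses. I keep any leading "log" or trailing "^2" for later use.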
string_step <- str_extract(string,
                           "(log|)\\([^()]+\\)(\\^2|)")
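Then I noticed that everything from the first underscore onward is dropped, so only the leading run of letters (and digits) is kept. Unlike \w (written "\\w" in R), which also matches the underscore, "[:alnum:]+" behaves like "[A-Za-z0-9]+" for these ASCII inputs, so that is what I use.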
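As a small illustration of that difference (a sketch using one variable name from the question; the names x, with_w and with_alnum are only for this example):

```r
library(stringr)

x <- "NDVI_Tin_250"

# \\w treats the underscore as a word character, so the whole string matches.
with_w <- str_extract(x, "\\w+")             # "NDVI_Tin_250"

# [:alnum:] excludes the underscore, so the match stops before it.
with_alnum <- str_extract(x, "[:alnum:]+")   # "NDVI"

print(c(with_w, with_alnum))
```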
GoodMy <-
  str_extract(str_replace_all(string_step, "log|\\(|\\)|\\^2", ""),
              "[:alnum:]+")
BestMy <-
  paste0(Good, as.character(sapply(string_step, function(x) {
    if (str_detect(x, "log")) {
      "Ps"
    } else if (str_detect(x, "\\^2")) {
      "Sq"
    } else {
      ""
    }
  })))
all(Good == GoodMy, Best == BestMy) # yields TRUE
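Because str_detect() is vectorised, the suffix step could probably also be written without sapply(). The following is a minimal sketch on three representative inputs from the question, not a replacement for the answer above; the names base and suffix are introduced here for illustration.

```r
library(stringr)

# Three representative inputs taken from the question
string <- c("scale(Slope_500)", "I(scale(Slope_500)^2)",
            "scale(log(SlopeVar_30 + 0.001))")

# Innermost parenthesised group, keeping a leading "log" or a trailing "^2":
# "(Slope_500)", "(Slope_500)^2", "log(SlopeVar_30 + 0.001)"
string_step <- str_extract(string, "(log|)\\([^()]+\\)(\\^2|)")

# Base name: strip the markers, then take the first alphanumeric run
base <- str_extract(str_replace_all(string_step, "log|\\(|\\)|\\^2", ""),
                    "[:alnum:]+")

# Nested ifelse() picks the suffix for the whole vector at once
suffix <- ifelse(str_detect(string_step, "log"), "Ps",
                 ifelse(str_detect(string_step, "\\^2"), "Sq", ""))

result <- paste0(base, suffix)
print(result)  # "Slope" "SlopeSq" "SlopeVarPs"
```

One caveat of detecting the literal text "log": a variable whose name itself contained "log" would also receive the "Ps" suffix, so the pattern may need tightening (for example to "log\\(") on other data.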