Replacing values of a string variable in stringr
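Many values in my data frame refer to the same thing but are written differently.
I need to change some of the column values so that they are consistent.
I used str_replace_all from the stringr package, but it did not do what I wanted. Here are my reproducible data and code.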
df <- data.frame(
  stringsAsFactors = FALSE,
  Var1 = c("16-pathway", "16a-OH E1",
           "16a-OHE1", "16OHE", "17-b-estradiol", "17-OH-progesterone",
           "17-OH-progesterone/ androstenedione ratio",
           "17b-HSD (rs2830A)", "17b-HSD (rs592389 G)", "17b-HSD (rs615492 G)",
           "17b-HSD (rs615942 G)", "17b estradiol",
           "17OH-progesterone", "2-hydroxy (OH) E1", "2-OHE-1", "2-OHE-2",
           "2-pathway", "2:16 OHE ratio", "2:16 pathway ratio", "2:16a-OH E1",
           "2:16OHE", "2OHE", "Adiponectin", "androstenedione",
           "Androstenedione", "androstenedione (A)"),
  Freq = c(2L, 1L, 4L, 8L, 1L, 6L, 6L, 2L,
           2L, 1L, 1L, 1L, 5L, 1L, 4L, 4L, 2L, 4L, 2L, 1L, 8L, 8L,
           8L, 1L, 62L, 1L)
)
library(stringr)
df$new_var1 <- str_replace_all(df$Var1,
  c(# 16OHE1
    "16a-OH E1" = "16-OHE1",
    "16a-OHE1" = "16-OHE1",
    "16OHE" = "16-OHE1",
    # 17Beta estradiol
    "17-b-estradiol" = "17-b-estradiol",
    "17b estradiol" = "17-b-estradiol",
    # Androstenedione
    "androstenedione" = "Androstenedione",
    "Androstenedione" = "Androstenedione",
    "androstenedione (A)" = "Androstenedione",
    # 2-OHE-1
    "2-OHE-1" = "2-OHE-1",
    "2-hydroxy (OH) E1" = "2-OHE-1")
)
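Now, if you compare Var1 and new_var1, "2-hydroxy (OH) E1" is not changed to "2-OHE-1", and "androstenedione (A)" is not changed to "Androstenedione".
In str_replace_all, the patterns are regular expressions, so you need to escape ( and ) by putting a double backslash in front of each. Try the following. :)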
df$new_var1 <- str_replace_all(df$Var1,
  c(# 16OHE1
    "16a-OH E1" = "16-OHE1",
    "16a-OHE1" = "16-OHE1",
    "16OHE" = "16-OHE1",
    "17-b-estradiol" = "17-b-estradiol",
    "17b estradiol" = "17-b-estradiol",
    "androstenedione" = "Androstenedione",
    "Androstenedione" = "Androstenedione",
    # parentheses escaped with a double backslash
    "androstenedione \\(A\\)" = "Androstenedione",
    "2-OHE-1" = "2-OHE-1",
    "2-hydroxy \\(OH\\) E1" = "2-OHE-1"))
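To see why the escaping matters, here is a minimal sketch on a single value. An unescaped ( ... ) is a regex group, so the pattern looks for the text "2-hydroxy OH E1" without parentheses and finds nothing. The variable names are only for illustration; wrapping a single pattern in fixed() is another way to match it literally.

```r
library(stringr)

x <- "2-hydroxy (OH) E1"

# Unescaped: "(OH)" is a capture group, so the regex looks for "2-hydroxy OH E1"
unescaped <- str_replace_all(x, "2-hydroxy (OH) E1", "2-OHE-1")

# Escaped: "\\(" in R source is the regex "\(", i.e. a literal parenthesis
escaped <- str_replace_all(x, "2-hydroxy \\(OH\\) E1", "2-OHE-1")

# fixed() turns off regex interpretation for this one pattern
literal <- str_replace_all(x, fixed("2-hydroxy (OH) E1"), "2-OHE-1")

unescaped  # "2-hydroxy (OH) E1" (unchanged)
escaped    # "2-OHE-1"
literal    # "2-OHE-1"
```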
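You need to change two things in your code to get the desired output. The first is the one @Emax mentioned: escape the parentheses with double backslashes (\\( and \\)). The second is the order of the replacements, because they are applied one after another and an earlier replacement can change what a later one sees. This is why "androstenedione (A)" in your original post was not replaced by "Androstenedione": the replacement "androstenedione" = "Androstenedione" runs before "androstenedione \\(A\\)" = "Androstenedione", so by the time the second pattern is tried the string already reads "Androstenedione (A)" and no longer matches. A simple way to get the desired output is to replace the most specific cases first (e.g. "androstenedione \\(A\\)") and the more general cases afterwards (e.g. "androstenedione").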
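The effect of the ordering can be checked on one value. This is just a small sketch; general_first and specific_first are illustrative names.

```r
library(stringr)

x <- "androstenedione (A)"

# General pattern first: it rewrites the lowercase "a", so the later,
# more specific (case-sensitive) pattern no longer matches.
general_first <- str_replace_all(x, c(
  "androstenedione"         = "Androstenedione",
  "androstenedione \\(A\\)" = "Androstenedione"
))

# Specific pattern first: the whole "androstenedione (A)" is consumed
# before the general pattern is tried.
specific_first <- str_replace_all(x, c(
  "androstenedione \\(A\\)" = "Androstenedione",
  "androstenedione"         = "Androstenedione"
))

general_first   # "Androstenedione (A)"
specific_first  # "Androstenedione"
```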
library(stringr)
df$new_var1 <- str_replace_all(df$Var1,
  c(# 16OHE1
    "16a-OH E1" = "16-OHE1",
    "16a-OHE1" = "16-OHE1",
    "16OHE" = "16-OHE1",
    # 17Beta estradiol
    "17-b-estradiol" = "17-b-estradiol",
    "17b estradiol" = "17-b-estradiol",
    # Androstenedione: most specific pattern first
    "androstenedione \\(A\\)" = "Androstenedione",
    "androstenedione" = "Androstenedione",
    "Androstenedione" = "Androstenedione",
    # 2-OHE-1
    "2-OHE-1" = "2-OHE-1",
    "2-hydroxy \\(OH\\) E1" = "2-OHE-1")
)
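One thing to keep in mind with this call: str_replace_all replaces matching substrings, so the "androstenedione" inside "17-OH-progesterone/ androstenedione ratio" gets capitalised as well. If only whole values should be recoded, an exact lookup may be easier to reason about. The following is a sketch of that alternative using base R match(), not part of the answer above; the lookup vector and the small var1 stand-in for df$Var1 are assumptions made so the example runs on its own.

```r
# Hypothetical lookup table: names are the raw spellings, values the canonical ones.
lookup <- c(
  "16a-OH E1"           = "16-OHE1",
  "16a-OHE1"            = "16-OHE1",
  "16OHE"               = "16-OHE1",
  "17b estradiol"       = "17-b-estradiol",
  "androstenedione"     = "Androstenedione",
  "androstenedione (A)" = "Androstenedione",
  "2-hydroxy (OH) E1"   = "2-OHE-1"
)

# Small stand-in for df$Var1.
var1 <- c("16OHE", "2:16OHE", "androstenedione (A)",
          "17-OH-progesterone/ androstenedione ratio", "2-hydroxy (OH) E1")

# match() compares whole strings literally, so no escaping and no ordering issues.
idx <- match(var1, names(lookup))   # NA where a value has no entry
new_var1 <- ifelse(is.na(idx), var1, unname(lookup[idx]))

new_var1
# "16-OHE1" "2:16OHE" "Androstenedione"
# "17-OH-progesterone/ androstenedione ratio" "2-OHE-1"
```

Values without an entry in the lookup are left exactly as they were.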
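Here is an approach using agrep (fuzzy matching) that does not require escaping any parentheses. If needed, you can also set limits on insertions, deletions and substitutions in agrep to handle other cases.
Replacements: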
repl <- c(`16a-OH E1` = "16-OHE1", `16a-OHE1` = "16-OHE1", `16OHE` = "16-OHE1",
          `17-b-estradiol` = "17-b-estradiol", `17b estradiol` = "17-b-estradiol",
          androstenedione = "Androstenedione", Androstenedione = "Androstenedione",
          `Androstenedione (A)` = "Androstenedione", `2-OHE-1` = "2-OHE-1",
          `2-hydroxy (OH) E1` = "2-OHE-1")
df$new_var1 <- sapply(seq_along(df$Var1), function(x) {
  # first fuzzy match of this value among the names of repl (NA if none)
  re <- repl[agrep(df$Var1[x], names(repl))][1]
  # keep the original value when nothing matched
  ifelse(is.na(re), df$Var1[x], re)
})
df$new_var1
[1] "16-pathway"
[2] "16-OHE1"
[3] "16-OHE1"
[4] "16-OHE1"
[5] "17-b-estradiol"
[6] "17-OH-progesterone"
[7] "17-OH-progesterone/ androstenedione ratio"
[8] "17b-HSD (rs2830A)"
[9] "17b-HSD (rs592389 G)"
[10] "17b-HSD (rs615492 G)"
[11] "17b-HSD (rs615942 G)"
[12] "17-b-estradiol"
[13] "17OH-progesterone"
[14] "2-OHE-1"
[15] "2-OHE-1"
[16] "2-OHE-1"
[17] "2-pathway"
[18] "2:16 OHE ratio"
[19] "2:16 pathway ratio"
[20] "16-OHE1"
[21] "2:16OHE"
[22] "16-OHE1"
[23] "Adiponectin"
[24] "Androstenedione"
[25] "Androstenedione"
[26] "Androstenedione"
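Note that with the default tolerance "2-OHE-2" ends up as "2-OHE-1" in the output above, because the two differ by a single edit. The max.distance argument of agrep controls how many edits are allowed. A minimal sketch of tightening it (the canonical vector is just an illustration):

```r
canonical <- c("2-OHE-1", "16-OHE1")

# Default max.distance = 0.1: one edit is allowed for a 7-character pattern,
# so "2-OHE-2" fuzzily matches "2-OHE-1".
loose <- agrep("2-OHE-2", canonical)

# No edits allowed: only an exact (substring) match counts.
strict <- agrep("2-OHE-2", canonical, max.distance = 0)

# The same limit spelled out per edit type.
strict_list <- agrep("2-OHE-2", canonical,
                     max.distance = list(insertions = 0,
                                         deletions = 0,
                                         substitutions = 0))

loose        # 1
strict       # integer(0)
strict_list  # integer(0)
```

Whether a looser or a stricter setting is right depends on how different the spellings of genuinely distinct values are in your data.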