Replacing values of a string variable in stringr

Question

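Many values in my data frame refer to the same thing but are written differently, so I need to recode some column values to make them consistent. I tried str_replace_all from the stringr package, but it does not do what I want. Here are my reproducible data and code.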

df <- data.frame(
  stringsAsFactors = FALSE,
              Var1 = c("16-pathway","16a-OH E1",
                       "16a-OHE1","16OHE","17-b-estradiol","17-OH-progesterone",
                       "17-OH-progesterone/ androstenedione ratio",
                       "17b-HSD (rs2830A)","17b-HSD (rs592389 G)","17b-HSD (rs615492 G)",
                       "17b-HSD (rs615942 G)","17b estradiol",
                       "17OH-progesterone","2-hydroxy (OH) E1","2-OHE-1","2-OHE-2",
                       "2-pathway","2:16 OHE ratio","2:16 pathway ratio","2:16a-OH E1",
                       "2:16OHE","2OHE","Adiponectin","androstenedione",
                       "Androstenedione","androstenedione  (A)"),
              Freq = c(2L,1L,4L,8L,1L,6L,6L,2L,
                       2L,1L,1L,1L,5L,1L,4L,4L,2L,4L,2L,1L,8L,8L,
                       8L,1L,62L,1L)
  )

library(stringr)
df$new_var1 <- str_replace_all(df$Var1,
                                  c(#16OHE1
                                    "16a-OH E1" = "16-OHE1", 
                                    "16a-OHE1" = "16-OHE1", 
                                    "16OHE" = "16-OHE1",
                                    
                                    #17Beta estradiol
                                    "17-b-estradiol" = "17-b-estradiol",
                                    "17b estradiol"= "17-b-estradiol",
                                    #Andreostenedione

                                    "androstenedione" = "Androstenedione",
                                    "Androstenedione" = "Androstenedione",
                                    "androstenedione  (A)" = "Androstenedione",

                                    #2-OHE-1
                                    "2-OHE-1" = "2-OHE-1",
                                    "2-hydroxy (OH) E1" = "2-OHE-1")
)

Now, if you compare Var1 and new_var1, "2-hydroxy (OH) E1" was not changed to "2-OHE-1", and "Androstenedione (A)" was not changed to "Androstenedione". See the screenshot below.

Answer 1

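In str_replace_all the patterns are regular expressions, so you need to escape ( and ) by putting a double backslash in front of each. Try the following. :)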

df$new_var1 <- str_replace_all(df$Var1,
                               c(#16OHE1
                                 "16a-OH E1" = "16-OHE1", 
                                 "16a-OHE1" = "16-OHE1", 
                                 "16OHE" = "16-OHE1",
                                 "17-b-estradiol" = "17-b-estradiol",
                                 "17b estradiol"= "17-b-estradiol",
                                 "androstenedione" = "Androstenedione",
                                 "Androstenedione" = "Androstenedione",
                                 # parentheses escaped with a double backslash
                                 "androstenedione  \\(A\\)" = "Androstenedione",
                                 "2-OHE-1" = "2-OHE-1",
                                 "2-hydroxy \\(OH\\) E1" = "2-OHE-1"))

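To see why the escape matters, here is a minimal sketch on a single value taken from the question. The variable names (x, unescaped, escaped, literal) are only illustrative and not part of the original answers. As an alternative to escaping, stringr's fixed() treats the pattern as a literal string.

```r
library(stringr)

x <- "2-hydroxy (OH) E1"

# Unescaped: "(OH)" is a regex group, so the pattern looks for "2-hydroxy OH E1".
unescaped <- str_detect(x, "2-hydroxy (OH) E1")

# Escaped: "\\(" and "\\)" match literal parentheses.
escaped <- str_detect(x, "2-hydroxy \\(OH\\) E1")

# fixed() compares the pattern literally, so no escaping is needed.
literal <- str_replace_all(x, fixed("2-hydroxy (OH) E1"), "2-OHE-1")

unescaped  # FALSE
escaped    # TRUE
literal    # "2-OHE-1"
```
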
Answer 2

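You need to change two things in your code to get the desired output. The first is the one @Emax mentioned: escape the parentheses with a double backslash (\\( and \\)). The second is the order of the replacements: str_replace_all applies them one after another, so an earlier replacement can change what a later pattern sees. That is why "androstenedione  \\(A\\)" in your original post was not replaced by "Androstenedione": the replacement "androstenedione" = "Androstenedione" ran before "androstenedione  \\(A\\)" = "Androstenedione", so by the time the second pattern was tried the string already started with a capital A and no longer matched. A simple way to get the desired output is to replace the most specific cases first (e.g. "androstenedione  \\(A\\)") and the more general ones afterwards (e.g. "androstenedione").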

library(stringr)
df$new_var1 <- str_replace_all(df$Var1,
                               c(# 16OHE1
                                 "16a-OH E1" = "16-OHE1", 
                                 "16a-OHE1" = "16-OHE1", 
                                 "16OHE" = "16-OHE1",
                                 # 17Beta estradiol
                                 "17-b-estradiol" = "17-b-estradiol",
                                 "17b estradiol"= "17-b-estradiol",
                                 # Androstenedione: most specific pattern first
                                 "androstenedione  \\(A\\)" = "Androstenedione",
                                 "androstenedione" = "Androstenedione",
                                 "Androstenedione" = "Androstenedione",
                                 # 2-OHE-1
                                 "2-OHE-1" = "2-OHE-1",
                                 "2-hydroxy \\(OH\\) E1" = "2-OHE-1")
)
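
The effect of the ordering can be checked on two values from the question. This is only a sketch; x, general_first, and specific_first are illustrative names.

```r
library(stringr)

x <- c("androstenedione", "androstenedione  (A)")

# General pattern first: the second value becomes "Androstenedione  (A)",
# which the case-sensitive specific pattern can no longer match.
general_first <- str_replace_all(
  x,
  c("androstenedione" = "Androstenedione",
    "androstenedione  \\(A\\)" = "Androstenedione")
)

# Specific pattern first: the "(A)" suffix is removed before the general rule runs.
specific_first <- str_replace_all(
  x,
  c("androstenedione  \\(A\\)" = "Androstenedione",
    "androstenedione" = "Androstenedione")
)

general_first   # "Androstenedione" "Androstenedione  (A)"
specific_first  # "Androstenedione" "Androstenedione"
```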

Answer 3

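Here is an approach using agrep (fuzzy matching) that does not require escaping any parentheses. If needed, you can adjust the insertions, deletions, and substitutions that agrep allows in order to handle other cases.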

Replacements

repl <- c(`16a-OH E1` = "16-OHE1", `16a-OHE1` = "16-OHE1", `16OHE` = "16-OHE1", 
`17-b-estradiol` = "17-b-estradiol", `17b estradiol` = "17-b-estradiol", 
androstenedione = "Androstenedione", Androstenedione = "Androstenedione", 
`Androstenedione  (A)` = "Androstenedione", `2-OHE-1` = "2-OHE-1", 
`2-hydroxy (OH) E1` = "2-OHE-1")

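# For each value, fuzzy-match it against the names of repl and take the first hit;
# keep the original value when nothing matches.
df$new_var1 <- sapply(seq_along(df$Var1), function(x) {
  re <- repl[agrep(df$Var1[x], names(repl))][1]
  ifelse(is.na(re), df$Var1[x], re)
})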

df$new_var1
 [1] "16-pathway"                               
 [2] "16-OHE1"                                  
 [3] "16-OHE1"                                  
 [4] "16-OHE1"                                  
 [5] "17-b-estradiol"                           
 [6] "17-OH-progesterone"                       
 [7] "17-OH-progesterone/ androstenedione ratio"
 [8] "17b-HSD (rs2830A)"                        
 [9] "17b-HSD (rs592389 G)"                     
[10] "17b-HSD (rs615492 G)"                     
[11] "17b-HSD (rs615942 G)"                     
[12] "17-b-estradiol"                           
[13] "17OH-progesterone"                        
[14] "2-OHE-1"                                  
[15] "2-OHE-1"                                  
[16] "2-OHE-1"                                  
[17] "2-pathway"                                
[18] "2:16 OHE ratio"                           
[19] "2:16 pathway ratio"                       
[20] "16-OHE1"                                  
[21] "2:16OHE"                                  
[22] "16-OHE1"                                  
[23] "Adiponectin"                              
[24] "Androstenedione"                          
[25] "Androstenedione"                          
[26] "Androstenedione"
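
Fuzzy matching can also merge labels that may be meant to stay distinct: in the output above, "2-OHE-2" became "2-OHE-1" and "2OHE" became "16-OHE1". If that is not wanted, one alternative is an exact lookup on whole strings, which also needs no escaping. A minimal base R sketch, assuming only exact spellings should be recoded; lookup, x, hit, and out are illustrative names and the table is a shortened version of the one above.

```r
# Map each exact spelling to its canonical label.
lookup <- c("16a-OH E1"            = "16-OHE1",
            "16a-OHE1"             = "16-OHE1",
            "17b estradiol"        = "17-b-estradiol",
            "androstenedione  (A)" = "Androstenedione",
            "2-hydroxy (OH) E1"    = "2-OHE-1")

x <- c("16a-OH E1", "2:16a-OH E1", "2-hydroxy (OH) E1", "2OHE",
       "androstenedione  (A)")

# Recode only values that appear verbatim among the lookup names;
# everything else is left untouched.
hit <- x %in% names(lookup)
out <- x
out[hit] <- unname(lookup[x[hit]])

out
# "16-OHE1" "2:16a-OH E1" "2-OHE-1" "2OHE" "Androstenedione"
```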
