Replacing values of a string variable in stringr

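Many values in my data frame refer to the same thing but are written differently, so I need to change some of the column values to make them consistent. I used str_replace_all from the stringr package, but it did not do what I wanted. Here are my reproducible data and code.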

df <- data.frame(
  stringsAsFactors = FALSE,
              Var1 = c("16-pathway","16a-OH E1",
                       "16a-OHE1","16OHE","17-b-estradiol","17-OH-progesterone",
                       "17-OH-progesterone/ androstenedione ratio",
                       "17b-HSD (rs2830A)","17b-HSD (rs592389 G)","17b-HSD (rs615492 G)",
                       "17b-HSD (rs615942 G)","17b estradiol",
                       "17OH-progesterone","2-hydroxy (OH) E1","2-OHE-1","2-OHE-2",
                       "2-pathway","2:16 OHE ratio","2:16 pathway ratio","2:16a-OH E1",
                       "2:16OHE","2OHE","Adiponectin","androstenedione",
                       "Androstenedione","androstenedione  (A)"),
              Freq = c(2L,1L,4L,8L,1L,6L,6L,2L,
                       2L,1L,1L,1L,5L,1L,4L,4L,2L,4L,2L,1L,8L,8L,
                       8L,1L,62L,1L)
  )

library(stringr)
df$new_var1 <- str_replace_all(df$Var1,
                                  c(#16OHE1
                                    "16a-OH E1" = "16-OHE1", 
                                    "16a-OHE1" = "16-OHE1", 
                                    "16OHE" = "16-OHE1",
                                    
                                    #17Beta estradiol
                                    "17-b-estradiol" = "17-b-estradiol",
                                    "17b estradiol"= "17-b-estradiol",
                                    #Androstenedione

                                    "androstenedione" = "Androstenedione",
                                    "Androstenedione" = "Androstenedione",
                                    "androstenedione  (A)" = "Androstenedione",

                                    #2-OHE-1
                                    "2-OHE-1" = "2-OHE-1",
                                    "2-hydroxy (OH) E1" = "2-OHE-1")
)

Now, if you compare Var1 and new_var1, "2-hydroxy (OH) E1" was not changed to "2-OHE-1", and "androstenedione  (A)" was not changed to "Androstenedione".

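In str_replace_all, the patterns are regular expressions, so you need to escape ( and ) by putting a double backslash in front of each. Try the following. :)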

df$new_var1 <- str_replace_all(df$Var1,
                               c(#16OHE1
                                 "16a-OH E1" = "16-OHE1",
                                 "16a-OHE1" = "16-OHE1",
                                 "16OHE" = "16-OHE1",
                                 "17-b-estradiol" = "17-b-estradiol",
                                 "17b estradiol"= "17-b-estradiol",
                                 "androstenedione" = "Androstenedione",
                                 "Androstenedione" = "Androstenedione",
                                 "androstenedione  \\(A\\)" = "Androstenedione",
                                 "2-OHE-1" = "2-OHE-1",
                                 "2-hydroxy \\(OH\\) E1" = "2-OHE-1"))

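To see why the escaping matters, here is a minimal sketch in base R (sub is used instead of stringr only so that the snippet has no dependencies; parentheses are escaped the same way in stringr patterns). The variable names are illustrative and not part of the original code.

```r
x <- "androstenedione  (A)"

# Unescaped: "(A)" is a capture group that matches only the letter "A",
# so the literal parentheses in x are left behind.
unescaped <- sub("(A)", "", x)
unescaped   # "androstenedione  ()"

# Escaped: "\\(" and "\\)" match literal parentheses.
escaped <- sub("\\(A\\)", "", x)
escaped     # "androstenedione  "

# fixed = TRUE treats the whole pattern as a literal string, no escaping needed.
literal <- sub("(A)", "", x, fixed = TRUE)
literal     # "androstenedione  "
```

The double backslash is needed because R itself consumes one backslash when it parses the string literal; the regex engine then sees a single one.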
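You need to change two things in your code to get the desired output. The first is the one @Emax mentioned: escape the parentheses with double backslashes (\\( and \\)). The second is the order of the replacements, because str_replace_all applies them one after another and an earlier replacement can change what a later one sees. That is why "androstenedione  (A)" in your original post was not replaced with "Androstenedione": the replacement "androstenedione" = "Androstenedione" runs before "androstenedione  \\(A\\)" = "Androstenedione", so by the time the second pattern is tried the string already starts with a capital "A" and no longer matches. A simple fix is to replace the most specific cases first (e.g., "androstenedione  \\(A\\)") and the more general ones afterwards (e.g., "androstenedione").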

library(stringr)
df$new_var1 <- str_replace_all(df$Var1,
                               c(#16OHE1
                                 "16a-OH E1" = "16-OHE1", 
                                 "16a-OHE1" = "16-OHE1", 
                                 "16OHE" = "16-OHE1",
                                 #17Beta estradiol
                                 "17-b-estradiol" = "17-b-estradiol",
                                 "17b estradiol"= "17-b-estradiol",
                                 #Androstenedione
                                 "androstenedione  \\(A\\)" = "Androstenedione",
                                 "androstenedione" = "Androstenedione",
                                 "Androstenedione" = "Androstenedione",
                                 #2-OHE-1
                                 "2-OHE-1" = "2-OHE-1",
                                 "2-hydroxy \\(OH\\) E1" = "2-OHE-1")
)
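Note that str_replace_all replaces matches anywhere inside a string, so a pattern such as "androstenedione" also rewrites part of "17-OH-progesterone/ androstenedione ratio". If only whole values should be recoded, one alternative (not what the answers here use) is an exact lookup with a named vector in base R, which needs neither escaping nor a particular order. This is a sketch under that assumption; x, lookup, idx and new_x are illustrative names, and x is a small subset of df$Var1.

```r
# Illustrative subset of the values in df$Var1
x <- c("16a-OH E1", "16OHE", "2:16OHE", "androstenedione  (A)",
       "17-OH-progesterone/ androstenedione ratio", "2-hydroxy (OH) E1")

# Names are the spellings to recode, values are the canonical spellings.
lookup <- c(
  "16a-OH E1"            = "16-OHE1",
  "16a-OHE1"             = "16-OHE1",
  "16OHE"                = "16-OHE1",
  "17b estradiol"        = "17-b-estradiol",
  "androstenedione"      = "Androstenedione",
  "androstenedione  (A)" = "Androstenedione",
  "2-hydroxy (OH) E1"    = "2-OHE-1"
)

# match() compares whole strings literally; values without an entry give NA.
idx <- match(x, names(lookup))

# Keep the original value wherever there is no entry in the lookup table.
new_x <- ifelse(is.na(idx), x, unname(lookup[idx]))
new_x
```

Here "2:16OHE" and the ratio entry are left untouched because they are not listed as whole values in lookup.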

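Here is an approach using agrep (fuzzy matching), which does not require escaping any parentheses. If needed, you can also tune how many insertions, deletions, and substitutions agrep tolerates through its max.distance argument.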

Replacements

repl <- c(`16a-OH E1` = "16-OHE1", `16a-OHE1` = "16-OHE1", `16OHE` = "16-OHE1", 
`17-b-estradiol` = "17-b-estradiol", `17b estradiol` = "17-b-estradiol", 
androstenedione = "Androstenedione", Androstenedione = "Androstenedione", 
`Androstenedione  (A)` = "Androstenedione", `2-OHE-1` = "2-OHE-1", 
`2-hydroxy (OH) E1` = "2-OHE-1")
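# For each value, take the first fuzzy match among the names of repl;
# keep the original value when nothing matches.
df$new_var1 <- sapply(seq_along(df$Var1), function(x) {
  re <- repl[agrep(df$Var1[x], names(repl))][1]
  ifelse(is.na(re), df$Var1[x], re)
})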

df$new_var1
 [1] "16-pathway"                               
 [2] "16-OHE1"                                  
 [3] "16-OHE1"                                  
 [4] "16-OHE1"                                  
 [5] "17-b-estradiol"                           
 [6] "17-OH-progesterone"                       
 [7] "17-OH-progesterone/ androstenedione ratio"
 [8] "17b-HSD (rs2830A)"                        
 [9] "17b-HSD (rs592389 G)"                     
[10] "17b-HSD (rs615492 G)"                     
[11] "17b-HSD (rs615942 G)"                     
[12] "17-b-estradiol"                           
[13] "17OH-progesterone"                        
[14] "2-OHE-1"                                  
[15] "2-OHE-1"                                  
[16] "2-OHE-1"                                  
[17] "2-pathway"                                
[18] "2:16 OHE ratio"                           
[19] "2:16 pathway ratio"                       
[20] "16-OHE1"                                  
[21] "2:16OHE"                                  
[22] "16-OHE1"                                  
[23] "Adiponectin"                              
[24] "Androstenedione"                          
[25] "Androstenedione"                          
[26] "Androstenedione"
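In the output above, the fuzzy matching also rewrote a few values that have no entry of their own in repl: row 16 ("2-OHE-2") became "2-OHE-1", and rows 20 ("2:16a-OH E1") and 22 ("2OHE") became "16-OHE1". If that is not wanted, the tolerance can be tightened with max.distance. The following is a minimal sketch of the effect on a single pair of strings; loose and strict are illustrative names.

```r
# Default tolerance: 10% of the pattern length, rounded up, so a
# four-character pattern is allowed one edit. "2OHE" is one substitution
# away from the substring "6OHE" of "16OHE", so it counts as a match.
loose <- agrep("2OHE", "16OHE")
loose    # 1

# max.distance = 0 allows no edits, so only an exact substring would match.
strict <- agrep("2OHE", "16OHE", max.distance = 0)
strict   # integer(0)
```

A stricter setting trades convenience for safety: fewer accidental recodings, but spellings that differ by more than the allowed number of edits have to be listed explicitly.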