Use recode() to clean a data frame column

How can I use recode() to "clean/strip" parts of a column in a data frame? The original data frame looks like this:

df <- data.frame(duration = c("concentration, up to 2 minutes", "concentration, up to 4 minutes", "up to 6 hours"), name = c("Earth", "Water", "Fire"))

The cleaned-up version should look like this:

df <- data.frame(duration = c("2 minutes", "4 minutes", "6 hours"), name = c("Earth", "Water", "Fire"))

So I need to either delete "concentration," and "up to", or use the recode function to replace them with an empty string.

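Below are solutions using dplyr::recode() and stringr::str_remove().

My suggestion, though, is to learn the latter as well. That way you will pick up more powerful ways of transforming strings with regular expressions.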

Solution with dplyr::recode()

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data.frame(duration = c("concentration, up to 2 minutes", 
                              "concentration, up to 4 minutes", 
                              "up to 6 hours"), 
                 name = c("Earth", "Water", "Fire"))

df$duration = recode(df$duration, 
                     "concentration, up to 2 minutes" = "2 minutes",
                     "concentration, up to 4 minutes" = "4 minutes",
                     "up to 6 hours" = "6 hours" )
df
#>    duration  name
#> 1 2 minutes Earth
#> 2 4 minutes Water
#> 3   6 hours  Fire

Created on 2020-05-04 by the reprex package (v0.3.0)
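
When there are more than a handful of values to map, writing every pair inside the recode() call gets long. A minimal sketch of one alternative, assuming the old-to-new mapping is kept in a named character vector and spliced in with `!!!`: the name `duration_key` and the extra value "instantaneous" are made up for illustration and are not part of the original data.

```r
library(dplyr)

# Hypothetical lookup vector: names are the old values, elements are the new ones
duration_key <- c(
  "concentration, up to 2 minutes" = "2 minutes",
  "concentration, up to 4 minutes" = "4 minutes",
  "up to 6 hours"                  = "6 hours"
)

# "instantaneous" is an invented value with no entry in the lookup vector
duration <- c("concentration, up to 2 minutes", "up to 6 hours", "instantaneous")

# !!! splices the named vector into recode() as old = new pairs;
# character values without a match are left unchanged
cleaned <- recode(duration, !!!duration_key)
cleaned
```

Either way, recode() only matches whole values exactly, so each distinct string in the column needs its own entry.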

Solution with stringr::str_remove()

library(stringr)
df <- data.frame(duration = c("concentration, up to 2 minutes", 
                              "concentration, up to 4 minutes", 
                              "up to 6 hours"), 
                 name = c("Earth", "Water", "Fire"))


# Remove the text before the digit; the backslash must be doubled inside an R string
df$duration = str_remove(df$duration, "^.*(?=\\d)")
df
#>    duration  name
#> 1 2 minutes Earth
#> 2 4 minutes Water
#> 3   6 hours  Fire

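
One thing to watch with the greedy pattern `^.*(?=\\d)` is numbers with more than one digit. The sketch below compares it with two possible alternatives. The "up to 10 minutes" value is a hypothetical input added for illustration, and the alternative patterns are suggestions rather than part of the original answer.

```r
library(stringr)

# Hypothetical input: the last value contains a two-digit number
duration <- c("concentration, up to 2 minutes", "up to 6 hours", "up to 10 minutes")

# Greedy ".*" gives back only as far as the last digit, so the "1" of "10" is removed too
greedy <- str_remove(duration, "^.*(?=\\d)")

# "\\D*" consumes leading non-digits and stops at the first digit, keeping the whole number
non_digit <- str_remove(duration, "^\\D*")

# Removing the literal phrases instead of relying on where the digits sit
literal <- str_remove_all(duration, "concentration, |up to ")

greedy
non_digit
literal
```

For the three values in the question, all three patterns give the same result. They only differ once a duration has two or more digits.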