Replacing strings with a lookup table in dplyr
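I am trying to create a lookup table in R so that my data ends up in the same format used by the company I work for.
It involves different education categories that I want to merge using dplyr.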
library(dplyr)
# Create data
education <- c("Mechanichal Engineering","Electric Engineering","Political Science","Economics")
data <- data.frame(X1=replicate(1,sample(education,1000,rep=TRUE)))
as_tibble(data)  # tbl_df() is deprecated in current dplyr
# Create lookup table
lut <- c("Mechanichal Engineering" = "Engineering",
"Electric Engineering" = "Engineering",
"Political Science" = "Social Science",
"Economics" = "Social Science")
# Assign lookup table
data$X1 <- lut[data$X1]
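But in my output, the old values are replaced with the wrong values, i.e. not the ones I defined in the lookup table. Instead, the lookup table values seem to be assigned at random.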
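A likely explanation, assuming an R version older than 4.0 (where data.frame() converts strings to factors by default): indexing a named vector with a factor uses the factor's integer codes, not its labels, so the lookup picks entries by position. The sketch below builds the factor explicitly so that it behaves the same on any R version; the vector x is made up for illustration.

```r
lut <- c("Mechanichal Engineering" = "Engineering",
         "Electric Engineering"    = "Engineering",
         "Political Science"       = "Social Science",
         "Economics"               = "Social Science")

# Explicit factor, mimicking what data.frame() produced before R 4.0.
# Levels are sorted: Economics = 1, Electric Engineering = 2, Political Science = 3.
x <- factor(c("Economics", "Political Science", "Electric Engineering"))

# Indexing with the factor uses the integer codes 1, 3, 2 (positions in lut).
wrong <- unname(lut[x])

# Converting to character first makes the lookup match on names.
right <- unname(lut[as.character(x)])

print(data.frame(x = x, wrong = wrong, right = right))
```

Under that assumption, changing the last line of the question to data$X1 <- lut[as.character(data$X1)] should give the intended result.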
The best way I have found is to use recode() from the car package.
# Observe that dplyr also has a recode function, so require car after dplyr
require(dplyr)
require(car)
The data is sampled from four education categories.
education <- c("Mechanichal Engineering",
"Electric Engineering","Political Science","Economics")
data <- data.frame(ID = c(1:1000), X1 = replicate(1,sample(education,1000,rep=TRUE)))
Using recode() on the data, I recode the categories:
lut <- data.frame(ID = c(1:1000), X2 = recode(data$X1, '"Economics" = "Social Science";
"Electric Engineering" = "Engineering";
"Political Science" = "Social Science";
"Mechanichal Engineering" = "Engineering"'))
To check that it worked correctly, join the original data and the recoded data:
data <- full_join(data, lut, by = "ID")
head(data)
ID X1 X2
1 1 Political Science Social Science
2 2 Economics Social Science
3 3 Electric Engineering Engineering
4 4 Political Science Social Science
5 5 Economics Social Science
6 6 Mechanichal Engineering Engineering
With recode(), you do not need to sort the data before recoding.
library(reshape2)  # provides melt()
education <- c("Mechanichal Engineering","Electric Engineering","Political Science","Economics")
lut <- list("Mechanichal Engineering" = "Engineering",
"Electric Engineering" = "Engineering",
"Political Science" = "Social Science",
"Economics" = "Social Science")
lut2<-melt(lut)
data1 <- data.frame(X1=replicate(1,sample(education,1000,rep=TRUE)))
data1$new <- lut2[match(data1$X1,lut2$L1),'value']
head(data1)
======================= ==============
X1 new
======================= ==============
Political Science Social Science
Political Science Social Science
Mechanichal Engineering Engineering
Mechanichal Engineering Engineering
Political Science Social Science
Political Science Social Science
======================= ==============
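The match()-based lookup above does not strictly need melt(). Here is a minimal base R sketch of the same idea, assuming the lookup table is kept as a named character vector as in the question. The input vector x and the fallback to the original value are illustrative choices, not part of the answer above.

```r
lut <- c("Mechanichal Engineering" = "Engineering",
         "Electric Engineering"    = "Engineering",
         "Political Science"       = "Social Science",
         "Economics"               = "Social Science")

# Illustrative input, including one value that is not in the lookup table.
x <- c("Political Science", "Mechanichal Engineering", "Other")

# match() gives the position of each value among the lookup names (NA if absent).
idx <- match(x, names(lut))
new <- unname(lut[idx])

# Keep the original value wherever the lookup table has no entry.
new[is.na(idx)] <- x[is.na(idx)]

print(data.frame(X1 = x, new = new))
```

Because match() works on the values themselves rather than on factor codes, wrapping the input in as.character() first keeps this safe for factor columns as well.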
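I have been trying to work this out myself. I was not quite satisfied with most of the solutions I found, so this is what I ended up with. I added an "Other" category to show that it works even when a value is not defined in the lookup table.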
library(dplyr)
# Create data
education <- c("Mechanichal Engineering",
"Electric Engineering",
"Political Science",
"Economics",
"Other")
data <- data.frame(X1 = replicate(1, sample(education, 20, rep=TRUE)))
# Create lookup table
lut <- c("Mechanichal Engineering" = "Engineering",
"Electric Engineering" = "Engineering",
"Political Science" = "Social Science",
"Economics" = "Social Science")
data %>%
mutate(X2 = recode(X1, !!!lut))
#> X1 X2
#> 1 Electric Engineering Engineering
#> 2 Other Other
#> 3 Other Other
#> 4 Other Other
#> 5 Other Other
#> 6 Political Science Social Science
#> 7 Other Other
#> 8 Economics Social Science
#> 9 Political Science Social Science
#> 10 Electric Engineering Engineering
#> 11 Economics Social Science
#> 12 Economics Social Science
#> 13 Mechanichal Engineering Engineering
#> 14 Economics Social Science
#> 15 Political Science Social Science
#> 16 Other Other
#> 17 Other Other
#> 18 Other Other
#> 19 Mechanichal Engineering Engineering
#> 20 Political Science Social Science
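If unmatched values should be collapsed into one bucket instead of passing through unchanged, dplyr's recode() accepts a .default argument. A small sketch, assuming a character column; the label "Unclassified" and the three-row data frame are made up for illustration.

```r
library(dplyr)

lut <- c("Mechanichal Engineering" = "Engineering",
         "Electric Engineering"    = "Engineering",
         "Political Science"       = "Social Science",
         "Economics"               = "Social Science")

# Small illustrative data set with one value missing from the lookup table.
data <- data.frame(X1 = c("Economics", "Other", "Electric Engineering"),
                   stringsAsFactors = FALSE)

# !!! splices the named vector into recode() as old = new pairs;
# .default is used for every value without a match.
result <- data %>%
  mutate(X2 = recode(X1, !!!lut, .default = "Unclassified"))

print(result)
```

Note that recent dplyr releases mark recode() as superseded, so this pattern may eventually warrant a different function.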