Convert a factor into binary dummies when not all levels are present
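I have a number of data frames, each containing a factor that I want to expand into binary dummy variables (one-hot encoding). Not every possible level is present in every data frame, but I know what all the possible levels are (there are 70 of them). I want to add all possible binary dummies to each data frame.
With the code below I can create dummies for each data frame, but not all possible dummies. For example, set1.df has no one in category "E" or "F", and set2.df has no one in category "D". What I need are all-zero columns set1.dfE and set1.dfF for set1.df, and an all-zero column set2.dfD for set2.df. I can't rbind set1.df and set2.df before creating the dummies, because I need to do some processing on each data frame with the binary variables before rbinding. To reiterate, I know in advance which levels my data can have, e.g. "A" to "F".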
library(dummies)
person_id <- c(1,2,3,4,5,6,7,8,9,10)
person_cat <- c("A","B","C","A","B","C","D","A","A","A")
set1.df <- data.frame(person_id,person_cat)
person_id <- c(11,12,13,14,15,16,17,18,19,20)
person_cat <- c("A","B","C","A","B","C","E","E","F","A")
set2.df <- data.frame(person_id,person_cat)
dummies1 <- dummy(set1.df[,2])
dummies2 <- dummy(set2.df[,2])
dummies1
dummies2
The expected output is:
> dummies1
set1.dfA set1.dfB set1.dfC set1.dfD set1.dfE set1.dfF
[1,] 1 0 0 0 0 0
[2,] 0 1 0 0 0 0
[3,] 0 0 1 0 0 0
[4,] 1 0 0 0 0 0
[5,] 0 1 0 0 0 0
[6,] 0 0 1 0 0 0
[7,] 0 0 0 1 0 0
[8,] 1 0 0 0 0 0
[9,] 1 0 0 0 0 0
[10,] 1 0 0 0 0 0
> dummies2
set2.dfA set2.dfB set2.dfC set2.dfD set2.dfE set2.dfF
[1,] 1 0 0 0 0 0
[2,] 0 1 0 0 0 0
[3,] 0 0 1 0 0 0
[4,] 1 0 0 0 0 0
[5,] 0 1 0 0 0 0
[6,] 0 0 1 0 0 0
[7,] 0 0 0 0 1 0
[8,] 0 0 0 0 1 0
[9,] 0 0 0 0 0 1
[10,] 1 0 0 0 0 0
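Here is one solution:
# All levels known in advance, including those absent from set1.df
levels <- c('A', 'B', 'C', 'D', 'E', 'F')
# One row per person, one column per level
data <- data.frame(matrix(NA, nrow = nrow(set1.df), ncol = length(levels)))
names(data) <- levels
for (i in seq_len(nrow(data))) {
  for (j in seq_along(data)) {
    # 1 if this person's category matches the column's level, else 0
    data[i, j] <- ifelse(set1.df[i, 2] == names(data)[j], 1, 0)
  }
}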
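You create an empty data frame with as many rows as there are IDs and as many columns as there are possible levels. Then a loop checks person_cat against each column: a cell is 1 only when person_cat equals the column name (the category level), and 0 otherwise.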
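The same comparison can also be written without explicit loops. The following is only a minimal sketch in base R, assuming the data from the question; the names all_levels and dummies1 are chosen here for illustration.

```r
# Data from the question (set1.df has no "E" or "F").
set1.df <- data.frame(
  person_id  = 1:10,
  person_cat = c("A", "B", "C", "A", "B", "C", "D", "A", "A", "A")
)

# Every level known in advance (illustrative name).
all_levels <- c("A", "B", "C", "D", "E", "F")

# outer() compares each person's category with each known level, giving a
# logical matrix with one row per person and one column per level.
# Multiplying by 1L turns TRUE/FALSE into 1/0.
dummies1 <- outer(as.character(set1.df$person_cat), all_levels, "==") * 1L
colnames(dummies1) <- all_levels

dummies1
```

Because the columns come from all_levels rather than from the data, levels that never occur (here "E" and "F") still get an all-zero column.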
Alternatively, declare all possible levels on the factor itself and pass drop=FALSE to dummy():
library(dummies)
person_id <- c(1,2,3,4,5,6,7,8,9,10)
person_cat <- c("A","B","C","A","B","C","D","A","A","A")
person_cat <- factor(person_cat,levels=c("A","B","C","D","E","F"))
set1.df <- data.frame(person_id,person_cat)
person_id <- c(11,12,13,14,15,16,17,18,19,20)
person_cat <- c("A","B","C","A","B","C","E","E","F","A")
person_cat <- factor(person_cat,levels=c("A","B","C","D","E","F"))
set2.df <- data.frame(person_id,person_cat)
dummies1 <- dummy(set1.df[,2],drop=FALSE)
dummies2 <- dummy(set2.df[,2],drop=FALSE)
dummies1
dummies2
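If the dummies package is not available, the same idea (fix the factor levels first, then expand) can be sketched with base R's model.matrix(). This is an illustration under the question's data, not part of the answer above; all_levels is an assumed name, and the resulting columns are named person_catA to person_catF rather than set2.dfA to set2.dfF.

```r
# Every level known in advance (illustrative name).
all_levels <- c("A", "B", "C", "D", "E", "F")

# Data from the question (set2.df has no "D"), with the full level set
# declared on the factor so that the unused level is retained.
set2.df <- data.frame(
  person_id  = 11:20,
  person_cat = factor(c("A", "B", "C", "A", "B", "C", "E", "E", "F", "A"),
                      levels = all_levels)
)

# "- 1" removes the intercept, so every level gets its own indicator column.
dummies2 <- model.matrix(~ person_cat - 1, data = set2.df)

dummies2
```

The result is a numeric matrix, so it can be attached to the original data frame with cbind(set2.df, dummies2) before any per-frame processing and the later rbind.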