R - How to one-hot encode a single column while keeping the other columns unchanged?
I have a data frame like this:
group student exam_passed subject
A 01 Y Math
A 01 N Science
A 01 Y Japanese
A 02 N Math
A 02 Y Science
B 01 Y Japanese
C 02 N Math
What I want to achieve is the following result:
group student exam_passed subject_Math subject_Science subject_Japanese
A 01 Y 1 0 0
A 01 N 0 1 0
A 01 Y 0 0 1
A 02 N 1 0 0
A 02 Y 0 1 0
B 01 Y 0 0 1
C 02 N 1 0 0
Here is the test data frame:
df <- data.frame(
group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
student = c('01', '01', '01', '02', '02', '01', '02'),
exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
)
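I tried a for loop, but the original data is too large for that to be practical. I also tried
mltools::one_hot(df, col = 'subject')
but that does not work either, because of this error: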
Error in `[.data.frame`(dt, , cols, with = FALSE) :
unused argument (with = FALSE)
Can anyone help me solve this? Thanks!
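The error message hints at a likely cause: `with = FALSE` is data.table subsetting syntax, so `one_hot` appears to expect a data.table rather than a plain data.frame. A minimal sketch, assuming the mltools and data.table packages are installed and that `one_hot` only encodes factor columns (hence the explicit conversion):

```r
library(data.table)
library(mltools)

df <- data.frame(
  group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
  student = c('01', '01', '01', '02', '02', '01', '02'),
  exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
  subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
)

# Convert to a data.table and turn the target column into a factor first
dt <- as.data.table(df)
dt[, subject := as.factor(subject)]

# Encode only the subject column; the other columns are left as they are
encoded <- one_hot(dt, cols = "subject")
```

The new columns follow the factor level order, so they may appear in a different order than in the desired output.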
You can use the cryptically named contrasts function to do this.
The relevant part of the documentation:
if contrasts = FALSE
an identity matrix is returned.
Here is a basic implementation:
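encode_onehot <- function(x, colname_prefix = "", colname_suffix = "") {
  if (!is.factor(x)) {
    x <- as.factor(x)
  }
  # Identity matrix with one row and one column per factor level
  encoding_matrix <- contrasts(x, contrasts = FALSE)
  # Select one full row per observation; the comma keeps the result a matrix
  encoded_data <- encoding_matrix[as.integer(x), , drop = FALSE]
  rownames(encoded_data) <- NULL
  colnames(encoded_data) <- paste0(colname_prefix, colnames(encoded_data), colname_suffix)
  encoded_data
}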
df <- cbind(df, encode_onehot(df$subject, "subject_"))
This is fairly generic, does not depend on other libraries, and should be reasonably fast, except perhaps on very large datasets.
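To see why this works, it may help to look at the intermediate objects. The sketch below uses a small hypothetical vector `x` (not part of the original data); indexing the identity matrix by the factor's integer codes picks one row per observation:

```r
x <- factor(c("Math", "Science", "Japanese", "Math"))

# Levels are sorted alphabetically: Japanese, Math, Science
identity_matrix <- contrasts(x, contrasts = FALSE)

# One row of the identity matrix per element of x
one_hot <- identity_matrix[as.integer(x), , drop = FALSE]
rownames(one_hot) <- NULL
```

Each row of `one_hot` then contains a single 1, in the column named after that element's level.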
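You can take advantage of the fact that R converts Booleans to integers.
Like this:
new.data <- cbind(
  old.data,
  # The comparison is case-sensitive, so match the spelling used in the data
  math = as.integer(old.data$subject == "Math")
)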
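The same comparison can be repeated for every distinct value rather than written out by hand. A rough sketch, using a small hypothetical data frame and illustrative names (`subjects`, `indicators`, `new_df`) that are not from the original post:

```r
df <- data.frame(
  student = c('01', '01', '02'),
  subject = c('Math', 'Science', 'Math')
)

subjects <- unique(df$subject)

# One comparison per subject; sapply() binds the results into a matrix
indicators <- sapply(subjects, function(s) as.integer(df$subject == s))
colnames(indicators) <- paste0("subject_", subjects)

new_df <- cbind(df, indicators)
```

This assumes the data frame has more than one row; with a single row, `sapply()` returns a plain vector and the `colnames<-` step would need adjusting.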
Another option:
library(dplyr)
df %>%
mutate(subject_Math = ifelse(subject=='Math', 1, 0),
subject_Science = ifelse(subject=='Science', 1, 0),
subject_Japanese = ifelse(subject=='Japanese', 1, 0))
Here is a more general solution using the data.table and caret libraries:
library(caret)
library(data.table)
dt <- data.table(
group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
student = c('01', '01', '01', '02', '02', '01', '02'),
exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
)
vars <- 'subject'
separator <- '_'
bin_vars <- predict(dummyVars( as.formula(paste0("~",paste0(vars,collapse = "+"))),
data = dt, na.action = na.pass), newdata = dt)
colnames(bin_vars) <- paste0(gsub(vars,paste0(vars,separator),colnames(bin_vars)))
# Parentheses make data.table drop the columns named in vars, not a column called "vars"
dt[, (vars) := NULL]
dt <- cbind(dt,bin_vars)
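If installing caret is not an option, base R's `model.matrix` can build the same kind of indicator columns. A sketch, assuming the `df` from the question; the `- 1` in the formula drops the intercept so that every level gets its own column, and `dummies` and `result` are just illustrative names:

```r
df <- data.frame(
  group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
  student = c('01', '01', '01', '02', '02', '01', '02'),
  exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
  subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
)

# One indicator column per subject, named "subjectJapanese", "subjectMath", ...
dummies <- model.matrix(~ subject - 1, data = df)

# Insert the separator between the variable name and the level
colnames(dummies) <- sub("^subject", "subject_", colnames(dummies))

# Drop the original column and attach the indicators
result <- cbind(df[setdiff(names(df), "subject")], dummies)
```

Note that `model.matrix` drops rows containing NA by default, so data with missing subjects would need extra handling.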
require(tidyr)
require(dplyr)
df %>% mutate(value = 1) %>% spread(subject, value, fill = 0 )
group student exam_pass Japanese Math Science
1 A 01 N 0 0 1
2 A 01 Y 1 1 0
3 A 02 N 0 1 0
4 A 02 Y 0 0 1
5 B 01 Y 1 0 0
6 C 02 N 0 1 0
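Note that in this output the two rows for group A, student 01 with exam_pass Y (Math and Japanese) were merged into one, so only six rows remain. A sketch that keeps one row per original record by adding a temporary row id (the `row_id` column is introduced here purely for illustration), using `pivot_wider`, the newer replacement for `spread` in tidyr:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
  student = c('01', '01', '01', '02', '02', '01', '02'),
  exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
  subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
)

result <- df %>%
  # row_id makes every record unique so that no rows are collapsed
  mutate(row_id = row_number(), value = 1L) %>%
  pivot_wider(
    names_from = subject,
    values_from = value,
    values_fill = 0L,
    names_prefix = "subject_"
  ) %>%
  select(-row_id)
```

This assumes a reasonably recent tidyr in which `values_fill` accepts a single value.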