Create dummy variables from all categorical variables in a dataframe
I need to one-hot encode all categorical columns in a dataframe. I found something like this:
one_hot <- function(df, key) {
  key_col <- dplyr::select_var(names(df), !! rlang::enquo(key))
  df <- df %>% mutate(.value = 1, .id = seq(n()))
  df <- df %>% tidyr::spread_(key_col, ".value", fill = 0, sep = "_") %>%
    select(-.id)
}
But I can't figure out how to apply it to all categorical columns.
keys <- select_if(data, is.character)[-c(1:2)]
tmp <- map(keys, function(names) reduce(data, ~one_hot(.x, keys)))
This throws the following error:
Error: `var` must evaluate to a single number or a column name, not a list
Update:
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c('male', 'female', 'female', 'male', 'female'),
  mood = c('happy', 'sad', 'happy', 'sad', 'happy'),
  outcome = c(1, 1, 0, 0, 0))
customers
After encoding:
  id gender.female gender.male mood.happy mood.sad outcome
1 10             0           1          1        0       1
2 20             1           0          0        1       1
3 30             1           0          1        0       0
4 40             0           1          0        1       0
5 50             1           0          1        0       0
Using the dummies package:
library(dummies)
dummy.data.frame(customers)
  id genderfemale gendermale moodhappy moodsad outcome
1 10            0          1         1       0       1
2 20            1          0         0       1       1
3 30            1          0         1       0       0
4 40            0          1         0       1       0
5 50            1          0         1       0       0
Here is an approach using the recipes package.
library(dplyr)
library(recipes)

# Declare which variables are the predictors
recipe(formula = outcome ~ ., data = customers) %>%
  # Declare that one-hot encoding will be applied to all nominal variables
  step_dummy(all_nominal(), one_hot = TRUE) %>%
  # Based on the previous declarations, apply the transformations to the data
  # and return the resulting data frame
  prep() %>%
  juice()
A one-liner with mltools and data.table:
library(data.table)
library(mltools)
one_hot(as.data.table(customers))
   id gender_female gender_male mood_happy mood_sad outcome
1: 10             0           1          1        0       1
2: 20             1           0          0        1       1
3: 30             1           0          1        0       0
4: 40             0           1          0        1       0
5: 50             1           0          1        0       0
It handles all factor variables in one go, and has some nice built-in options for dealing with NAs and unused factor levels.
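One caveat worth spelling out: as far as I can tell, `one_hot()` with its default `cols = "auto"` only picks up unordered factor columns, and since R 4.0 `data.frame()` no longer turns strings into factors by default. So on a recent R the character columns may need converting first. A minimal sketch, assuming mltools and data.table are installed; the `sparsifyNAs` and `dropUnusedLevels` arguments are shown at what I understand to be their default values, purely to point at the NA and unused-level options:

```r
library(data.table)
library(mltools)

customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c("male", "female", "female", "male", "female"),
  mood = c("happy", "sad", "happy", "sad", "happy"),
  outcome = c(1, 1, 0, 0, 0),
  stringsAsFactors = TRUE  # assumption: one_hot() looks for factor columns
)

encoded <- one_hot(
  as.data.table(customers),
  sparsifyNAs = FALSE,      # leave NAs as NA instead of turning them into 0s
  dropUnusedLevels = FALSE  # keep all-zero columns for unused factor levels
)
encoded
```

If the columns are left as character, I would expect the call to return the table unchanged rather than raise an error, which is easy to miss.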
Also a one-liner with the fastDummies package.
fastDummies::dummy_cols(customers)
  id gender  mood outcome gender_male gender_female mood_happy mood_sad
1 10   male happy       1           1             0          1        0
2 20 female   sad       1           0             1          0        1
3 30 female happy       0           0             1          1        0
4 40   male   sad       0           1             0          0        1
5 50 female happy       0           0             1          1        0
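If you would rather avoid an extra dependency, the same result can be had in base R. This is only a sketch and not taken from any of the packages above: `one_hot_all` is a made-up helper name, and it assumes every character or factor column should be encoded, with the dotted column names from the question's expected output.

```r
# Hypothetical helper (not from any package): one-hot encode every
# character or factor column of a data frame using base R only.
one_hot_all <- function(df, sep = ".") {
  pieces <- lapply(names(df), function(nm) {
    col <- df[[nm]]
    if (is.character(col) || is.factor(col)) {
      lv <- levels(factor(col))
      # One 0/1 integer column per level; NA values stay NA
      out <- lapply(lv, function(l) as.integer(col == l))
      names(out) <- paste(nm, lv, sep = sep)
      data.frame(out, check.names = FALSE)
    } else {
      df[nm]  # non-categorical columns pass through unchanged
    }
  })
  do.call(cbind, pieces)
}

customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c("male", "female", "female", "male", "female"),
  mood = c("happy", "sad", "happy", "sad", "happy"),
  outcome = c(1, 1, 0, 0, 0)
)

encoded <- one_hot_all(customers)
encoded
```

Unlike `fastDummies::dummy_cols()`, this drops the original `gender` and `mood` columns and keeps the dummy columns in the position of the column they replace.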