R: one-hot encoding for one variable in a dataset
I have a dataset in which I want to one-hot encode one variable and then build a model (lm).
The variable is called 'zone'.
This is what I tried:
lm_model <- train(formula(paste0("price ~", paste0(features, collapse = " + "))),
data = predict(dummyVars( ~ "zone", data = data_train), newdata = data_train),
method = "lm",
trControl = trainControl(method = "cv", number = 10),
preProcess = c("center", "scale"),
na.action=na.exclude
)
I am not sure about this part. Can someone guide me here?
data = predict(dummyVars( ~ "zone", data = data_train), newdata = data_train),
Let's use cyl from mtcars as an example of a categorical variable:
library(caret)
da <- mtcars
da$cyl <- factor(da$cyl)
# we can include cyl as features
features <- c("cyl","hp","drat","wt","qsec")
#our dependent is mpg
Let's check what dummyVars does:
head(predict(dummyVars(mpg~.,data=da[,c("mpg",features)]),da))
                  cyl.4 cyl.6 cyl.8  hp drat    wt  qsec
Mazda RX4             0     1     0 110 3.90 2.620 16.46
Mazda RX4 Wag         0     1     0 110 3.90 2.875 17.02
Datsun 710            1     0     0  93 3.85 2.320 18.61
Hornet 4 Drive        0     1     0 110 3.08 3.215 19.44
Hornet Sportabout     0     0     1 175 3.15 3.440 17.02
Valiant               0     1     0 105 2.76 3.460 20.22
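You can see that cyl is expanded into 3 binary variables, while the continuous variables are kept as they are. The dependent variable is not in this prediction (...)
So for training: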
onehot_data <- cbind(mpg=da$mpg,
predict(dummyVars(mpg~.,data=da[,c("mpg",features)]),da))
lm_model <- train(mpg ~.,data=onehot_data,
method = "lm",
trControl = trainControl(method = "cv", number = 10),
preProcess = c("center", "scale"),
na.action=na.exclude
)
It gives you a warning:
Warning messages:
1: In predict.lm(modelFit, newdata) :
prediction from a rank-deficient fit may be misleading
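For a linear model, caret fits a model with an intercept. Because you have only one categorical variable, the intercept column is a linear combination of your one-hot encoded columns (the three cyl columns sum to 1 in every row), so the fit is rank-deficient.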
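The linear dependence described above can be checked directly. This is a minimal sketch that uses base R's model.matrix instead of dummyVars to build the one-hot columns, so the column names (cyl4, cyl6, cyl8) differ from the caret output; the names onehot and X are only for illustration.

```r
da <- mtcars
da$cyl <- factor(da$cyl)

# One-hot columns for cyl only; "- 1" suppresses the intercept column
onehot <- model.matrix(~ cyl - 1, data = da)

# Each row has exactly one 1, so the three columns add up to a column of ones
all(rowSums(onehot) == 1)

# Intercept plus all three dummies: 4 columns, but one of them is redundant
X <- cbind(Intercept = 1, onehot)
ncol(X)
qr(X)$rank
```

The rank comes out one lower than the number of columns, which is exactly the situation predict.lm warns about.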
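You need to decide which level of the categorical variable will be the reference level, and remove that column from the one-hot data, for example: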
# remove cyl.4 (column 2; column 1 is mpg), so 4 cylinders becomes the reference level
onehot_data <- onehot_data[, -2]
lm_model <- train(mpg ~.,data=onehot_data,
method = "lm",
trControl = trainControl(method = "cv", number = 10),
preProcess = c("center", "scale"),
na.action=na.exclude
)
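As an alternative to dropping the column by position, dummyVars accepts a fullRank argument that leaves out the first level of each factor when encoding. The sketch below assumes the same mtcars setup as above; the names dv, onehot_fr and fit are only for illustration, and a plain lm call stands in for train to keep the check quick.

```r
library(caret)

da <- mtcars
da$cyl <- factor(da$cyl)
features <- c("cyl", "hp", "drat", "wt", "qsec")

# fullRank = TRUE omits the first level of each factor (here: 4 cylinders)
dv <- dummyVars(mpg ~ ., data = da[, c("mpg", features)], fullRank = TRUE)
onehot_fr <- data.frame(mpg = da$mpg, predict(dv, newdata = da))

# Inspect which columns were produced
colnames(onehot_fr)

# With the reference level dropped, no coefficient should be NA (aliased)
fit <- lm(mpg ~ ., data = onehot_fr)
anyNA(coef(fit))
```

The resulting data frame can be passed to train(mpg ~ ., data = onehot_fr, ...) in the same way as onehot_data. For the original question, the same pattern would apply with zone in place of cyl, using an unquoted variable name in the dummyVars formula.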