PCA within cross-validation; however, only with a subset of variables
This question is very similar to "preprocess within cross-validation in caret"; however, in the project I'm working on I would only like to do PCA on three predictors out of 19. Below is the example from "preprocess within cross-validation in caret". For convenience I will use the same data, PimaIndiansDiabetes (this is not my project data, but the concept should be the same). I would like to preprocess only a subset of the variables, i.e. PimaIndiansDiabetes[, c(4,5,6)]. Is there a way to do this?
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
control <- trainControl(method="cv",
number=5)
p <- preProcess(PimaIndiansDiabetes[, c(4,5,6)], #only do these columns!
method = c("center", "scale", "pca"))
p
grid=expand.grid(mtry=c(1,2,3))
model <- train(diabetes~., data=PimaIndiansDiabetes, method="rf",
preProcess= p,
trControl=control,
tuneGrid=grid)
But I get this error:
Error: pre-processing methods are limited to: BoxCox, YeoJohnson, expoTrans, invHyperbolicSine, center, scale, range, knnImpute, bagImpute, medianImpute, pca, ica, spatialSign, ignore, keep, remove, zv, nzv, conditionalX, corr
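The reason I'm trying to do this is so that I can reduce the three variables to a single PC1 and use it for prediction. In the project I'm working on, all three variables are more than 90% correlated, but I'd like to combine them rather than drop any, because other studies have used them as well. Thanks. Trying to avoid data leakage!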
As far as I know, this is not possible with caret alone.
It might be possible using recipes. However, I do not use recipes but I do use mlr3, so I will show how to do it with that package:
library(mlr3)
library(mlr3pipelines)
library(mlr3learners)
library(paradox)
library(mlr3tuning)
library(mlbench)
Create a task from the data:
data("PimaIndiansDiabetes")
pima_tsk <- TaskClassif$new(id = "Pima",
backend = PimaIndiansDiabetes,
target = "diabetes")
Define a preprocessing selector named "slct1":
pos1 <- po("select", id = "slct1")
and define the selector function within it:
pos1$param_set$values$selector <- selector_name(colnames(PimaIndiansDiabetes[, 4:6]))
Now define what should happen to the selected features: scaling -> PCA, keeping only the first PC (param_vals = list(rank. = 1)):
pos1 %>>%
po("scale", id = "scale1") %>>%
po("pca", id = "pca1", param_vals = list(rank. = 1)) -> pr1
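To make it concrete what this branch has to do inside each resampling fold, here is a minimal base R sketch (plain `prcomp`, not mlr3 code). The centering, scaling and rotation are estimated on the training rows only and then applied unchanged to the held-out rows, which is what keeps the held-out data from leaking into the preprocessing. The single 70/30 split and the names `train_idx`, `pca_fit`, `pc1_train` and `pc1_test` are my own, purely for illustration.

```r
library(mlbench)
data("PimaIndiansDiabetes")

# The three columns that should be collapsed into one component
pca_cols <- c("triceps", "insulin", "mass")

# One illustrative train/test split (a real CV repeats this for every fold)
set.seed(1)
n <- nrow(PimaIndiansDiabetes)
train_idx <- sample(n, size = round(0.7 * n))

train_x <- PimaIndiansDiabetes[train_idx, pca_cols]
test_x  <- PimaIndiansDiabetes[-train_idx, pca_cols]

# Estimate centering, scaling and the rotation on the training rows only;
# rank. = 1 keeps just the first principal component
pca_fit <- prcomp(train_x, center = TRUE, scale. = TRUE, rank. = 1)

# Scores for the training rows, and the same transformation applied
# to the held-out rows without re-estimating anything
pc1_train <- pca_fit$x
pc1_test  <- predict(pca_fit, newdata = test_x)
```

The pipeline in this answer does the same thing automatically: because the scaling and PCA steps are part of the learner, they are refit on the training portion of every fold.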
Now define an inverted selector, which keeps all the other features:
pos2 <- po("select", id = "slct2")
pos2$param_set$values$selector <- selector_invert(pos1$param_set$values$selector)
Define the learner:
rf_lrn <- po("learner", lrn("classif.ranger")) #ranger is a faster version of rf
Combine them:
gunion(list(pr1, pos2)) %>>%
po("featureunion") %>>%
rf_lrn -> graph
Check that it looks OK:
graph$plot(html = TRUE)
Convert the graph to a learner:
glrn <- GraphLearner$new(graph)
Define the parameters to tune:
ps <- ParamSet$new(list(
ParamInt$new("classif.ranger.mtry", lower = 1, upper = 6),
ParamInt$new("classif.ranger.num.trees", lower = 100, upper = 1000)))
Define the resampling:
cv10 <- rsmp("cv", folds = 10)
Define the tuning:
instance <- TuningInstance$new(
task = pima_tsk,
learner = glrn,
resampling = cv10,
measures = msr("classif.ce"),
param_set = ps,
terminator = term("evals", n_evals = 20)
)
set.seed(1)
tuner <- TunerRandomSearch$new()
tuner$tune(instance)
instance$result
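The code above was written against an early mlr3 release. As far as I'm aware, several of those names have since changed: `term()` became `trm()`, `tuner$tune()` became `tuner$optimize()`, the tuning instance is now usually built with `ti()`, and search spaces are written with `ps()` and `p_int()` instead of `ParamInt$new()`. Below is a sketch of the same pipeline under the assumption that recent versions of mlr3, mlr3pipelines, mlr3tuning and paradox are installed; I have not checked it against every release, so treat it as a starting point. The names `pca_cols`, `pca_branch`, `rest_branch` and `search_space` are mine.

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3learners)
library(mlr3tuning)
library(paradox)
library(mlbench)

data("PimaIndiansDiabetes")
pima_tsk <- as_task_classif(PimaIndiansDiabetes, target = "diabetes", id = "Pima")

# Columns 4:6 of the data, referred to by name
pca_cols <- c("triceps", "insulin", "mass")

# Branch 1: select the three columns, scale them, keep the first PC
pca_branch <- po("select", id = "slct1",
                 param_vals = list(selector = selector_name(pca_cols))) %>>%
  po("scale", id = "scale1") %>>%
  po("pca", id = "pca1", param_vals = list(rank. = 1))

# Branch 2: pass all remaining features through untouched
rest_branch <- po("select", id = "slct2",
                  param_vals = list(selector = selector_invert(selector_name(pca_cols))))

# Join both branches and feed the result to a random forest
graph <- gunion(list(pca_branch, rest_branch)) %>>%
  po("featureunion") %>>%
  po("learner", lrn("classif.ranger"))
glrn <- GraphLearner$new(graph)

# Search space: PC1 plus the five untouched features gives 6 predictors
search_space <- ps(
  classif.ranger.mtry      = p_int(lower = 1, upper = 6),
  classif.ranger.num.trees = p_int(lower = 100, upper = 1000)
)

instance <- ti(
  task         = pima_tsk,
  learner      = glrn,
  resampling   = rsmp("cv", folds = 10),
  measures     = msr("classif.ce"),
  search_space = search_space,
  terminator   = trm("evals", n_evals = 20)
)

set.seed(1)
tuner <- tnr("random_search")
tuner$optimize(instance)
instance$result
```

The structure is unchanged from the original code; only the constructors and the tuning call differ.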
For more details on how to tune the number of principal components to keep, see this answer:
If you find this interesting, check out the mlr3book.
Also,
cor(PimaIndiansDiabetes[, 4:6])
          triceps   insulin      mass
triceps 1.0000000 0.4367826 0.3925732
insulin 0.4367826 1.0000000 0.1978591
mass    0.3925732 0.1978591 1.0000000
does not show what you describe in the question: none of these correlations is anywhere near 90%.
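Coming back to the recipes route mentioned at the start of this answer: I have not used it myself, so the following is only a sketch of how I would expect it to look, assuming the recipes and randomForest packages are installed. caret's `train()` has a method that accepts a recipe, and `step_normalize()` and `step_pca()` can be restricted to named columns. To my understanding the recipe is re-estimated on the training part of each fold, which is the property needed to avoid leakage, but please verify that against the caret documentation.

```r
library(caret)
library(recipes)
library(mlbench)

data(PimaIndiansDiabetes)

# Recipe: normalize only triceps, insulin and mass, then replace them
# with their first principal component; other predictors are untouched
rec <- recipe(diabetes ~ ., data = PimaIndiansDiabetes)
rec <- step_normalize(rec, triceps, insulin, mass)
rec <- step_pca(rec, triceps, insulin, mass, num_comp = 1)

control <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry = c(1, 2, 3))

set.seed(1)
model <- train(rec,
               data = PimaIndiansDiabetes,
               method = "rf",
               trControl = control,
               tuneGrid = grid)
model
```

The control object and the tuning grid are the ones from the question; only the `preProcess` argument is replaced by the recipe.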