R: knn + pca, undefined columns selected
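I am trying to use knn for prediction, but would first like to run a principal component analysis to reduce dimensionality.
However, after I generate the principal components and use them with knn, I get the error
"Error in [.data.frame(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected"
and the warning:
"In addition: Warning message: In nominalTrainWorkflow(x = x, y = y,
wts = weights, info = trainInfo, : There were missing values in
resampled performance measures."
Here is my sample: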
sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>%
data.frame()
# first 15 rows go into the training set
train1 = sample[1:15, ]
test = sample[16:20, ]
# drop the dependent variable
pca.tr=sample[1:15,2:6]
pcom = prcomp(pca.tr, scale.=T)
pca.tr=data.frame(True=train1[,1], pcom$x)
#select the first 2 principal components
pca.tr = pca.tr[, 1:2]
train.ct = trainControl(method = "repeatedcv", number = 3, repeats=1)
k = train(train1[,1] ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:5),
trControl = train.control, preProcess='scale',
metric = "RMSE",
data = cbind(train1[,1], pca.tr))
Any suggestions would be appreciated!
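Use better column names and a formula without subscripts.
You really should try to post a reproducible example. Some of your code was wrong.
Also, preProc has a "pca" method that does the appropriate thing by recomputing the PCA scores inside of resampling.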
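To see why a subscripted response breaks the formula interface, here is a minimal base-R sketch. It rests on the error message above, which indicates that the data frame is subset by `all.vars(Terms)`. The toy data frame `d` and its column names are illustrative assumptions, not part of the original post.

```r
# Toy data frame; the names y and PC1 are illustrative only.
d <- data.frame(y = c(1, 2, 3), PC1 = c(0.5, -0.2, 0.1))

# all.vars() extracts bare symbols from the formula, so a subscripted
# response contributes the object name "train1", which is not a column of d.
vars <- all.vars(train1[, 1] ~ PC1)
print(vars)

# Selecting those names from the data frame fails in the same way
# as the call shown in the error message.
msg <- tryCatch(d[, vars, drop = FALSE],
                error = function(e) conditionMessage(e))
print(msg)
```

With a plainly named response column (for example `y ~ .`), every name the formula yields exists in the data, so the lookup succeeds.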
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(55)
sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>%
data.frame()
train1 = sample[1:15, ]
test = sample[16:20, ]
pca.tr=sample[1:15,2:6]
pcom = prcomp(pca.tr, scale.=T)
pca.tr=data.frame(True=train1[,1], pcom$x)
#select the first 2 principal components
pca.tr = pca.tr[, 1:2]
dat <- cbind(train1[,1], pca.tr) %>%
# this is the fix: give every column a plain name so the formula can be y ~ .
setNames(c("y", "True", "PC1"))
train.ct = trainControl(method = "repeatedcv", number = 3, repeats=1)
set.seed(356)
k = train(y ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:5),
trControl = train.ct, # this argument was wrong in your code
preProcess='scale',
metric = "RMSE",
data = dat)
k
#> k-Nearest Neighbors
#>
#> 15 samples
#> 2 predictor
#>
#> Pre-processing: scaled (2)
#> Resampling: Cross-Validated (3 fold, repeated 1 times)
#> Summary of sample sizes: 11, 10, 9
#> Resampling results across tuning parameters:
#>
#>   k  RMSE      Rsquared   MAE
#>   1  4.979826  0.4332661  3.998205
#>   2  5.347236  0.3970251  4.312809
#>   3  5.016606  0.5977683  3.939470
#>   4  4.504474  0.8060368  3.662623
#>   5  5.612582  0.5104171  4.500768
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 4.
# or
set.seed(356)
train(X1 ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:5),
trControl = train.ct,
preProcess= c('pca', 'scale'),
metric = "RMSE",
data = train1)
#> k-Nearest Neighbors
#>
#> 15 samples
#> 5 predictor
#>
#> Pre-processing: principal component signal extraction (5), scaled
#> (5), centered (5)
#> Resampling: Cross-Validated (3 fold, repeated 1 times)
#> Summary of sample sizes: 11, 10, 9
#> Resampling results across tuning parameters:
#>
#>   k  RMSE       Rsquared   MAE
#>   1  13.373189  0.2450736  10.592047
#>   2  10.217517  0.2952671   7.973258
#>   3   9.030618  0.2727458   7.639545
#>   4   8.133807  0.1813067   6.445518
#>   5   8.083650  0.2771067   6.551053
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 5.
Created on 2019-04-15 by the reprex package (v0.2.1)
In terms of RMSE these look worse, but the previous run underestimates RMSE because it assumed there is no variation in the PCA scores.
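The idea of recomputing the PCA scores inside resampling can be sketched by hand in base R: fit the PCA on the fitting rows only, then project the held-out rows with that same rotation. This is an illustration of the principle, not caret's internal code; the names `sample_df`, `fit_rows`, `predictors` and `keep` are assumptions introduced here.

```r
set.seed(55)
# Illustrative data shaped like the question's: outcome X1, predictors X2..X6.
sample_df <- data.frame(cbind(rnorm(20, 100, 10),
                              matrix(rnorm(100, 10, 2), nrow = 20)))
fit_rows   <- 1:15               # rows used for fitting (e.g. one analysis set)
predictors <- paste0("X", 2:6)

# Fit the PCA on the fitting rows only.
pc_fit <- prcomp(sample_df[fit_rows, predictors], center = TRUE, scale. = TRUE)

# Keep components by name so no other column is picked up by position.
keep <- c("PC1", "PC2")
scores_fit <- data.frame(y = sample_df$X1[fit_rows], pc_fit$x[, keep])

# Project the held-out rows using the centering, scaling and rotation
# learned from the fitting rows.
scores_new <- predict(pc_fit, newdata = sample_df[-fit_rows, predictors])[, keep]
```

Selecting by name matters here. In the code above, `pca.tr[, 1:2]` returns the `True` column (a copy of the outcome) together with `PC1`, so the first model appears to receive the outcome itself as a predictor rather than the first two principal components.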