Why can't I use cv.glm on the output of bestglm?

I am trying to do best subset selection on the wine dataset, and then I want to get the test error rate using 10-fold CV. The code I used is:

library(bestglm)  # bestglm()
library(boot)     # cv.glm()

# Misclassification rate at a 0.5 threshold
cost1 <- function(good, pi=0) mean(abs(good-pi) > 0.5)
res.best.logistic <-
    bestglm(Xy = winedata,
            family = binomial,          # binomial family for logistic
            IC = "AIC",                 # Information criteria
            method = "exhaustive")
res.best.logistic$BestModels
best.cv.err<- cv.glm(winedata,res.best.logistic$BestModel,cost1, K=10)

However, this gives the error:

Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "NULL"

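I thought $BestModel was the lm object representing the best fit, and the manual says so too. If that is the case, why can't I use cv.glm on it to get the 10-fold CV test error?

The dataset used is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality, and the packages used are boot (for cv.glm) and bestglm.

The data were processed as follows: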

winedata <- read.delim("winequality-white.csv", sep = ';')
# Recode quality as a binary factor: "1" if quality >= 7, otherwise "0"
winedata$quality <- factor(as.integer(winedata$quality >= 7))
names(winedata)[names(winedata) == "quality"] <- "good"      #rename 'quality' to 'good'

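bestglm rearranges your data and renames your response variable to y, and the call stored in the fit (shown below) refers to objects such as Xi, family and weights that only exist inside bestglm. When you pass the fit to cv.glm, it re-evaluates that call on winedata, where those names no longer resolve and there is no column y, so everything breaks from there.

It is always good to check what the class is: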

class(res.best.logistic$BestModel)
[1] "glm" "lm" 

But if you look at the call of res.best.logistic$BestModel:

res.best.logistic$BestModel$call

glm(formula = y ~ ., family = family, data = Xi, weights = weights)

head(res.best.logistic$BestModel$model)
  y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 0           7.0             0.27        0.36           20.7     0.045
2 0           6.3             0.30        0.34            1.6     0.049
3 0           8.1             0.28        0.40            6.9     0.050
4 0           7.2             0.23        0.32            8.5     0.058
5 0           7.2             0.23        0.32            8.5     0.058
6 0           8.1             0.28        0.40            6.9     0.050
  free.sulfur.dioxide density   pH sulphates
1                  45  1.0010 3.00      0.45
2                  14  0.9940 3.30      0.49
3                  30  0.9951 3.26      0.44
4                  47  0.9956 3.19      0.40
5                  47  0.9956 3.19      0.40
6                  30  0.9951 3.26      0.44
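
To see why a stored call like this cannot simply be re-run, here is a minimal sketch on simulated data. It assumes the failure comes from re-evaluating the call outside the function that created it; fit_inside is a hypothetical wrapper written for illustration, not bestglm's actual code, and toy is made-up data.

```r
# Hypothetical wrapper: fits a glm using argument names that are local to it,
# so the stored call mentions `family` and `Xi` rather than the user's objects.
fit_inside <- function(Xi, family) {
  glm(formula = y ~ ., family = family, data = Xi)
}

set.seed(1)
toy <- data.frame(x = rnorm(50))            # simulated predictor
toy$y <- rbinom(50, 1, plogis(toy$x))       # simulated 0/1 response

m <- fit_inside(toy, binomial)
print(m$call)   # the call refers to names local to fit_inside

# Re-evaluating the stored call at the caller's level, which is roughly what a
# refitting helper has to do, fails because `Xi` and the intended `family`
# are not available there.
res <- try(eval(m$call), silent = TRUE)
inherits(res, "try-error")
```

Refitting with a call that only mentions objects you own (your data frame, your formula) avoids the problem entirely.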

You could substitute things in the call and so on, but that gets too messy. Fitting is not expensive, so refit on winedata and pass that fit to cv.glm:

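# Logical row of the top-ranked model; the last column (Criterion) is dropped.
# Indexing row 1 directly avoids apply() returning a matrix instead of a list.
best_row <- unlist(res.best.logistic$BestModels[1, -ncol(winedata)])
# Names of the predictors kept in the best model
best_var <- names(best_row)[best_row]
new_form <- as.formula(paste("good ~", paste(best_var, collapse = " + ")))
# Refit on winedata so the stored call refers to winedata and good
fit <- glm(new_form, data = winedata, family = binomial)

best.cv.err <- cv.glm(winedata, fit, cost1, K = 10)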
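
If you want to try the refit-then-cross-validate pattern without the wine data, here is a small sketch on simulated data. The data frame toy and the table best_models are hypothetical stand-ins (best_models only mimics the layout of $BestModels, with made-up criterion values); only glm and cv.glm are the real functions from stats and boot.

```r
library(boot)  # cv.glm()

set.seed(42)
# Simulated stand-in for winedata: two predictors and a 0/1 factor response.
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$good <- factor(rbinom(200, 1, plogis(1.5 * toy$x1)))

# Hypothetical table laid out like $BestModels: one logical column per
# predictor plus a trailing Criterion column (values are made up).
best_models <- data.frame(x1 = c(TRUE, TRUE), x2 = c(FALSE, TRUE),
                          Criterion = c(100, 101))

# Predictor names flagged TRUE in the first (best) row.
best_row <- unlist(best_models[1, -ncol(best_models)])
best_var <- names(best_row)[best_row]
new_form <- as.formula(paste("good ~", paste(best_var, collapse = " + ")))

# Refit on the original data so the stored call refers to `toy` and `good`.
fit <- glm(new_form, data = toy, family = binomial)

# Misclassification cost at a 0.5 threshold, as in the question.
cost1 <- function(good, pi = 0) mean(abs(good - pi) > 0.5)
cv <- cv.glm(toy, fit, cost1, K = 10)

cv$delta[1]  # raw 10-fold CV estimate of the misclassification rate
cv$delta[2]  # bias-adjusted estimate
```

The delta component of the cv.glm result holds the two estimates, so for the wine data the 10-fold error would be read from best.cv.err$delta.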