将指标矩阵对应到 R 中的列名

Correspond Indicator Matrix to Column Names in R

我要解决的问题是 "How can I create an automated series of code which will pull the desired column header names from a data set to fit a general linearized model (glm)?" 我有一个包含 8 个变量的数据集;但是,我只想使用其中的 3 个来交叉验证并找到 "best" 模型。这是我想出的:

library(boot)
salary <- read.csv("salary_data.csv")
vars <- colnames(salary[c(2,3,7)])
nvars <- length(vars)
list.to.expand = vector(mode = "list", length = nvars)
for (i in 1:nvars){
  list.to.expand[[i]]=c(0,1)
}
model.spec.matrix <- expand.grid(list.to.expand)
vars
model.spec.matrix
names(model.spec.matrix) <- vars
vars.to.use <- model.spec.matrix[2,]
vars.to.use <- as.numeric(vars.to.use)
cn <- c()
for (i in 1:nrow(model.spec.matrix)){
  if(i==1){cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)]) 
  }
}
print(cn)
paste(cn, collapse = "+")
glm.out = glm(paste("LogACG~",paste(cn,collapse = "+"),sep = ""), family = gaussian, data = salary)
cv.err = cv.glm(salary, glm.out, K = 10)$delta[1]

我的问题出在 for 循环上。我试图构建一个循环,它将 "vars" 中的值附加到 "model.use" 中,但我似乎无法让它读取矩阵中的第二行。有什么建议么?谢谢

这里似乎发生了几件事。

您已将 LogACG 作为倒数第二行的解释变量("LogACG~,但它也是 vars 中的一个,最终以 model.vars因为vars <- colnames(salary[c(2,3,7)]),所以不对。

接下来,您的第二个 for 循环应该遍历 model.spec.matrix 的行,即

for(i in 1:nrow(model.spec.matrix)){

并以编程方式捕获该行指示的列名(变量),您可以这样做

cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])

在循环中。您还应该将 glmcv.glm 移到循环内部。

但是,这将每次覆盖 glm.outcv.err,因此您需要将它们创建为空列表并在每次迭代中追加该列表。

因此最终产品将如下所示:

# Since you can't use LogACG to explain itself, 
#    suppose you meant to use Engineering as a candidate X
vars <- colnames(salary[c(2,3,8)])

# Make your grid
model.spec.matrix        <- expand.grid(list.to.expand)
names(model.spec.matrix) <- vars

glm.out <- list(rep(NA, nrow(model.spec.matrix)))
cv.err  <- list(rep(NA, nrow(model.spec.matrix)))

for(i in 1:nrow(model.spec.matrix)){
  cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
  if(length(cn) == 0){cn <- "."}  
  tmp        <- glm(as.formula(paste("LogACG~",paste(cn,collapse = "+"),sep = "")), family = gaussian, data = salary)
  glm.out[i] <- capture.output(tmp$formula)
}
# > glm.out
# [[1]]
# [1] "LogACG ~ ."
# 
# [[2]]
# [1] "LogACG ~ Rank_Code"
# 
# [[3]]
# [1] "LogACG ~ Gender"
# 
# [[4]]
# [1] "LogACG ~ Rank_Code + Gender"
# 
# [[5]]
# [1] "LogACG ~ Engineering"
# 
# [[6]]
# [1] "LogACG ~ Rank_Code + Engineering"
# 
# [[7]]
# [1] "LogACG ~ Gender + Engineering"
# 
# [[8]]
# [1] "LogACG ~ Rank_Code + Gender + Engineering"

要获取列表元素中的整个模型对象,替换

glm.out[i] <- capture.output(tmp$formula)

glm.out    <- append(glm.out, tmp)

for(i in 1:nrow(model.spec.matrix)){
  cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
  if(length(cn) == 0){cn <- "."}  
  tmp        <- glm(as.formula(paste("LogACG~",paste(cn,collapse = "+"),sep = "")), family = gaussian, data = salary)
  tmp1       <- cv.glm(salary, tmp, K = 10)$delta[1]
  glm.out    <- append(glm.out, tmp)
  cv.err     <- append(cv.err, tmp1)
}

> tail(cv.err)
[[1]]
[1] 2.751025

[[2]]
[1] 2.758954

[[3]]
[1] 2.735063

[[4]]
[1] 2.768075

[[5]]
[1] 2.774433

[[6]]
[1] 2.748291

此外,您最好使用 bestglm 之类的包或仅使用默认包 stats 中的 step 函数(请参阅 ?step) .