将指标矩阵对应到 R 中的列名
Correspond Indicator Matrix to Column Names in R
我要解决的问题是 "How can I create an automated series of code which will pull the desired column header names from a data set to fit a general linearized model (glm)?" 我有一个包含 8 个变量的数据集;但是,我只想使用其中的 3 个来交叉验证并找到 "best" 模型。这是我想出的:
library(boot)
salary <- read.csv("salary_data.csv")
vars <- colnames(salary[c(2,3,7)])
nvars <- length(vars)
list.to.expand = vector(mode = "list", length = nvars)
for (i in 1:nvars){
list.to.expand[[i]]=c(0,1)
}
model.spec.matrix <- expand.grid(list.to.expand)
vars
model.spec.matrix
names(model.spec.matrix) <- vars
vars.to.use <- model.spec.matrix[2,]
vars.to.use <- as.numeric(vars.to.use)
cn <- c()
for (i in 1:nrow(model.spec.matrix)){
if(i==1){cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
}
}
print(cn)
paste(cn, collapse = "+")
glm.out = glm(paste("LogACG~",paste(cn,collapse = "+"),sep = ""), family = gaussian, data = salary)
cv.err = cv.glm(salary, glm.out, K = 10)$delta[1]
我的问题出在 for 循环上。我试图构建一个循环,它将 "vars" 中的值附加到 "model.use" 中,但我似乎无法让它读取矩阵中的第二行。有什么建议么?谢谢
这里似乎发生了几件事。
您已将 LogACG
作为倒数第二行的解释变量("LogACG~
,但它也是 vars
中的一个,最终以 model.vars
因为vars <- colnames(salary[c(2,3,7)])
,所以不对。
接下来,您的第二个 for
循环应该遍历 model.spec.matrix
的行,即
for(i in 1:nrow(model.spec.matrix)){
并以编程方式捕获该行指示的列名(变量),您可以这样做
cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
在循环中。您还应该将 glm
和 cv.glm
移到循环内部。
但是,这将每次覆盖 glm.out
和 cv.err
,因此您需要将它们创建为空列表并在每次迭代中追加该列表。
因此最终产品将如下所示:
# Since you can't use LogACG to explain itself,
# suppose you meant to use Engineering as a candidate X
vars <- colnames(salary[c(2,3,8)])
# Make your grid
model.spec.matrix <- expand.grid(list.to.expand)
names(model.spec.matrix) <- vars
glm.out <- list(rep(NA, nrow(model.spec.matrix)))
cv.err <- list(rep(NA, nrow(model.spec.matrix)))
for(i in 1:nrow(model.spec.matrix)){
cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
if(length(cn) == 0){cn <- "."}
tmp <- glm(as.formula(paste("LogACG~",paste(cn,collapse = "+"),sep = "")), family = gaussian, data = salary)
glm.out[i] <- capture.output(tmp$formula)
}
# > glm.out
# [[1]]
# [1] "LogACG ~ ."
#
# [[2]]
# [1] "LogACG ~ Rank_Code"
#
# [[3]]
# [1] "LogACG ~ Gender"
#
# [[4]]
# [1] "LogACG ~ Rank_Code + Gender"
#
# [[5]]
# [1] "LogACG ~ Engineering"
#
# [[6]]
# [1] "LogACG ~ Rank_Code + Engineering"
#
# [[7]]
# [1] "LogACG ~ Gender + Engineering"
#
# [[8]]
# [1] "LogACG ~ Rank_Code + Gender + Engineering"
要获取列表元素中的整个模型对象,替换
glm.out[i] <- capture.output(tmp$formula)
与
glm.out <- append(glm.out, tmp)
即
for(i in 1:nrow(model.spec.matrix)){
cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
if(length(cn) == 0){cn <- "."}
tmp <- glm(as.formula(paste("LogACG~",paste(cn,collapse = "+"),sep = "")), family = gaussian, data = salary)
tmp1 <- cv.glm(salary, tmp, K = 10)$delta[1]
glm.out <- append(glm.out, tmp)
cv.err <- append(cv.err, tmp1)
}
> tail(cv.err)
[[1]]
[1] 2.751025
[[2]]
[1] 2.758954
[[3]]
[1] 2.735063
[[4]]
[1] 2.768075
[[5]]
[1] 2.774433
[[6]]
[1] 2.748291
此外,您最好使用 bestglm
之类的包或仅使用默认包 stats
中的 step
函数(请参阅 ?step
) .
我要解决的问题是 "How can I create an automated series of code which will pull the desired column header names from a data set to fit a general linearized model (glm)?" 我有一个包含 8 个变量的数据集;但是,我只想使用其中的 3 个来交叉验证并找到 "best" 模型。这是我想出的:
library(boot)
salary <- read.csv("salary_data.csv")
vars <- colnames(salary[c(2,3,7)])
nvars <- length(vars)
list.to.expand = vector(mode = "list", length = nvars)
for (i in 1:nvars){
list.to.expand[[i]]=c(0,1)
}
model.spec.matrix <- expand.grid(list.to.expand)
vars
model.spec.matrix
names(model.spec.matrix) <- vars
vars.to.use <- model.spec.matrix[2,]
vars.to.use <- as.numeric(vars.to.use)
cn <- c()
for (i in 1:nrow(model.spec.matrix)){
if(i==1){cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
}
}
print(cn)
paste(cn, collapse = "+")
glm.out = glm(paste("LogACG~",paste(cn,collapse = "+"),sep = ""), family = gaussian, data = salary)
cv.err = cv.glm(salary, glm.out, K = 10)$delta[1]
我的问题出在 for 循环上。我试图构建一个循环,它将 "vars" 中的值附加到 "model.use" 中,但我似乎无法让它读取矩阵中的第二行。有什么建议么?谢谢
这里似乎发生了几件事。
您已将 LogACG
作为倒数第二行的解释变量("LogACG~
,但它也是 vars
中的一个,最终以 model.vars
因为vars <- colnames(salary[c(2,3,7)])
,所以不对。
接下来,您的第二个 for
循环应该遍历 model.spec.matrix
的行,即
for(i in 1:nrow(model.spec.matrix)){
并以编程方式捕获该行指示的列名(变量),您可以这样做
cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
在循环中。您还应该将 glm
和 cv.glm
移到循环内部。
但是,这将每次覆盖 glm.out
和 cv.err
,因此您需要将它们创建为空列表并在每次迭代中追加该列表。
因此最终产品将如下所示:
# Since you can't use LogACG to explain itself,
# suppose you meant to use Engineering as a candidate X
vars <- colnames(salary[c(2,3,8)])
# Make your grid
model.spec.matrix <- expand.grid(list.to.expand)
names(model.spec.matrix) <- vars
glm.out <- list(rep(NA, nrow(model.spec.matrix)))
cv.err <- list(rep(NA, nrow(model.spec.matrix)))
for(i in 1:nrow(model.spec.matrix)){
cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
if(length(cn) == 0){cn <- "."}
tmp <- glm(as.formula(paste("LogACG~",paste(cn,collapse = "+"),sep = "")), family = gaussian, data = salary)
glm.out[i] <- capture.output(tmp$formula)
}
# > glm.out
# [[1]]
# [1] "LogACG ~ ."
#
# [[2]]
# [1] "LogACG ~ Rank_Code"
#
# [[3]]
# [1] "LogACG ~ Gender"
#
# [[4]]
# [1] "LogACG ~ Rank_Code + Gender"
#
# [[5]]
# [1] "LogACG ~ Engineering"
#
# [[6]]
# [1] "LogACG ~ Rank_Code + Engineering"
#
# [[7]]
# [1] "LogACG ~ Gender + Engineering"
#
# [[8]]
# [1] "LogACG ~ Rank_Code + Gender + Engineering"
要获取列表元素中的整个模型对象,替换
glm.out[i] <- capture.output(tmp$formula)
与
glm.out <- append(glm.out, tmp)
即
for(i in 1:nrow(model.spec.matrix)){
cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
if(length(cn) == 0){cn <- "."}
tmp <- glm(as.formula(paste("LogACG~",paste(cn,collapse = "+"),sep = "")), family = gaussian, data = salary)
tmp1 <- cv.glm(salary, tmp, K = 10)$delta[1]
glm.out <- append(glm.out, tmp)
cv.err <- append(cv.err, tmp1)
}
> tail(cv.err)
[[1]]
[1] 2.751025
[[2]]
[1] 2.758954
[[3]]
[1] 2.735063
[[4]]
[1] 2.768075
[[5]]
[1] 2.774433
[[6]]
[1] 2.748291
此外,您最好使用 bestglm
之类的包或仅使用默认包 stats
中的 step
函数(请参阅 ?step
) .