Loop through a data frame and create a model for each column, evaluated on the same test set

Question

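I have a data frame with more than 20 columns, and I want to fit a glm model for each of those columns and then evaluate every model on the same test set. This is my attempt: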

# Train-test splitting
smp_size <- floor(0.70 * nrow(x))
index <- sample(seq_len(nrow(x)),size = smp_size)
train <- x[index, ]
test <- x[-index, ]

for (i in 1:22) {

   names(train)[names(train) == names(train[i])] <- 'variab'
   names(test)[names(test) == names(test[i])] <- 'variab'

   mod <- glm(Y ~ variab, family = binomial, data = train)

  assign(paste0("val", sep = "_", letters[i]), as.numeric(performance(
    prediction(predict(mod, newdata = test, type = "response"),test$Y), 
    measure = "auc")@y.values[[1]]))
}

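But this doesn't work: it just assigns the name "variab" to every column, and ends up running the same model for each column. How can I make this loop go through every column in the data frame?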

Answer 1

Here is an idea for you. I hope it meets your needs. I don't know where your performance() or prediction() functions come from, so I left them out of my example.

data(iris)
predictors <- names(iris)[-1]
response <- names(iris)[1]

# rescale the response to [0, 1], because the example data is ill-chosen for a binomial model:
iris[,response] <- iris[,response]/max(iris[,response])

# sample
smp_size <- floor(.7*nrow(iris))
set.seed(20171212)
idx <- sample(seq_len(nrow(iris)), size=smp_size)
train <- iris[idx,]
test <- iris[-idx,]


for (i in predictors) {
  tmp.test <- data.frame(pred=get(i,test), resp=get(response, test))
  tmp.train <- data.frame(pred=get(i,train), resp=get(response, train))


  mod <- glm(resp ~ pred, family=binomial, data=tmp.train)

  assign(paste0("val", sep="_", i),
         data.frame(predicted=as.numeric(predict(mod, newdata=tmp.test, type="response")),
                    actual=get(response, test)))
}

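Basically, this is what you already did. You were already using assign(), and I think get() is its complement and just as useful. I would also recommend avoiding numeric indices where possible and iterating over names when you use a loop, because that makes it simple to write useful cat() messages.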
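As a rough sketch of the same idea without assign() and get(), the per-column results can be collected in one named vector. Everything specific here is an assumption for illustration: the copy `dat` of iris, the made-up binary response `Y`, and the helper `auc_rank()`, which computes a rank-based (Mann-Whitney) AUC in base R. The performance() and prediction() calls in the question look like they come from the ROCR package, but that is a guess, so they are not used.

```r
# Hypothetical example data: iris with a made-up binary response Y
dat <- iris
dat$Y <- as.integer(dat$Species == "versicolor")
response <- "Y"
predictors <- names(iris)[1:4]   # the four numeric columns

# Same 70/30 split for every model
set.seed(20171212)
idx <- sample(seq_len(nrow(dat)), size = floor(0.7 * nrow(dat)))
train <- dat[idx, ]
test <- dat[-idx, ]

# Illustrative helper: rank-based AUC, assumes labels are 0/1
# and that both classes occur in `label`
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# One glm per predictor, each evaluated on the same test set
aucs <- sapply(predictors, function(p) {
  mod <- glm(reformulate(p, response), family = binomial, data = train)
  pred <- predict(mod, newdata = test, type = "response")
  auc_rank(pred, test[[response]])
})

print(aucs)
```

reformulate() builds the formula `Y ~ <column>` from the column name, so no column has to be renamed, and the result is a single vector named by predictor rather than a set of separate `val_*` objects.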

modeling

iterator

loops

r

dataframe