Caret: glmnet warning - x should be a matrix with 2 or more columns

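When I pass a single numeric variable as the only predictor to glmnet through caret, I get an error message stating "x should be a matrix with 2 or more columns", but when I pass a single factor variable, the train function runs as expected. Adding a factor variable to the single numeric variable also works as expected. Why is this? It has been very problematic so far. I know that glmnet needs a matrix rather than a data frame, but caret should handle that conversion, as it evidently does for factor variables. Moreover, I need to be able to run my analysis consistently within the caret framework, and I need my data as a data frame. Here is an example; please ignore the warning messages caused by having too few observations, which are unrelated to this problem.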

Any help would be greatly appreciated, as this is driving me crazy!

df <- structure(list(
  Y = structure(c(1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                  1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L),
                .Label = c("No", "Yes"), class = "factor"),
  A = c("Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes",
        "No", "No", "Yes", "Yes", "N", "No", "No", "No", "No", "No"),
  B = c(30, 6, 12, 12, 12, 12, 12, 4, 12, 32,
        12, 12, 4, 24, 8, 12, 15, 6, 12, 12),
  C = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
                  2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L),
                .Label = c("A", "B"), class = "factor")),
  .Names = c("Y", "A", "B", "C"), row.names = c(NA, 20L), class = "data.frame")



library(caret)  # provides train(), trainControl(), twoClassSummary()

# set up the tuning grid
tuneGrid <- expand.grid(.alpha = seq(0, 1, 0.05), .lambda = seq(0, 2, 0.05))
# 10-fold CV
fitControl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                           summaryFunction = twoClassSummary)

# works with a single factor variable (ignore warnings based on small sample size)
train(Y ~ A, data = df[c("Y", "A")], method = "glmnet",
      family = "binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")

# returns an error message when a single numeric independent variable is passed
train(Y ~ B, data = df[c("Y", "B")], method = "glmnet",
      family = "binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")

# works when a factor variable is added to the numeric variable (ignore warnings based on small sample size)
train(Y ~ B + C, data = df[c("Y", "B", "C")], method = "glmnet",
      family = "binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")

Try this trick:

df$ones <- rep(1, nrow(df))
train(Y ~ ones+B, data=df[c("Y", "B", "ones")], method="glmnet", 
    family="binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")
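
A rough way to see what the trick changes is to count predictor columns before and after adding the constant. The sketch below assumes that the formula interface builds the predictors with base R's model.matrix and then drops the intercept; that is an assumption about caret's internals, not something shown in the question. The data frame `d` and the names `x_single` and `x_padded` are made up for the illustration; the values of B are copied from the question.

```r
# B is copied from the question; `ones` is the constant column from the trick.
d <- data.frame(B = c(30, 6, 12, 12, 12, 12, 12, 4, 12, 32,
                      12, 12, 4, 24, 8, 12, 15, 6, 12, 12))
d$ones <- rep(1, nrow(d))

# Assumption: predictors are expanded with model.matrix() and the
# "(Intercept)" column is removed before the matrix reaches glmnet.
x_single <- model.matrix(~ B, data = d)[, -1, drop = FALSE]
x_padded <- model.matrix(~ ones + B, data = d)[, -1, drop = FALSE]

dim(x_single)  # rows and columns with B alone
dim(x_padded)  # rows and columns with the constant column added
```

The constant column carries no information for the fit; it only widens the matrix so that it has the two columns the error message asks for.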

The glmnet function performs this check near the top of the function:

np = dim(x)
if (is.null(np) | (np[2] <= 1)) 
    stop("x should be a matrix with 2 or more columns")
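
To see the check in action without fitting a model, here is a minimal sketch. `check_x` is a hypothetical helper that mirrors the quoted lines; it is not part of glmnet, and it uses `||` where the quoted code uses `|`. The matrices `one_col` and `two_col` are made-up inputs.

```r
# Hypothetical helper mirroring the dimension check quoted above;
# it is not a glmnet function.
check_x <- function(x) {
  np <- dim(x)
  if (is.null(np) || np[2] <= 1) {
    stop("x should be a matrix with 2 or more columns")
  }
  invisible(TRUE)
}

one_col <- matrix(c(30, 6, 12, 12), ncol = 1)  # a single numeric predictor
two_col <- cbind(one_col, 1)                   # same predictor plus a constant

# Capture the error message instead of stopping the script.
one_col_result <- tryCatch(check_x(one_col),
                           error = function(e) conditionMessage(e))
two_col_result <- check_x(two_col)

print(one_col_result)
print(two_col_result)
```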

You can see the full code for yourself by running glmnet without any parentheses.

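I believe the reason it works with a factor is that caret has already preprocessed your dataset and run dummyVars on any factor columns, creating one column for each level of the factor. This is common in modelling and machine learning, and is sometimes called one-hot encoding or binary encoding.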

A factor column with the values 'red', 'green', and 'blue' would produce three columns named 'red', 'green', and 'blue'.
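
As a rough illustration of this encoding, the sketch below uses base R's model.matrix as a stand-in for dummyVars, since which of the two caret calls internally is not confirmed here. The `toy` data frame is made up; A and B are copied from the question's data. Note that model.matrix prefixes each column name with the variable name, and that with the default contrasts and the intercept dropped it yields one column fewer than the number of levels.

```r
# Hypothetical toy data: a three-level factor.
toy <- data.frame(colour = factor(c("red", "green", "blue", "red")))

# Full one-hot encoding: one column per level (no intercept).
one_hot <- model.matrix(~ colour - 1, data = toy)
colnames(one_hot)

# Default treatment contrasts with the intercept dropped: one column fewer.
contrast_cols <- model.matrix(~ colour, data = toy)[, -1, drop = FALSE]
ncol(contrast_cols)

# Column A from the question holds three distinct values ("N", "No", "Yes"),
# so it expands to more than one column; numeric B stays a single column.
A <- c("Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes",
       "No", "No", "Yes", "Yes", "N", "No", "No", "No", "No", "No")
B <- c(30, 6, 12, 12, 12, 12, 12, 4, 12, 32,
       12, 12, 4, 24, 8, 12, 15, 6, 12, 12)
a_cols <- ncol(model.matrix(~ A)[, -1, drop = FALSE])
b_cols <- ncol(model.matrix(~ B)[, -1, drop = FALSE])
c(A = a_cols, B = b_cols)
```

Under these assumptions, a factor with only two levels would expand to a single column after the intercept is dropped, so it could hit the same check as a lone numeric variable.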