cv.glmnet warnings for logit model (although binomial classes with more than 8 obs)?

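I have a simple table in which I am trying to work out whether my covariates (genes) are associated with cancer patients. Since there are many covariates (~800), I am running a logistic regression with a LASSO penalty using glmnet(), with cross-validation via cv.glmnet(). The first part seems to run fine, without any warnings. In the validation step I get these messages: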

Warning messages:

1: In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
one multinomial or binomial class has fewer than 8 observations; dangerous ground

2: In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
one multinomial or binomial class has fewer than 8 observations; dangerous ground

3: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold

Here is a sample of the data I am using (with only 7 covariates):

> data
     Tumor     Probe_1     Probe_2    Probe_3      Probe_4    Probe_5    Probe_6     Probe_7
S_1     No -1.41509461 -3.92144111 -4.3319583 -4.894204000 -5.5379790  2.9031321  0.80587018
S_2     No -0.94584134 -2.77641045 -3.3560507 -2.211370963 -6.0006283  5.1775379  1.45389838
S_3     No -0.95188379 -3.47742475 -1.9058528 -3.019003727 -5.7203533  2.2121110  1.83080221
S_4     No -2.27462408 -3.83136845 -4.1285407 -1.691782991 -6.3683810  6.4500360  1.22882676
S_5     No -0.74983930 -2.51738976 -2.1747453 -2.279177452 -3.5778674  2.3518098  1.04400722
S_6     No -1.10189012 -3.12456412 -3.1800114 -2.567847449 -5.7474062  3.7589517  1.70868881
S_7    Yes  0.03970897 -1.98928788 -1.2119801 -0.686115233  1.0235521  0.3666321 -2.35612013
S_8    Yes  0.01597890 -1.20865821 -0.4579608 -1.192134064  1.4096178  2.4922013  0.40925359
S_9    Yes -0.27984931 -2.15706349 -2.4641827  0.047430187  1.6129360  0.5129123 -1.34833497
S_10   Yes  0.93021040 -1.97824406 -0.2918638  0.979103921 -2.5054538 -0.7654758 -2.48255982
S_11   Yes  0.83353713 -1.79506256 -2.0438707  0.460100440  0.9242979 -0.2319373 -1.51113570
S_12   Yes  0.18570649  0.05800963  0.2385482  0.433187887 -2.0097881  2.2284231  0.74761104
S_13   Yes  0.19232213 -0.95197653 -0.8496967 -0.105562938  1.0253468  0.6895510 -1.31659822
S_14   Yes  0.95731937 -1.53396032 -0.1456985  1.804472462 -3.3191177  0.2357909 -0.91231503
S_15   Yes  0.45860215 -1.36153814 -1.0998994 -0.003680416  2.0982345 -0.5042816 -1.07098039
S_16    No -0.02045748 -2.07952404 -1.5161549  1.095944357 -2.9224003  3.6426993  0.43034932
S_17    No  0.71109429 -1.19594432 -0.2472489 -0.333784895  0.7016542  0.1602559 -1.96375484
S_18    No  0.25009776 -0.98431835 -1.2113967 -0.062552222 -0.5772906  1.9909411  0.34956032
S_19    No  0.10396440 -1.43761294 -1.5490060 -0.900273908 -1.9889734  2.6280227  0.02848154
S_20    No -1.67179799 -0.69662635  0.3057564  0.497189699  1.8436791 -0.6753654 -1.74453932
S_21    No -0.33691459 -2.53752284 -2.7764968 -2.258180090  1.5861724  1.4335190  1.14224595
S_22    No -0.20888250 -3.32322098 -2.1782679  0.293379051 -5.8727867  2.3515395  1.89576377
S_23    No  0.48536983 -2.00023465 -0.8494739 -1.323411080 -6.1974792  0.2637433 -0.71707341
S_24    No  0.42733184 -2.23335363 -2.4388843  0.357150391 -2.8792254  0.4145872 -0.98182166

The Tumor column is already set as a factor:

> data$Tumor
 [1] No  No  No  No  No  No  Yes Yes Yes Yes Yes Yes Yes Yes Yes No  No  No  No  No  No  No  No  No 
Levels: No Yes

Preparing the data and running the glmnet() function:

b <- paste(colnames(data)[2:ncol(data)], collapse=" + ")
b <- as.formula(paste("~ ",b))

x <- model.matrix(b, data)

y <- data$Tumor

library("glmnet")
lasso_tumor <- glmnet(x, y, family="binomial", standardize=T, alpha=1, intercept = F)

Up to this point there are no errors or warnings. But if I now run cv.glmnet(), those warning messages appear:

> cv.lasso_tumor <- cv.glmnet(x, y, family="binomial", standardize=T, alpha=1, nfolds=10, parallel=TRUE, intercept=F)
Warning messages:
1: In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  :
  one multinomial or binomial class has fewer than 8  observations; dangerous ground
2: In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  :
  one multinomial or binomial class has fewer than 8  observations; dangerous ground
3: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold

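My guess is that the Tumor group is too small (n=9) to run the validation: because this step splits the samples randomly, the number of Tumor samples available in each split will be quite limited. Does that make sense? I read in this thread that this can be a problem and that it can be worked around (see the comment by @smci). Any idea how?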

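Alternatively, is it possible to skip the cross-validation part and just proceed with the LASSO logit? In that case, what would be a reasonable lambda cutoff for finding the genes (called "probes" here) that are associated with my binomial classification?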

Any help is much appreciated. Thanks!

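As you have already figured out, the problem lies in the CV procedure. If a class has few observations, then when the data are split into folds it can happen that, in some iterations, the training folds contain fewer than 8 observations of that class, which is "dangerous" for the optimization algorithm.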
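To see this on the outcome vector from the question, here is a minimal base R sketch. It assumes folds are drawn as a random, size-balanced labelling (which, as far as I know, mirrors the cv.glmnet default) and counts how many "Yes" samples remain in the training data for each held-out fold. The seed and the name `yes_in_training` are arbitrary choices for illustration.

```r
# Outcome from the question: 9 "Yes" and 15 "No", in the same order as data$Tumor
y <- factor(c(rep("No", 6), rep("Yes", 9), rep("No", 9)))

set.seed(1)   # arbitrary seed, only so the draw can be repeated
nfolds <- 10

# Random fold labels of (nearly) equal size, one label per observation
foldid <- sample(rep(seq_len(nfolds), length.out = length(y)))

# For each held-out fold, count the "Yes" observations left for training
yes_in_training <- sapply(seq_len(nfolds),
                          function(k) sum(y[foldid != k] == "Yes"))
yes_in_training
```

Any entry below 8 corresponds to a training set that would trigger the lognet warning. With only 9 "Yes" samples, that happens as soon as a single held-out fold contains two or more of them.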

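As a first solution, you can try reducing the number of folds from 10 to 5. If that is not enough, you can try assigning the folds yourself by specifying the fold index of each observation (parameter foldid) and ensuring that each class keeps at least 8 observations in every iteration. Otherwise LOOCV is an option, which is better but more computationally intensive.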
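A minimal sketch of the foldid approach, in base R, could look like the following. It deals fold labels separately within each class so that every held-out fold contains at most one "Yes" sample. The choice of 9 folds and the variable names are my own assumptions, not something the question prescribes. The cv.glmnet() call is left commented out because it needs the `x` matrix built in the question.

```r
# Outcome from the question: 9 "Yes" and 15 "No"
y <- factor(c(rep("No", 6), rep("Yes", 9), rep("No", 9)))

set.seed(1)   # arbitrary seed
nfolds <- 9   # assumed choice: one "Yes" per held-out fold

# Assign fold labels class by class (a simple stratified split)
foldid <- integer(length(y))
for (cls in levels(y)) {
  idx <- which(y == cls)
  foldid[idx] <- sample(rep(seq_len(nfolds), length.out = length(idx)))
}

# Verify that every training set keeps at least 8 "Yes" observations
yes_in_training <- sapply(seq_len(nfolds),
                          function(k) sum(y[foldid != k] == "Yes"))
stopifnot(all(yes_in_training >= 8))

# With x as built in the question:
# cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, foldid = foldid)
# LOOCV alternative: foldid = seq_along(y)
```

The message about grouped=FALSE concerns fold size rather than class counts, so with 24 samples it will probably still appear for either variant.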