k-fold cross validation: how to filter data based on a randomly generated integer variable in Stata

Question

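The following seems obvious, yet it does not behave the way I expect. I want to do k-fold cross validation without using an SSC package, and thought I could just filter my data and run my own regressions on the subsets.

First I generate a variable holding a random integer between 1 and 5 (for 5-fold cross validation), then I loop over each fold number. I want to filter the data by fold number, but the boolean filter does not filter anything. Why?

Bonus: what is the best way to capture all of the test MSEs and average them? In Python I would just make a list or a numpy array and take the mean.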

gen randint = floor((6-1)*runiform()+1)

recast int randint

forval b = 1(1)5 {
    xtreg c.DepVar ///  // training set
    c.IndVar1 ///
    c.IndVar2 ///
    if randint !=`b' ///
    , fe vce(cluster uuid)

    xtreg c.DepVar /// // test set, needs to be performed with model above, not a               
    c.IndVar1 ///      // new model...
    c.IndVar2 ///
    if randint ==`b' ///
    , fe vce(cluster uuid)
}

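Edit: the test set needs to be evaluated with the model that was fit on the training set. I changed the comments in the code to reflect this.

In the end, the filtering problem came down to my using a scalar inside macro quotes to define the bound. I had: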

replace randint = floor((`varscalar'-1)*runiform()+1)

instead of just

replace randint = floor((varscalar-1)*runiform()+1)

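When and where to use the quotes in Stata confuses me. I cannot just use varscalar in a loop, I have to use `=varscalar', yet for some reason I can use varscalar - 1 and get the expected result. Interestingly, I cannot use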

replace randint = floor((`varscalar')*runiform()+1)

I suppose I should use

replace randint = floor((`=varscalar')*runiform()+1)

So why can I use the version with the minus one but without the equals sign?
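
A likely explanation, offered as a sketch rather than a diagnosis of the original do-file: `varscalar' asks Stata for a local macro named varscalar, and an undefined local macro expands to nothing. Assuming varscalar is a scalar (the value 6 below is hypothetical) and no local macro of that name exists, (`varscalar'-1) becomes (-1). That is valid syntax, so the command runs, but every draw floors to 0 and no fold number from 1 to 5 ever matches. Without the -1 the expression becomes ()*runiform()+1, which is a syntax error.

```stata
clear
set seed 12345
set obs 1000
scalar varscalar = 6            // hypothetical upper bound for 5 folds

// no local macro named varscalar exists, so this is floor((-1)*runiform()+1)
gen bad = floor((`varscalar'-1)*runiform()+1)
// the scalar is referenced by name inside the expression
gen good1 = floor((varscalar-1)*runiform()+1)
// `=exp' evaluates the scalar first and substitutes its value (6)
gen good2 = floor((`=varscalar'-1)*runiform()+1)

tabulate bad      // a single value, 0
tabulate good1    // integers between 1 and 5
tabulate good2    // integers between 1 and 5
```

Both working forms give the same range here; the `=varscalar' form is only needed where Stata expects literal text or a number rather than an expression, such as the range of a forvalues loop.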

The answer below was still very helpful, and I learned a lot from it.

Answer 1

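There are actually two things going on here that are not necessarily directly related: 1) how to filter data using a randomly generated integer value, and 2) the k-fold cross validation procedure.

For the first, I will leave an example below that may help you solve the problem in Stata, using some tools that transfer easily to other problems (such as creating and manipulating matrices to store metrics). However, I would not call either your code sketch or my example "k-fold cross validation", mainly because both fit the model on the test data as well as on the training data. Strictly speaking, the model should be trained on the training data and, using those parameters, its performance should then be evaluated on the test data.

For further reference on the procedure, Scikit-learn has done brilliant work explaining it, including several visualizations.

That being said, here is something that might help.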

clear all
set seed 4
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
gen byte randint = runiformint(1, 5)
tab randint
/*
    randint |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         17       17.00       17.00
          2 |         18       18.00       35.00
          3 |         21       21.00       56.00
          4 |         19       19.00       75.00
          5 |         25       25.00      100.00
------------+-----------------------------------
      Total |        100      100.00 
*/
// create a matrix to store results
matrix res = J(5,4,.)
matrix colnames res = "R2_fold"  "MSE_fold" "R2_hold"  "MSE_hold"
matrix rownames res ="1" "2" "3" "4" "5"
// show the formatted empty matrix
matrix li res
/*
res[5,4]
    R2_fold  MSE_fold   R2_hold  MSE_hold
1         .         .         .         .
2         .         .         .         .
3         .         .         .         .
4         .         .         .         .
5         .         .         .         .
*/

// loop over the folds
forvalues b = 1/5 {
    // fit the model on fold `b' only
    qui reg y x1 x2 if randint == `b'
    // save R-squared for fold `b'
    matrix res[`b', 1] = e(r2)
    // save e(rmse), the root MSE (square it for the MSE the column name suggests)
    matrix res[`b', 2] = e(rmse)

    // refit the model on the other four folds (randint != `b')
    qui reg y x1 x2 if randint != `b'
    // save R-squared for this second fit; it is in-sample, not a test metric
    matrix res[`b', 3] = e(r2)
    // save root MSE for this second fit; likewise in-sample
    matrix res[`b', 4] = e(rmse)
}

// Show matrix with stored metrics
mat li res 
/*
res[5,4]
     R2_fold   MSE_fold    R2_hold   MSE_hold
1  .50949187  1.2877728  .74155365  1.0070531
2  .89942838  .71776458  .66401888   1.089422
3  .75542004  1.0870525  .68884359  1.0517139
4  .68140328  1.1103964  .71990589  1.0329239
5  .68816084  1.0017175  .71229925  1.0596865
*/

// some matrix algebra workout to obtain the mean of the metrics
mat U = J(rowsof(res),1,1)
mat sum = U'*res
/* create vector of column (variable) means */
mat mean_res = sum/rowsof(res)
// show the average of the metrics across the folds
mat li mean_res
/*
mean_res[1,4]
      R2_fold   MSE_fold    R2_hold   MSE_hold
c1  .70678088  1.0409408  .70532425  1.0481599
*/
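
As noted earlier in this answer, a strict k-fold procedure fits the model on the training folds and scores it on the held-out fold. A minimal sketch of that variant follows, reusing the simulated data from the example above. The names fold, cv and cvmean are illustrative choices, not part of the original code, and plain regress stands in for the xtreg model in the question.

```stata
clear all
set seed 4
set obs 100
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5*x1 + 1.5*x2 + rnormal()
gen byte fold = runiformint(1, 5)

// one row per fold: training MSE and held-out (test) MSE
matrix cv = J(5, 2, .)
matrix colnames cv = MSE_train MSE_test

forvalues b = 1/5 {
    // fit on the four training folds only
    quietly regress y x1 x2 if fold != `b'
    // training MSE: residual sum of squares over training N (no df correction)
    matrix cv[`b', 1] = e(rss) / e(N)

    // predict the held-out fold with the coefficients just estimated
    tempvar yhat sqerr
    quietly predict double `yhat' if fold == `b', xb
    quietly gen double `sqerr' = (y - `yhat')^2 if fold == `b'
    quietly summarize `sqerr', meanonly
    matrix cv[`b', 2] = r(mean)
    drop `yhat' `sqerr'
}
matrix list cv

// average each column across the five folds
matrix cvmean = J(1, rowsof(cv), 1/rowsof(cv)) * cv
matrix colnames cvmean = MSE_train MSE_test
matrix list cvmean
```

The matrix cv plays the role of the Python list or numpy array mentioned in the question, and the second column of cvmean is the cross-validated test MSE.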
