How to understand the arguments "data" and "subset" in the randomForest R package?

Question

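Arguments

data: an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from.
subset: an index vector indicating which rows should be used. (NOTE: If given, this argument must be named.)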

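My questions:

Why is the data argument "optional"? If data is optional, where does the training data come from? And what exactly does "By default the variables are taken from the environment which randomForest is called from" mean?
Why do we need the subset argument? Say we have the iris data set. If I want to use the first 100 rows as the training set, I can just select training_data <- iris[1:100,]. Why bother? What is the benefit of using subset?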

Answer 1

This approach is not uncommon, and it is certainly not unique to randomForest.
```
mpg <- mtcars$mpg
disp <- mtcars$disp
lm(mpg~disp)
# Call:
# lm(formula = mpg ~ disp)
# Coefficients:
# (Intercept)         disp  
#    29.59985     -0.04122  
```
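So when lm (in this case) tries to resolve the variables referenced in the formula mpg~disp, it looks first in data (if provided) and then in the calling environment (strictly speaking, the formula's environment, which is normally where the call was made). A further example: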
```
rm(mpg,disp)
mpg2 <- mtcars$mpg
lm(mpg2~disp)
# Error in eval(predvars, data, env) : object 'disp' not found
lm(mpg2~disp, data=mtcars)
# Call:
# lm(formula = mpg2 ~ disp, data = mtcars)
# Coefficients:
# (Intercept)         disp  
#    29.59985     -0.04122  
```
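(Notice that mpg2 is not in mtcars, so the last call used both lookup methods to find its data. I don't use this feature myself; I prefer to supply all of the data in the call. It is not hard to think of examples where reproducibility suffers if you don't.)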
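To make the reproducibility concern concrete, here is a minimal sketch of the lookup order. It uses base R only; the workspace variable `disp` is a hypothetical decoy invented for this illustration and has nothing to do with `mtcars$disp`.

```r
# Hypothetical decoy: a workspace variable that shares its name with a column
disp <- rep(1:2, length.out = nrow(mtcars))

# No data argument: 'disp' is resolved from the workspace (the decoy)
fit_env <- lm(mtcars$mpg ~ disp)

# data argument given: 'disp' is found in mtcars first, so the decoy is ignored
fit_data <- lm(mpg ~ disp, data = mtcars)

coef(fit_env)   # slope reflects the decoy
coef(fit_data)  # slope reflects mtcars$disp

rm(disp)        # clean up the decoy
```

The two calls look almost identical, yet they fit different models. Which one you get depends on what happens to be lying around in the workspace, which is why passing data explicitly is the safer habit.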
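Similarly, many comparable functions (including lm) accept a subset= argument, so it is consistent that randomForest includes one too. I believe it is merely a convenience argument, since the following are roughly equivalent: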
```
lm(mpg~disp, data=mtcars, subset= cyl==4)

lm(mpg~disp, data=mtcars[mtcars$cyl == 4,])

mt <- mtcars[ mtcars$cyl == 4, ]
lm(mpg~disp, data=mt)
```
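Using subset allows slightly simpler referencing (cyl versus mtcars$cyl), and its usefulness grows as the number of referenced variables increases (i.e., for "code golf" purposes). But the same thing can be done with other mechanisms, such as with, so ... it is mostly a matter of personal preference.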
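As a rough check of that equivalence, the sketch below fits the same model three ways with a two-variable condition and compares the coefficients. The condition itself (`cyl == 4 & hp < 100`) is chosen only for illustration.

```r
# subset=: column names are resolved inside mtcars, no prefix needed
fit_subset <- lm(mpg ~ disp, data = mtcars, subset = cyl == 4 & hp < 100)

# Manual indexing: every column needs the mtcars$ prefix
fit_bracket <- lm(mpg ~ disp,
                  data = mtcars[mtcars$cyl == 4 & mtcars$hp < 100, ])

# with(): evaluates the condition inside mtcars, then indexes as usual
fit_with <- lm(mpg ~ disp,
               data = mtcars[with(mtcars, cyl == 4 & hp < 100), ])

# All three select the same rows, so the fitted coefficients agree
all.equal(coef(fit_subset), coef(fit_bracket))
all.equal(coef(fit_subset), coef(fit_with))
```

The fits agree; the only difference is how much typing the row selection takes, and the stored call differs, which is why the forms are "roughly" rather than exactly equivalent.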

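Edit: as joran pointed out, randomForest (and other functions, but notably not lm) can be called either with a formula, which is where you would typically use the data argument, or by specifying the predictors and the response separately through the arguments x and y, as in the following examples taken from ?randomForest (ignore the fact that the other arguments are inconsistent):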

```
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE, proximity=TRUE)
iris.rrf <- randomForest(iris[-1], iris[[1]], ntree=101, proximity=TRUE, oob.prox=FALSE)
```
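Tying this back to the original question, here is a minimal sketch of the three ways to train on a chosen set of rows. It assumes the randomForest package is installed, and the guard skips the example otherwise. The index vector `train_idx` is a hypothetical choice made for this illustration: it keeps 100 rows and covers all three species.

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)

  # Hypothetical training rows: drop every third row, leaving 100 of 150
  train_idx <- which(seq_len(nrow(iris)) %% 3 != 0)

  # Formula interface with subset=; it has to be passed by name
  rf_subset <- randomForest::randomForest(Species ~ ., data = iris,
                                          subset = train_idx)

  # Formula interface with the rows selected up front
  rf_manual <- randomForest::randomForest(Species ~ ., data = iris[train_idx, ])

  # x/y interface: no formula and no data argument, rows selected by indexing
  rf_xy <- randomForest::randomForest(x = iris[train_idx, -5],
                                      y = iris$Species[train_idx])

  # Each forest holds one out-of-bag prediction per training row
  length(rf_subset$predicted)
  length(rf_manual$predicted)
  length(rf_xy$predicted)
}
```

All three forests are trained on the same 100 rows. The forests themselves are random, so the individual trees are not expected to match unless the seed is reset before each call.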

