Base estimator meaning in the context of Isolation Forest
I'm having trouble understanding what "base estimator" means in the context of Isolation Forest.
One of the parameters of the Isolation Forest method in scikit-learn is n_estimators; its description in the sklearn docs is as follows:
The number of base estimators in the ensemble.
I have tried to make sense of the relevant Sklearn documentation, as well as material on Google and YouTube, to understand this term, but with no luck. Can someone explain what it means in the context of IF?
tl;dr: it is a special kind of decision tree, called Isolation Tree (iTree) in the original paper:
We show in this paper that a tree structure can be constructed effectively to isolate every single instance. [...] This isolation characteristic of tree forms the basis of our method to detect anomalies, and we call this tree Isolation Tree or iTree.
The proposed method, called Isolation Forest or iForest, builds an ensemble of iTrees for a given data set [...]
All ensemble methods (to which Isolation Forest belongs) consist of base estimators (i.e. they are exactly ensembles of base estimators); from the sklearn guide:
The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.
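You can see this relationship directly in scikit-learn: a fitted IsolationForest exposes its base estimators through the estimators_ attribute, each of which is a single tree (in scikit-learn's implementation, an ExtraTreeRegressor grown with random splits, playing the role of the iTree). A minimal sketch, with random placeholder data:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(200, 2)  # placeholder data, for demonstration only

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)

print(len(iso.estimators_))      # 100 -- one base estimator per n_estimators
print(type(iso.estimators_[0]))  # a single tree (ExtraTreeRegressor in scikit-learn)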
In Random Forest, for example (arguably the inspiration behind the name Isolation Forest), this base estimator is a plain decision tree:
n_estimators : int, default=100
The number of trees in the forest.
Similarly for algorithms like Gradient Boosting Trees (despite the scikit-learn docs referring to them as "boosting stages", they are decision trees nevertheless) and Extra Trees.
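The same check works here too; for instance, a fitted Random Forest stores its base estimators in estimators_, and each one is a plain decision tree. A short sketch on a toy dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

print(len(rf.estimators_))      # 10 -- the number of trees in the forest
print(type(rf.estimators_[0]))  # DecisionTreeClassifier -- the fixed base estimator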
In all these algorithms the base estimator is fixed (although its specific parameters can vary with the settings in the ensemble's parameters). There is another class of ensemble methods where the exact model used as the base estimator can also be set, through a corresponding base_estimator parameter; for example, here is the Bagging Classifier:
base_estimator : object, default=None
The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.
and AdaBoost:
base_estimator : object, default=None
The base estimator from which the boosted ensemble is built. [...] If None, then the base estimator is DecisionTreeClassifier(max_depth=1).
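For illustration, here is a sketch of passing a base estimator explicitly to both classes (note: in recent scikit-learn releases the base_estimator parameter has been renamed to estimator, so the exact keyword depends on your version):

from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Bagging with an explicitly chosen base estimator
# (with base_estimator=None, a decision tree would be used):
bag = BaggingClassifier(base_estimator=LogisticRegression(max_iter=1000), n_estimators=25)

# AdaBoost with its default made explicit: a decision stump (a depth-1 tree):
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50)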
Historically, the first ensembles were built using various versions of decision trees, and arguably it is still decision trees (or variants, such as iTrees) that are almost exclusively used for such ensembles today; quoting from another answer of mine:
Adaboost (and similar ensemble methods) were conceived using decision trees as base classifiers (more specifically, decision stumps, i.e. DTs with a depth of only 1); there is good reason why still today, if you don't specify explicitly the base_classifier argument, it assumes a value of DecisionTreeClassifier(max_depth=1). DTs are suitable for such ensembling because they are essentially unstable classifiers, which is not the case with SVMs, hence the latter are not expected to offer much when used as base classifiers.