Using RandomForestClassifier.decision_path, how do I tell which samples the classifier used to make a decision?

Question

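I am using a RandomForestClassifier to classify samples with a binary outcome ("does not have the thing" vs. "has the thing"). From the result of RandomForestClassifier.decision_path, how do I determine which samples contributed to the classification decision?

The documentation says: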

Returns:

indicator : sparse csr array, shape = [n_samples, n_nodes]

Return a node indicator matrix where non zero elements indicates that the samples goes through the nodes.

n_nodes_ptr : array of size (n_estimators + 1, )

The columns from indicator[n_nodes_ptr[i]:n_nodes_ptr[i+1]] gives the indicator value for the i-th estimator.

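Unfortunately, these terms are opaque to me. indicator[x:y] on a matrix of dimension [n_samples, n_nodes] looks like an error (shouldn't it be indicator[sample, n_nodes_ptr[i]:n_nodes_ptr[i+1]]?), but even so, I'm not sure how to take a "node indicator" and find which feature the node refers to. I can find examples of using decision_path with DecisionTreeClassifier, but not with RandomForestClassifier.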

Answer 1

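Understanding the output of RandomForestClassifier.decision_path is easier once you realize that the sklearn convention is to pack as much as possible into numpy matrices.

The first return value of decision_path is the horizontal concatenation of each decision tree's decision_path, and the second return value tells you the boundaries of each sub-matrix. So calling decision_path on a RandomForestClassifier is equivalent to calling decision_path on each member of RandomForestClassifier.estimators_. For a single-row sample, you can traverse the results like this: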

# data_row must be 2-D with a single row, e.g. X[i:i+1]
indicators, index_by_tree = classifier.decision_path(data_row)
indices = zip(index_by_tree, index_by_tree[1:])
for tree_classifier, (begin, end) in zip(classifier.estimators_, indices):
    tree = tree_classifier.tree_
    # Node ids (local to this tree) that the sample passed through
    node_indices = indicators[0, begin:end].indices
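
To make the "horizontal concatenation" claim concrete, here is a minimal sketch that compares each column slice of the forest-wide matrix against the matrix the corresponding tree returns on its own. The synthetic data from make_classification, the forest size, and the seeds are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data; sizes and seeds are arbitrary illustration values.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
classifier = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

samples = X[:3]
indicators, index_by_tree = classifier.decision_path(samples)

slices_match = []
for i, tree_classifier in enumerate(classifier.estimators_):
    begin, end = index_by_tree[i], index_by_tree[i + 1]
    # Columns begin..end of the forest-wide matrix belong to tree i.
    forest_block = indicators[:, begin:end].toarray()
    # The same tree, asked directly for its own decision path.
    tree_block = tree_classifier.decision_path(samples).toarray()
    slices_match.append(np.array_equal(forest_block, tree_block))

print(all(slices_match))
```

Note that the slice is taken over columns (indicators[:, begin:end]), which is the reading of the documentation that the question suspected.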

Rather than treating each node as a separate object, the tree instance has the following attributes:

feature
value
children_left
children_right

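Each of these is an array or matrix recording a property of the tree nodes, indexed by node number. For example, tree.feature[3] tells you which feature node 3 tests (leaf nodes test nothing and hold a negative value there). tree.value is a 3-dimensional array: the first dimension is the node number, the second indexes the outputs (so it has a single element in a single-output problem like mine), and the last holds the per-class values for that node. Split thresholds are stored separately, in tree.threshold. tree.children_left[5] gives the node number of node 5's left child and, as you would guess, tree.children_right[6] gives the node number of node 6's right child.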
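
As a rough illustration of these parallel arrays, the following sketch fits one small tree and prints what each node records. The dataset, max_depth=3, and the seed are assumptions made for the example; tree.threshold and tree.node_count are further attributes of the same tree object.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data; sizes and seeds are arbitrary illustration values.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y).tree_

# value has shape (node_count, n_outputs, n_classes).
print(tree.value.shape)

for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    if left == -1 and right == -1:
        # Leaves have no children (-1) and a negative entry in tree.feature.
        print(f"node {node}: leaf, class values {tree.value[node, 0]}")
    else:
        print(f"node {node}: tests feature {tree.feature[node]} "
              f"against threshold {tree.threshold[node]:.3f}, "
              f"children {left} and {right}")
```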

Beyond these arrays, the result of DecisionTreeClassifier.decision_path is array-like as well: within a sample's row, entry N is non-zero if node #N was visited during the decision process.
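
A small sketch of that per-tree indicator row, under the same assumed synthetic data: it lists the nodes one sample visits in a single tree and checks that the leaf reported by apply() is among them.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data; sizes and seeds are arbitrary illustration values.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
tree_classifier = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

one_sample = X[:1]  # keep it 2-D: shape (1, n_features)
path = tree_classifier.decision_path(one_sample)

# With a single row, the stored column indices are exactly the visited nodes.
visited = path.indices
leaf = tree_classifier.apply(one_sample)[0]

print("visited nodes:", list(visited))
print("leaf reached:", leaf)
```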

To get back the features that were tested, you can do this:

features = set()  # feature indices tested along the path
for index in node_indices:
    feature = tree.feature[index]
    if feature >= 0:  # leaf nodes carry a negative marker, skip them
        features.add(feature)

Note that this tells you which features were tested, not their values or how they affected the outcome.
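
Putting the pieces together, a self-contained sketch might look like the following. The data, forest size, and seeds are assumptions for illustration only, and the result is the union of feature indices tested for one sample across all trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data; sizes and seeds are arbitrary illustration values.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
classifier = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

data_row = X[:1]  # a single sample, kept 2-D
indicators, index_by_tree = classifier.decision_path(data_row)

features = set()
bounds = zip(index_by_tree, index_by_tree[1:])
for tree_classifier, (begin, end) in zip(classifier.estimators_, bounds):
    tree = tree_classifier.tree_
    # Node ids local to this tree that the sample passed through.
    node_indices = indicators[0, begin:end].indices
    for index in node_indices:
        feature = tree.feature[index]
        if feature >= 0:  # skip leaves, which test no feature
            features.add(int(feature))

print(sorted(features))
```

The set answers "which features did any tree consult for this sample"; it says nothing about the original training samples, which decision_path does not expose.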
