Which accuracy score to use for the Mean Decrease Accuracy with the scikit RandomForestClassifier

Question

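I have implemented the "mean decrease accuracy" measure as shown on this website:

In the example, the author uses a random forest regressor, RandomForestRegressor, but I am using a random forest classifier, RandomForestClassifier. So my question is: should I also use r2_score to measure accuracy, or should I switch to classical accuracy (accuracy_score) or the Matthews correlation coefficient (matthews_corrcoef)?

Does anyone here know whether I should switch or not, and why?

Thanks for your help!

Here is the code from the website, in case you are too lazy to click :)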

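import numpy as np
from collections import defaultdict
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import ShuffleSplit

# `boston` (the Boston housing dataset) and `names` (its feature names) are
# assumed to have been loaded earlier, as on the original website.
X = boston["data"]
Y = boston["target"]

rf = RandomForestRegressor()
scores = defaultdict(list)

# Cross-validate the scores on a number of different random splits of the data
for train_idx, test_idx in ShuffleSplit(n_splits=100, test_size=0.3).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    rf.fit(X_train, Y_train)
    acc = r2_score(Y_test, rf.predict(X_test))
    for i in range(X.shape[1]):
        # Shuffle one feature column to break its link with the target
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])
        shuff_acc = r2_score(Y_test, rf.predict(X_t))
        # Relative drop in score caused by permuting feature i
        scores[names[i]].append((acc - shuff_acc) / acc)

print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat)
              for feat, score in scores.items()], reverse=True))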

Answer 1

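r2_score is for regression (a continuous response variable), whereas classical classification metrics (for a discrete, categorical response variable) such as accuracy_score, f1_score, and roc_auc_score are the right choice for your task. The last two are the most appropriate if your y labels are imbalanced.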
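To illustrate why plain accuracy can mislead on imbalanced labels, here is a small sketch on made-up data. The 90/10 label split and the always-majority prediction are assumptions chosen for illustration, not something taken from the question.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Made-up imbalanced labels: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)
# A useless classifier that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)             # looks good despite a useless model
f1 = f1_score(y_true, y_pred, zero_division=0)   # no positives are ever predicted
mcc = matthews_corrcoef(y_true, y_pred)          # no correlation with the truth
print(acc, f1, mcc)
```

Accuracy stays high here only because the majority class dominates, while f1_score and matthews_corrcoef both fall to zero.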

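Randomly shuffling each feature in the input data matrix and measuring the drop in these classification metrics sounds like a valid way to rank feature importance.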
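The shuffling procedure could be adapted to a classifier roughly as follows. This is a minimal sketch, not the website author's code. The helper name `mean_decrease_score`, the bundled breast-cancer dataset, the forest size, and the choice of `matthews_corrcoef` are all assumptions made for illustration. Any metric with a `(y_true, y_pred)` signature, such as `accuracy_score` or `f1_score`, can be passed instead.

```python
from collections import defaultdict

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import StratifiedShuffleSplit


def mean_decrease_score(X, y, names, score_fn, n_splits=5, seed=0):
    """Hypothetical helper: mean relative drop in `score_fn` per shuffled feature."""
    rng = np.random.RandomState(seed)
    # Stratified splits keep the class proportions similar in every test set
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.3,
                                      random_state=seed)
    drops = defaultdict(list)
    for train_idx, test_idx in splitter.split(X, y):
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        X_test, y_test = X[test_idx], y[test_idx]
        # Baseline score on the untouched test set (assumed to be > 0)
        base = score_fn(y_test, clf.predict(X_test))
        for i in range(X.shape[1]):
            X_t = X_test.copy()
            # Permute one column to break its link with the labels
            X_t[:, i] = rng.permutation(X_t[:, i])
            shuffled = score_fn(y_test, clf.predict(X_t))
            drops[str(names[i])].append((base - shuffled) / base)
    return {feat: float(np.mean(vals)) for feat, vals in drops.items()}


data = load_breast_cancer()
importances = mean_decrease_score(data.data, data.target,
                                  data.feature_names, matthews_corrcoef)
# Show the five features whose shuffling hurts the score the most
for feat, drop in sorted(importances.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{drop:.4f}  {feat}")
```

For comparison, scikit-learn 0.22 and later ship `sklearn.inspection.permutation_importance`, which implements the same idea and accepts a `scoring` argument for choosing the metric.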

Tags: python, statistics, classification, machine-learning, scikit-learn