Random Forest Classifier: which class do the probabilities correspond to?

Question

I am using the RandomForestClassifier from pyspark.ml.classification.

I ran the model on a binary-class dataset and displayed the probabilities.

The probability column contains the following:

+-----+----------+---------------------------------------+
|label|prediction|probability                            |
+-----+----------+---------------------------------------+
|0.0  |0.0       |[0.9005918461098429,0.0994081538901571]|
|1.0  |1.0       |[0.6051335859900139,0.3948664140099861]|
+-----+----------+---------------------------------------+

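Each row holds a list of 2 elements, which evidently correspond to the probabilities of the predicted classes.

My question: does probability[0] always correspond to the predicted value? The Spark documentation is not clear on this.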

Answer 1

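I'm interpreting your question as asking: does the first element of the array in the 'probability' column always correspond to the "predicted class", meaning the label that the random forest classifier predicts the observation should have?

If I have that right, the answer is yes.

The items in the array in each probability row can be read as the model telling you: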

['My confidence that the predicted label = the true label', 'My confidence that the label != the true label']

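When predicting multiple labels, the model is telling you:

['My confidence that the label I predict = specific label 1', 'My confidence that the label I predict = specific label 2', ..., 'My confidence that the label I predict = specific label N']

The array is indexed by the N labels you are trying to predict (which means you have to pay attention to how your labels are structured).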
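To make the index-based reading above concrete, here is a minimal pure-Python sketch (no Spark needed). The three-class vector and the names in class_names are made up for illustration; the only assumption is the one stated above, that position i of the vector belongs to label index i.

```python
# Hypothetical probability vector for a 3-label problem (values are made up)
probability = [0.2, 0.7, 0.1]

# Hypothetical names for label indices 0, 1 and 2, in index order
class_names = ["label_0", "label_1", "label_2"]

# Pair each position of the vector with the label at the same index
per_class = dict(zip(class_names, probability))

# Position holding the largest probability
best_index = max(range(len(probability)), key=probability.__getitem__)

for name, p in per_class.items():
    print(f"{name}: {p}")
print("most likely index:", best_index)
```

Reading the vector this way only works if you know which original label was assigned to which index, which is why the structure of the labels matters.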

Perhaps looking at this answer will help. You can do something like this:

# Fit the pipeline on the training data, then score the test data
model = pipeline.fit(training_data)
predictions = model.transform(test_data)
predictions.show(10)

(Use the relevant pipeline and data from your example.)

This will show you the probabilities for each class.

Answer 2

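I posted almost the same question, and I think the answer may help you: Scala: how to know which probability correspond to which class?

The answer lies in what happens before the model is fit.

To fit the model, we apply a labelIndexer to the target. This label indexer converts the target into an index, ordered by descending frequency.

For example: if my target contains 20% "aa" and 80% "bb", the label indexer creates a "label" column that takes the value 0 for "bb" and 1 for "aa" (because "bb" is more frequent than "aa").

When we fit the random forest, the probabilities follow this frequency order.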

Binary classification:

first proba = probability of the class that is most frequent in the training set
second proba = probability of the class that is less frequent in the training set
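
The ordering described above can be sketched in plain Python without Spark. frequency_desc_index is a hypothetical helper, not a Spark API; it imitates frequency-descending label indexing, with ties broken alphabetically as an assumption of this sketch. The 80/20 split mirrors the "aa"/"bb" example, and the probability vector is the first row from the question.

```python
from collections import Counter


def frequency_desc_index(labels):
    """Map each distinct label to an index, most frequent label first.

    Hypothetical helper that imitates frequency-descending label indexing.
    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: idx for idx, lab in enumerate(ordered)}


# Hypothetical training target: 80% "bb", 20% "aa"
target = ["bb"] * 8 + ["aa"] * 2
label_to_index = frequency_desc_index(target)
index_to_label = {idx: lab for lab, idx in label_to_index.items()}

# One probability vector, as it appears in the probability column
probability = [0.9005918461098429, 0.0994081538901571]

# Attach the original label to each position of the vector
named = {index_to_label[i]: p for i, p in enumerate(probability)}

print(label_to_index)
print(named)
```

With this mapping in hand, the first probability can be read as belonging to "bb" and the second to "aa", instead of relying on position alone.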

machine-learning

random-forest

apache-spark

pyspark

data-science