Set thresholds in PySpark multinomial logistic regression
I want to perform a multinomial logistic regression, but I can't set the threshold and thresholds parameters correctly. Consider the following DF:
from pyspark.ml.linalg import DenseVector

test_train_df = (
    sqlc
    .createDataFrame([(0, DenseVector([-1.0, 1.2, 0.7])),
                      (0, DenseVector([3.1, -2.0, -2.9])),
                      (1, DenseVector([1.0, 0.8, 0.3])),
                      (1, DenseVector([4.2, 1.4, -1.7])),
                      (0, DenseVector([-1.9, 2.5, -2.3])),
                      (2, DenseVector([2.6, -0.2, 0.2])),
                      (1, DenseVector([0.3, -3.4, 1.8])),
                      (2, DenseVector([-1.0, -3.5, 4.7]))],
                     ['label', 'features'])
)
My label has 3 classes, so I have to set thresholds (plural, whose default is None) rather than threshold (singular, whose default is 0.5). I then write:
from pyspark.ml import classification as cl

test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
)
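Incidentally, the defaults mentioned above can be checked with the generic explainParam method (a quick hedged sketch; the exact wording of the output may vary across Spark versions):

lr = cl.LogisticRegression()
print(lr.explainParam('threshold'))   # expected to mention 'default: 0.5'
print(lr.explainParam('thresholds'))  # expected to show it is undefined by default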
Then I want to fit the model on my DF:
test_logit = test_logit_abst.fit(test_train_df)
But executing this last command raises an error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
Py4JJavaError: An error occurred while calling o3769.fit.
: java.lang.IllegalArgumentException: requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.
During handling of the above exception, another exception occurred:
IllegalArgumentException Traceback (most recent call last)
<ipython-input-211-8f3443f41b6b> in <module>()
----> 1 test_logit = test_logit_abst.fit(test_train_df)
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
62 return self.copy(params)._fit(dataset)
63 else:
---> 64 return self._fit(dataset)
65 else:
66 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
263
264 def _fit(self, dataset):
--> 265 java_model = self._fit_java(dataset)
266 return self._create_model(java_model)
267
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
260 """
261 self._transfer_params_to_java()
--> 262 return self._java_obj.fit(dataset._jdf)
263
264 def _fit(self, dataset):
~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: 'requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.'
The error says that threshold is set. This looks strange, since the documentation says that setting thresholds (plural) clears threshold (singular), so the 0.5 value should have been removed.
So, how can threshold be cleared, given that no clearThreshold() exists?
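(A hedged aside: PySpark estimators inherit a generic clear() method from pyspark.ml.param.Params, which unsets a param entirely; an untested sketch of how it might apply here:)

test_logit_abst.clear(test_logit_abst.threshold)  # unset threshold on the Python side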
In an attempt to achieve this, I tried to clear threshold this way:
test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
    .setThreshold(None)
)
This time the fit command works, and I even obtain the model intercept and coefficients:
test_logit.interceptVector
DenseVector([65.6445, 31.6369, -97.2814])
test_logit.coefficientMatrix
DenseMatrix(3, 3, [-76.4534, -19.4797, -79.4949, 12.3659, 4.642, 4.1057, 64.0876, 14.8377, 75.3892], 1)
But if I try to get the thresholds (plural) back from test_logit_abst, I get an error:
test_logit_abst.getThresholds()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-214-fc1c8617ce80> in <module>()
----> 1 test_logit_abst.getThresholds()
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/classification.py in getThresholds(self)
363 if not self.isSet(self.thresholds) and self.isSet(self.threshold):
364 t = self.getOrDefault(self.threshold)
--> 365 return [1.0-t, t]
366 else:
367 return self.getOrDefault(self.thresholds)
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
What does this mean?
As a further detail, and strangely (and to me incomprehensibly), reversing the order of the parameter settings produces the first error posted above:
test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThreshold(None)
    .setThresholds([.5, .5, .5])
)
Why does changing the order of the "set" instructions also change the output?
What a messy situation...
The short answer is:

- setThresholds (plural) not clearing threshold (singular) seems to be a bug
- for multinomial classification (i.e. number of classes > 2), setThresholds does not do what you would expect (and arguably you don't need it)
- if all you need is some "thresholds" at their "default" value of 0.5, you don't have a problem - simply don't use any relevant argument or setThresholds statement
- if you really need to apply different decision thresholds to different classes in multinomial classification, you will have to do it manually, by post-processing the respective probabilities, i.e. the probability column in the transformed dataframe (it does work OK though with setThreshold(s) for binary classification); a sketch of this appears near the end of the long answer below
And now for the long answer...
Let's start with binary classification, adapting the toy data from the docs:
spark.version
# u'2.2.0'
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
bdf = sc.parallelize([
    Row(label=1.0, features=Vectors.dense(0.0, 5.0)),
    Row(label=0.0, features=Vectors.dense(1.0, 2.0)),
    Row(label=1.0, features=Vectors.dense(2.0, 1.0)),
    Row(label=0.0, features=Vectors.dense(3.0, 3.0))]).toDF()

blor = LogisticRegression(threshold=0.7, thresholds=[0.3, 0.7])
We don't need to set thresholds (plural) here - threshold=0.7 is enough, but it will be useful for illustrating the differences with setThreshold below.
blorModel = blor.fit(bdf) # works OK
blor.getThreshold()
# 0.7
blor.getThresholds()
# [0.3, 0.7]
blorModel.transform(bdf).show(truncate=False) # transform the training data
Here is the result:
+---------+-----+------------------------------------------+----------------------------------------+----------+
|features |label|rawPrediction |probability |prediction|
+---------+-----+------------------------------------------+----------------------------------------+----------+
|[0.0,5.0]|1.0 |[-1.138455151184087,1.138455151184087] |[0.242604109995602,0.757395890004398] |1.0 |
|[1.0,2.0]|0.0 |[-0.6056346859838877,0.6056346859838877] |[0.35305562698104337,0.6469443730189567]|0.0 |
|[2.0,1.0]|1.0 |[0.26586039040308496,-0.26586039040308496]|[0.5660763559614698,0.4339236440385302] |0.0 |
|[3.0,3.0]|0.0 |[1.6453673835702176,-1.6453673835702176] |[0.8382639556951765,0.16173604430482344]|0.0 |
+---------+-----+------------------------------------------+----------------------------------------+----------+
What is the meaning of thresholds=[0.3, 0.7]? The answer lies in the 2nd row, which is predicted as 0.0, despite the fact that the probability of 1.0 is higher (0.65): 0.65 is indeed higher than 0.35, but it is lower than the threshold we have set for this class (0.7), hence it is not classified as such.
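To make the decision rule concrete: the Spark docs describe the thresholds semantics as predicting the class with the largest p/t, where p is that class's original probability and t is its threshold. A quick sanity check on row 2, with the probabilities above rounded (a sketch):

probs = [0.3531, 0.6469]   # row 2 probabilities, rounded
ts = [0.3, 0.7]            # the thresholds we set
[p / t for p, t in zip(probs, ts)]
# [1.177, 0.9241] -> argmax is class 0, matching the prediction shown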
Let's now try the seemingly identical operation, but with setThreshold(s) instead:
blor2 = (LogisticRegression()
         .setThreshold(0.7)
         .setThresholds([0.3, 0.7]) )  # constructing the estimator works OK
blorModel2 = blor2.fit(bdf)           # but fitting it fails:
[...]
IllegalArgumentException: u'requirement failed: Logistic Regression getThreshold found inconsistent values for threshold (0.5) and thresholds (equivalent to 0.7)'
Nice, eh? setThresholds (plural) seems indeed to have cleared our value of threshold (0.7) set in the previous line, as claimed in the docs, but apparently it did so only to restore it to its default value of 0.5...
Omitting .setThreshold(0.7) gives the first error you have reported yourself (not shown).
Reversing the order of the parameter settings resolves the issue (!!!) and, moreover, renders both getThreshold (singular) and getThresholds (plural) operational (in contrast with your case):
blor2 = (LogisticRegression()
         .setThresholds([0.3, 0.7])
         .setThreshold(0.7) )
blorModel2 = blor2.fit(bdf) # works OK
blor2.getThreshold()
# 0.7
blor2.getThresholds()
# [0.30000000000000004, 0.7]
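(A hedged observation: the 0.30000000000000004 above suggests that getThresholds is reconstructing the pair as [1 - t, t] from threshold, rather than echoing the exact list we set, since that is precisely what binary floating point gives for this subtraction:)

1 - 0.7
# 0.30000000000000004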
Let's move now to the multinomial case; we'll stick again to the example in the docs, with data from the Spark Github repo (they should also be available locally, in your $SPARK_HOME/data/mllib/sample_multiclass_classification_data.txt, but I am working on a Databricks notebook); it is a 3-class case, with labels in {0.0, 1.0, 2.0}.
data_path = "/FileStore/tables/sample_multiclass_classification_data.txt"
mdf = spark.read.format("libsvm").load(data_path)
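(If you are running locally rather than on Databricks, loading the bundled sample file mentioned above might look like this; a sketch assuming SPARK_HOME is set in your environment:)

import os
data_path = os.path.join(os.environ['SPARK_HOME'],
                         'data', 'mllib', 'sample_multiclass_classification_data.txt')
mdf = spark.read.format("libsvm").load(data_path)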
Similarly to the binary case above, where the elements of our thresholds (plural) sum to 1, let's ask for a threshold of 0.8 for class 2:
mlor = (LogisticRegression()
        .setFamily("multinomial")
        .setThresholds([0, 0.2, 0.8])
        .setThreshold(0.8) )
mlorModel = mlor.fit(mdf)  # works OK
mlor.getThreshold()
# 0.8
mlor.getThresholds()
# [0.19999999999999996, 0.8]
Looks good, but let's ask for a prediction on the (training) dataset:
mlorModel.transform(mdf).show(truncate=False)
I have isolated only one row - it should be the 2nd from the end of the full output:
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
|label|features |rawPrediction |probability |prediction|
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
[...]
|0.0 |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667]) |[36.67790353804905,-74.71196613173531,38.034062593686244]|[0.20486526556822454,8.619113376801409E-50,0.7951347344317755] |2.0 |
[...]
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
Scrolling to the right, you'll see that despite the fact that the probability for class 2.0 here is lower than the threshold we have set (0.8), the row is indeed predicted as 2.0 - in contrast with the binary case demonstrated above...
So, what to do? Simply remove all the threshold-related statements; you don't need them - even setFamily is unnecessary, as the algorithm will detect by itself that you have more than 2 classes. The following will give identical results with the above:
mlor = LogisticRegression() # works OK - no family, no threshold(s)
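And if you genuinely need different decision thresholds per class in the multinomial case, here is a hedged sketch of the manual post-processing route from the short answer: it applies the rule Spark documents for thresholds (predict the class maximizing p/t) to the probability column; the threshold values are illustrative, and my_prediction is a column name introduced here for the example:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

my_thresholds = [0.5, 0.5, 0.8]  # hypothetical per-class thresholds

def thresholded_prediction(probability):
    # scale each class probability by its threshold and take the argmax
    probs = probability.toArray().tolist()
    scaled = [p / t for p, t in zip(probs, my_thresholds)]
    return float(scaled.index(max(scaled)))

predict_udf = udf(thresholded_prediction, DoubleType())

(mlorModel.transform(mdf)
 .withColumn('my_prediction', predict_udf('probability'))
 .select('probability', 'prediction', 'my_prediction')
 .show(5, truncate=False))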
To summarize:

- In both the binary and the multinomial case, what the algorithm actually returns is a vector of probabilities of length equal to the number of classes, with its elements summing to 1.
- In the binary case only, Spark allows you to go one step further and, instead of naively selecting the class with the highest probability as the prediction, apply a user-defined threshold instead; this setting might be useful, e.g., in cases with imbalanced data.
- This threshold(s) setting has actually no effect in the multinomial case, where Spark will always return as prediction the class with the highest probability.
Despite the mess in the documentation (about which I have argued elsewhere) and the possibility of some bugs, let me say about (3) that this design choice is not unjustifiable; as it has been nicely argued elsewhere (emphasis in the original):
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
Although the above argument was made for the binary case, it fully holds for the multinomial one as well...