Set thresholds in PySpark multinomial logistic regression
I want to perform a multinomial logistic regression, but I can't set the threshold and thresholds parameters correctly. Consider the following DF:
from pyspark.ml.linalg import DenseVector

test_train_df = (
    sqlc
    .createDataFrame([(0, DenseVector([-1.0, 1.2, 0.7])),
                      (0, DenseVector([3.1, -2.0, -2.9])),
                      (1, DenseVector([1.0, 0.8, 0.3])),
                      (1, DenseVector([4.2, 1.4, -1.7])),
                      (0, DenseVector([-1.9, 2.5, -2.3])),
                      (2, DenseVector([2.6, -0.2, 0.2])),
                      (1, DenseVector([0.3, -3.4, 1.8])),
                      (2, DenseVector([-1.0, -3.5, 4.7]))],
                     ['label', 'features'])
)
My label has 3 classes, so I have to set thresholds (plural, whose default is None) rather than threshold (singular, whose default is 0.5). I then write:
from pyspark.ml import classification as cl

test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
)
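Incidentally, the defaults mentioned above can be checked with the generic explainParam method (a quick hedged sketch; the exact wording of the output may vary across Spark versions):

lr = cl.LogisticRegression()
print(lr.explainParam('threshold'))   # expected to mention 'default: 0.5'
print(lr.explainParam('thresholds'))  # expected to show it is undefined by default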
Then I want to fit the model on my DF:
test_logit = test_logit_abst.fit(test_train_df)
But executing this last command raises an error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
Py4JJavaError: An error occurred while calling o3769.fit.
: java.lang.IllegalArgumentException: requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.
During handling of the above exception, another exception occurred:
IllegalArgumentException Traceback (most recent call last)
<ipython-input-211-8f3443f41b6b> in <module>()
----> 1 test_logit = test_logit_abst.fit(test_train_df)
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
62 return self.copy(params)._fit(dataset)
63 else:
---> 64 return self._fit(dataset)
65 else:
66 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
263
264 def _fit(self, dataset):
--> 265 java_model = self._fit_java(dataset)
266 return self._create_model(java_model)
267
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
260 """
261 self._transfer_params_to_java()
--> 262 return self._java_obj.fit(dataset._jdf)
263
264 def _fit(self, dataset):
~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: 'requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.'
The error says that threshold is set. This looks strange, since the documentation says that setting thresholds (plural) clears threshold (singular), so the 0.5 value should have been removed.
So, how can threshold be cleared, given that no clearThreshold() exists?
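(A hedged aside: PySpark estimators inherit a generic clear() method from pyspark.ml.param.Params, which unsets a param entirely; an untested sketch of how it might apply here:)

test_logit_abst.clear(test_logit_abst.threshold)  # unset threshold on the Python side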
In an attempt to achieve this, I tried to clear threshold this way:
test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
    .setThreshold(None)
)
This time the fit command works, and I even obtain the model intercept and coefficients:
test_logit.interceptVector
DenseVector([65.6445, 31.6369, -97.2814])
test_logit.coefficientMatrix
DenseMatrix(3, 3, [-76.4534, -19.4797, -79.4949, 12.3659, 4.642, 4.1057, 64.0876, 14.8377, 75.3892], 1)
But if I try to get the thresholds (plural) back from test_logit_abst, I get an error:
test_logit_abst.getThresholds()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-214-fc1c8617ce80> in <module>()
----> 1 test_logit_abst.getThresholds()
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/classification.py in getThresholds(self)
363 if not self.isSet(self.thresholds) and self.isSet(self.threshold):
364 t = self.getOrDefault(self.threshold)
--> 365 return [1.0-t, t]
366 else:
367 return self.getOrDefault(self.thresholds)
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
What does this mean?
As a further detail, and strangely (and to me incomprehensibly), reversing the order of the parameter settings produces the first error posted above:
test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThreshold(None)
    .setThresholds([.5, .5, .5])
)
Why does changing the order of the "set" instructions also change the output?
What a messy situation...
The short answer is:

- setThresholds (plural) not clearing threshold (singular) seems to be a bug
- for multinomial classification (i.e. number of classes > 2), setThresholds does not do what you would expect (and arguably you don't need it)
- if all you need is some "thresholds" at their "default" value of 0.5, you don't have a problem - simply don't use any relevant argument or setThresholds statement
- if you really need to apply different decision thresholds to different classes in multinomial classification, you will have to do it manually, by post-processing the respective probabilities, i.e. the probability column in the transformed dataframe (it does work OK though with setThreshold(s) for binary classification); a sketch of this appears near the end of the long answer below
And now for the long answer...
Let's start with binary classification, adapting the toy data from the docs:
spark.version
# u'2.2.0'
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
bdf = sc.parallelize([
    Row(label=1.0, features=Vectors.dense(0.0, 5.0)),
    Row(label=0.0, features=Vectors.dense(1.0, 2.0)),
    Row(label=1.0, features=Vectors.dense(2.0, 1.0)),
    Row(label=0.0, features=Vectors.dense(3.0, 3.0))]).toDF()

blor = LogisticRegression(threshold=0.7, thresholds=[0.3, 0.7])
We don't need to set thresholds (plural) here - threshold=0.7 is enough, but it will be useful for illustrating the differences with setThreshold below.
blorModel = blor.fit(bdf) # works OK
blor.getThreshold()
# 0.7
blor.getThresholds()
# [0.3, 0.7]
blorModel.transform(bdf).show(truncate=False) # transform the training data
Here is the result:
+---------+-----+------------------------------------------+----------------------------------------+----------+
|features |label|rawPrediction |probability |prediction|
+---------+-----+------------------------------------------+----------------------------------------+----------+
|[0.0,5.0]|1.0 |[-1.138455151184087,1.138455151184087] |[0.242604109995602,0.757395890004398] |1.0 |
|[1.0,2.0]|0.0 |[-0.6056346859838877,0.6056346859838877] |[0.35305562698104337,0.6469443730189567]|0.0 |
|[2.0,1.0]|1.0 |[0.26586039040308496,-0.26586039040308496]|[0.5660763559614698,0.4339236440385302] |0.0 |
|[3.0,3.0]|0.0 |[1.6453673835702176,-1.6453673835702176] |[0.8382639556951765,0.16173604430482344]|0.0 |
+---------+-----+------------------------------------------+----------------------------------------+----------+
What is the meaning of thresholds=[0.3, 0.7]? The answer lies in the 2nd row, which is predicted as 0.0, despite the fact that the probability of 1.0 is higher (0.65): 0.65 is indeed higher than 0.35, but it is lower than the threshold we have set for this class (0.7), hence it is not classified as such.
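To make the decision rule concrete: the Spark docs describe the thresholds semantics as predicting the class with the largest p/t, where p is that class's original probability and t is its threshold. A quick sanity check on row 2, with the probabilities above rounded (a sketch):

probs = [0.3531, 0.6469]   # row 2 probabilities, rounded
ts = [0.3, 0.7]            # the thresholds we set
[p / t for p, t in zip(probs, ts)]
# [1.177, 0.9241] -> argmax is class 0, matching the prediction shown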
Let's now try the seemingly identical operation, but with setThreshold(s) instead:
blor2 = (LogisticRegression()
         .setThreshold(0.7)
         .setThresholds([0.3, 0.7]) )  # constructing the estimator works OK
blorModel2 = blor2.fit(bdf)           # but fitting it fails:
[...]
IllegalArgumentException: u'requirement failed: Logistic Regression getThreshold found inconsistent values for threshold (0.5) and thresholds (equivalent to 0.7)'
Nice, eh? setThresholds (plural) seems indeed to have cleared our value of threshold (0.7) set in the previous line, as claimed in the docs, but apparently it did so only to restore it to its default value of 0.5...
Omitting .setThreshold(0.7) gives the first error you have reported yourself (not shown).
Reversing the order of the parameter settings resolves the issue (!!!) and, moreover, renders both getThreshold (singular) and getThresholds (plural) operational (in contrast with your case):
blor2 = (LogisticRegression()
         .setThresholds([0.3, 0.7])
         .setThreshold(0.7) )
blorModel2 = blor2.fit(bdf) # works OK
blor2.getThreshold()
# 0.7
blor2.getThresholds()
# [0.30000000000000004, 0.7]
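(A hedged observation: the 0.30000000000000004 above suggests that getThresholds is reconstructing the pair as [1 - t, t] from threshold, rather than echoing the exact list we set, since that is precisely what binary floating point gives for this subtraction:)

1 - 0.7
# 0.30000000000000004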
Let's move now to the multinomial case; we'll stick again to the example in the docs, with data from the Spark Github repo (they should also be available locally, in your $SPARK_HOME/data/mllib/sample_multiclass_classification_data.txt, but I am working on a Databricks notebook); it is a 3-class case, with labels in {0.0, 1.0, 2.0}.
data_path = "/FileStore/tables/sample_multiclass_classification_data.txt"
mdf = spark.read.format("libsvm").load(data_path)
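(If you are running locally rather than on Databricks, loading the bundled sample file mentioned above might look like this; a sketch assuming SPARK_HOME is set in your environment:)

import os
data_path = os.path.join(os.environ['SPARK_HOME'],
                         'data', 'mllib', 'sample_multiclass_classification_data.txt')
mdf = spark.read.format("libsvm").load(data_path)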
Similarly to the binary case above, where the elements of our thresholds (plural) sum to 1, let's ask for a threshold of 0.8 for class 2:
mlor = (LogisticRegression()
        .setFamily("multinomial")
        .setThresholds([0, 0.2, 0.8])
        .setThreshold(0.8) )
mlorModel = mlor.fit(mdf)  # works OK
mlor.getThreshold()
# 0.8
mlor.getThresholds()
# [0.19999999999999996, 0.8]
Looks good, but let's ask for a prediction on the (training) dataset:
mlorModel.transform(mdf).show(truncate=False)
I have isolated only one row - it should be the 2nd from the end of the full output:
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
|label|features |rawPrediction |probability |prediction|
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
[...]
|0.0 |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667]) |[36.67790353804905,-74.71196613173531,38.034062593686244]|[0.20486526556822454,8.619113376801409E-50,0.7951347344317755] |2.0 |
[...]
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
Scrolling to the right, you'll see that despite the fact that the probability for class 2.0 here is lower than the threshold we have set (0.8), the row is indeed predicted as 2.0 - in contrast with the binary case demonstrated above...
So, what to do? Simply remove all the threshold-related statements; you don't need them - even setFamily is unnecessary, as the algorithm will detect by itself that you have more than 2 classes. The following will give identical results with the above:
mlor = LogisticRegression() # works OK - no family, no threshold(s)
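And if you genuinely need different decision thresholds per class in the multinomial case, here is a hedged sketch of the manual post-processing route from the short answer: it applies the rule Spark documents for thresholds (predict the class maximizing p/t) to the probability column; the threshold values are illustrative, and my_prediction is a column name introduced here for the example:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

my_thresholds = [0.5, 0.5, 0.8]  # hypothetical per-class thresholds

def thresholded_prediction(probability):
    # scale each class probability by its threshold and take the argmax
    probs = probability.toArray().tolist()
    scaled = [p / t for p, t in zip(probs, my_thresholds)]
    return float(scaled.index(max(scaled)))

predict_udf = udf(thresholded_prediction, DoubleType())

(mlorModel.transform(mdf)
 .withColumn('my_prediction', predict_udf('probability'))
 .select('probability', 'prediction', 'my_prediction')
 .show(5, truncate=False))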
To summarize:

- In both the binary and the multinomial case, what the algorithm actually returns is a vector of probabilities of length equal to the number of classes, with its elements summing to 1.
- In the binary case only, Spark allows you to go one step further and, instead of naively selecting the class with the highest probability as the prediction, apply a user-defined threshold instead; this setting might be useful, e.g., in cases with imbalanced data.
- This threshold(s) setting has actually no effect in the multinomial case, where Spark will always return as prediction the class with the highest probability.
Despite the mess in the documentation (about which I have argued elsewhere) and the possibility of some bugs, let me say about (3) that this design choice is not unjustifiable; as it has been nicely argued elsewhere (emphasis in the original):
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
Although the above argument was made for the binary case, it fully holds for the multinomial one as well...