PySpark 2.2.0: concept behind the rawPrediction field of a logistic regression model
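I am trying to understand the output generated by a logistic regression model in PySpark.
Could anyone explain the concept behind how the rawPrediction field of a logistic regression model is computed?
Thanks.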
Older versions of the Spark javadocs (e.g. 1.5.x) used to contain the following explanation:
The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).
It is absent from later versions, but it can still be found in the Scala source code.
In any case, unfortunate wording aside, the rawPrediction column in Spark ML, for the logistic regression case, is what the rest of the world calls logits, i.e. the raw output of a logistic regression classifier, which is subsequently transformed into a probability score using the logistic function exp(x)/(1+exp(x)).
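For reference, the logistic function mentioned above and its inverse (the logit) can be written out as below. The symbols \sigma and p are notation introduced here for illustration, not something taken from the Spark documentation.

```latex
\sigma(x) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}},
\qquad
\sigma(-x) = 1 - \sigma(x),
\qquad
x = \log\frac{p}{1 - p} \quad \text{where } p = \sigma(x).
```

The middle identity is why a pair of raw scores of the form (-x, x) maps to two probabilities that sum to 1.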
Here is an example with toy data:
spark.version
# u'2.2.0'
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
df = sqlContext.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.0)),
    (1.0, Vectors.dense(1.0, 0.0))],
    ["label", "features"])
df.show()
# +-----+---------+
# |label| features|
# +-----+---------+
# | 0.0|[0.0,1.0]|
# | 1.0|[1.0,0.0]|
# +-----+---------+
lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)
test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                       Row(features=Vectors.dense(0.5, 0.2))]).toDF()
lr_result = lr_model.transform(test)
lr_result.show(truncate=False)
The result is:
+---------+----------------------------------------+----------------------------------------+----------+
|features | rawPrediction | probability |prediction|
+---------+----------------------------------------+----------------------------------------+----------+
|[0.2,0.5]|[0.9894187891647654,-0.9894187891647654]|[0.7289731070426124,0.27102689295738763]| 0.0 |
|[0.5,0.2]|[-0.9894187891647683,0.9894187891647683]|[0.2710268929573871,0.728973107042613] | 1.0 |
+---------+----------------------------------------+----------------------------------------+----------+
Now let's confirm that applying the logistic function to rawPrediction gives the probability column:
import numpy as np
x1 = np.array([0.9894187891647654,-0.9894187891647654])
np.exp(x1)/(1+np.exp(x1))
# array([ 0.72897311, 0.27102689])
x2 = np.array([-0.9894187891647683,0.9894187891647683])
np.exp(x2)/(1+np.exp(x2))
# array([ 0.27102689, 0.72897311])
i.e. this is indeed the case.
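So, to summarize all three (3) output columns:
rawPrediction is the raw output of the logistic regression classifier (an array with length equal to the number of classes)
probability is the result of applying the logistic function to rawPrediction (an array with the same length as rawPrediction)
prediction is the argument at which the array probability takes its maximum value, i.e. the most probable label (a single number)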
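The relationship between the three columns can be sketched in plain Python, without Spark. This is only an illustration for the binary case: binary_lr_outputs is a hypothetical helper, and it assumes the raw prediction has the form [-margin, margin], where margin is the linear score of label 1. That assumption matches the symmetric values in the table above. The margins used below are copied from that table rather than computed from a fitted model.

```python
import math


def sigmoid(z):
    """Logistic function, written in the equivalent form 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_lr_outputs(margin):
    """Hypothetical helper: mimic the three output columns for a single row.

    `margin` stands for the linear score (w . x + b) of label 1.
    Assumption: the raw prediction is laid out as [-margin, margin].
    """
    raw_prediction = [-margin, margin]
    # Apply the logistic function to each raw score.
    probability = [sigmoid(r) for r in raw_prediction]
    # The predicted label is the index of the largest probability.
    prediction = float(probability.index(max(probability)))
    return raw_prediction, probability, prediction


# Margins taken from the rawPrediction column of the two test rows above.
for margin in (-0.9894187891647654, 0.9894187891647683):
    raw, prob, pred = binary_lr_outputs(margin)
    print(raw, prob, pred)
```

The printed probabilities should agree with the probability column above up to floating-point rounding, and the predicted labels with the prediction column.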