Spark MLlib linear regression (linear least squares) giving random results
I'm new to Spark and machine learning. I've successfully followed some MLlib tutorials, but I can't get this one to work.

I found the sample code here:
https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression
(the LinearRegressionWithSGD section)

The code is as follows:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
// Save and load model
model.save(sc, "myModelPath")
val sameModel = LinearRegressionModel.load(sc, "myModelPath")
(This is exactly the code from the site.)

The result is

training Mean Squared Error = 6.2087803138063045

and

valuesAndPreds.collect

gives
Array[(Double, Double)] = Array((-0.4307829,-1.8383286021929077),
(-0.1625189,-1.4955700806407322), (-0.1625189,-1.118820892849544),
(-0.1625189,-1.6134108278724875), (0.3715636,-0.45171266551058276),
(0.7654678,-1.861316066986158), (0.8544153,-0.3588282725617985),
(1.2669476,-0.5036812148225209), (1.2669476,-1.1534698170911792),
(1.2669476,-0.3561392231695041), (1.3480731,-0.7347031705813306),
(1.446919,-0.08564658011814863), (1.4701758,-0.656725375080344),
(1.4929041,-0.14020483324910105), (1.5581446,-1.9438858658143454),
(1.5993876,-0.02181165554398845), (1.6389967,-0.3778677315868635),
(1.6956156,-1.1710092824030043), (1.7137979,0.27583044213064634),
(1.8000583,0.7812664902440078), (1.8484548,0.94605507153074),
(1.8946169,-0.7217282082851512), (1.9242487,-0.24422843221437684),...
My problem is that the predictions look completely random (and wrong), and since this is an exact copy of the website example, with the same input data (the training set), I don't know where to look. Am I missing something?

Please give me some advice or clues about where to search, something I can read and experiment with.

Thanks!
Linear regression here is SGD-based, so the step size needs tuning; see http://spark.apache.org/docs/latest/mllib-optimization.html for details.

In your example, if you set the step size to 0.1, you get a much better result (MSE = 0.5):
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
// Build the model
val regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.1)
val model = regression.run(parsedData)
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
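The sensitivity to the step size can be illustrated outside Spark with a plain batch gradient-descent loop on the same squared loss. This is a standalone sketch on made-up 1-D data, not MLlib code:

```python
# Batch gradient descent for least squares on synthetic 1-D data (y = 2x),
# showing how the step size controls whether the loop converges or diverges.

def gd_mse(step_size, num_iterations=100):
    data = [(0.1 * i, 0.2 * i) for i in range(1, 11)]  # y = 2x, optimum w = 2.0
    w = 0.0
    for _ in range(num_iterations):
        # gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= step_size * grad
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

print(gd_mse(0.1))   # moderate step: w approaches 2.0, MSE is tiny
print(gd_mse(30.0))  # step far too large: updates overshoot and diverge
```

The same trade-off applies inside `GradientDescent` in MLlib: too small a step converges slowly, too large a step oscillates or diverges, which is why the default can produce results that look random on this dataset.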
For another example on a more realistic dataset, see

As zero323 explained, setting the intercept to true solves the problem. If it is not set to true, the regression line is forced through the origin, which is not appropriate for this data. (I'm not sure why this isn't included in the sample code.)

So, to fix the problem, change the following line in your (PySpark) code:
model = LinearRegressionWithSGD.train(parsedData, numIterations)
to
model = LinearRegressionWithSGD.train(parsedData, numIterations, intercept=True)
Although not explicitly mentioned, this is also why the code in selvinsource's answer above works; changing the step size does not help much in this example.
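Why forcing the line through the origin hurts can be seen with a tiny closed-form least-squares fit. This is a standalone Python sketch on made-up data with a nonzero offset (like the lpsa labels), not Spark code:

```python
# Ordinary least squares on 1-D data whose labels have a constant offset,
# comparing a fit through the origin with a fit that includes an intercept.

def fit_no_intercept(data):
    # minimise sum (w*x - y)^2  ->  w = sum(x*y) / sum(x*x)
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

def fit_with_intercept(data):
    # standard simple-regression formulas for slope and intercept
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    w = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
    return w, my - w * mx

def mse(preds, data):
    return sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / len(data)

data = [(float(x), x + 5.0) for x in range(1, 6)]  # y = x + 5

w0 = fit_no_intercept(data)
w1, b1 = fit_with_intercept(data)
print(mse([w0 * x for x, _ in data], data))       # large error: line pinned to the origin
print(mse([w1 * x + b1 for x, _ in data], data))  # ~0: intercept absorbs the offset
```

With `intercept=True`, MLlib fits the extra bias term `b1` above, so the model is no longer pinned to predict 0 at the origin.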