Bug in sklearn.cross_validation
There may be a bug in sklearn.cross_validation when using LeaveOneOut. x_test and y_test are not used in LeaveOneOut; instead, validation is done with x_train and y_train:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import LeaveOneOut, cross_val_predict
x = np.array([[1,2],[3,4],[5,6],[7,8],[9,10]])
y = np.array([12,13,19,18,15])
clf = LinearRegression().fit(x,y)
cv = LeaveOneOut(len(y))
for train, test in cv:
    x_train, y_train = x[train], y[train]
    x_test, y_test = x[test], y[test]
    y_pred_USING_x_test = clf.predict(x_test)
    y_pred_USING_x_train = clf.predict(x_train)
    print('y_pred_USING_x_test: ', y_pred_USING_x_test, 'y_pred_USING_x_train: ', y_pred_USING_x_train)
y_pred_USING_x_test: [ 13.2] y_pred_USING_x_train: [ 14.3 15.4 16.5 17.6]
y_pred_USING_x_test: [ 14.3] y_pred_USING_x_train: [ 13.2 15.4 16.5 17.6]
y_pred_USING_x_test: [ 15.4] y_pred_USING_x_train: [ 13.2 14.3 16.5 17.6]
y_pred_USING_x_test: [ 16.5] y_pred_USING_x_train: [ 13.2 14.3 15.4 17.6]
y_pred_USING_x_test: [ 17.6] y_pred_USING_x_train: [ 13.2 14.3 15.4 16.5]
y_pred_USING_x_test gives a single value on every iteration of the for loop, which makes no sense! y_pred_USING_x_train is what is actually found using LeaveOneOut. And the result of the following code is completely unrelated!
bug = cross_val_predict(clf, x, y, cv=cv)
print('bug: ', bug)
bug: [ 15. 14.85714286 14.5 15.85714286 21.5 ]
Any explanation is welcome.
Doing clf = LinearRegression().fit(x,y) after the for loop, it gives the same answer as cross_val_predict(clf, x, y, cv=cv).
There is no bug here. The program is predicting one left-out sample on each iteration of the loop.
Each sample is used once as a test set (singleton)
This means x_test will be an array with a single element, and clf.predict(x_test) will return an array with a single (predicted) element. You can see this in your output. x_train will be the training set without the one element chosen for x_test. This can be confirmed by adding the following lines inside the for loop:
for train, test in cv:
    x_train, y_train = x[train], y[train]
    x_test, y_test = x[test], y[test]
    if len(x_test) != 1 or (len(x_train) + 1 != len(x)):  # Confirmation
        raise Exception
    y_pred_USING_x_test = clf.predict(x_test)
    y_pred_USING_x_train = clf.predict(x_train)
    print('predicting for', x_test, 'and expecting', y_test, 'and got', y_pred_USING_x_test)
    print('predicting for', x_train, 'and expecting', y_train, 'and got', y_pred_USING_x_train)
    print()
    print()
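The singleton structure of the splits can also be checked without scikit-learn at all. Below is a minimal sketch of what LeaveOneOut enumerates (leave_one_out here is a hypothetical helper written for illustration, not a library function): each of the n iterations uses exactly one sample as the test set and the remaining n-1 samples as the training set.

```python
import numpy as np

def leave_one_out(n):
    # Illustrative stand-in for LeaveOneOut: yield (train_indices,
    # test_indices) pairs where each sample serves as the singleton
    # test set exactly once.
    for i in range(n):
        test = np.array([i])
        train = np.concatenate([np.arange(i), np.arange(i + 1, n)])
        yield train, test

splits = list(leave_one_out(5))
for train, test in splits:
    # Every test set is a singleton; every training set has n - 1 samples.
    assert len(test) == 1 and len(train) == 4
```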
Note: this is not proper validation, because you are training and testing your model on the same data. You should create a new LinearRegression object on each iteration of the for loop and train it on x_train and y_train. Use it to predict x_test, then compare y_test with y_pred_USING_x_test:
x = np.array([[1,2],[3,4],[5,6],[7,8],[9,10]])
y = np.array([12,13,19,18,15])
cv = LeaveOneOut(len(y))
for train, test in cv:
    x_train, y_train = x[train], y[train]
    x_test, y_test = x[test], y[test]
    if len(x_test) != 1 or (len(x_train) + 1 != len(x)):
        raise Exception
    clf = LinearRegression()
    clf.fit(x_train, y_train)
    y_pred_USING_x_test = clf.predict(x_test)
    print('predicting for', x_test, 'and expecting', y_test, 'and got', y_pred_USING_x_test)
There is no bug. Two things:
You are performing the cross-validation splits, but you never train on the training set! You need to call clf.fit(x_train, y_train) before calling predict() for it to behave as expected.
By design, the test set in LeaveOneOut is a single sample (i.e., one sample is left out), so each prediction is a single number as well. The cross_val_predict() function is a convenience routine that stitches these single outputs together.
Once you account for these two things, I believe the output of your code will make much more sense.
Here is the result:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import LeaveOneOut, cross_val_predict
x = np.array([[1,2],[3,4],[5,6],[7,8],[9,10]])
y = np.array([12,13,19,18,15])
clf = LinearRegression().fit(x,y)
cv = LeaveOneOut(len(y))
for train, test in cv:
    x_train, y_train = x[train], y[train]
    x_test, y_test = x[test], y[test]
    clf.fit(x_train, y_train)  # <--------------- note added line!
    y_pred_USING_x_test = clf.predict(x_test)
    y_pred_USING_x_train = clf.predict(x_train)
    print('y_pred_USING_x_test: ', y_pred_USING_x_test,
          'y_pred_USING_x_train: ', y_pred_USING_x_train)
print()
print(cross_val_predict(clf, x, y, cv=cv))
Output:
y_pred_USING_x_test: [ 15.] y_pred_USING_x_train: [ 15.5 16. 16.5 17. ]
y_pred_USING_x_test: [ 14.85714286] y_pred_USING_x_train: [ 13.94285714 15.77142857 16.68571429 17.6 ]
y_pred_USING_x_test: [ 14.5] y_pred_USING_x_train: [ 12.3 13.4 15.6 16.7]
y_pred_USING_x_test: [ 15.85714286] y_pred_USING_x_train: [ 13.2 14.08571429 14.97142857 16.74285714]
y_pred_USING_x_test: [ 21.5] y_pred_USING_x_train: [ 11.9 14.3 16.7 19.1]
[ 15. 14.85714286 14.5 15.85714286 21.5 ]
As you can see, the test predictions from the manual loop match the output of cross_val_predict().
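The stitched values can even be reproduced without scikit-learn: refitting an ordinary least-squares model (with an explicit intercept column) once per left-out sample gives the same per-fold predictions. This is only a sketch with NumPy's lstsq; it assumes that on this data the minimum-norm least-squares solution predicts the same values as LinearRegression, which holds here because each test point lies in the row space of its training design matrix.

```python
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], dtype=float)
y = np.array([12, 13, 19, 18, 15], dtype=float)

preds = []
for i in range(len(y)):
    train = np.delete(np.arange(len(y)), i)  # leave sample i out
    # Least squares with an intercept column, refit on each fold's
    # training set, then evaluated at the single held-out sample.
    A = np.column_stack([np.ones(len(train)), x[train]])
    coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    preds.append(np.concatenate([[1.0], x[i]]) @ coef)

print(preds)  # one leave-one-out prediction per sample
```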