why is my cross validation matrix returning nan sklearn
I am trying to cross-validate a simple linear regression model (specifically LOOCV), but for some reason I get nan for every entry when the scores are computed. Does anyone know why?
Here is the code:
# cross-validate a linear regression with sklearn
import numpy as np
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
X = np.array(auto['horsepower']).reshape(-1, 1)
y = np.array(auto['mpg']).reshape(-1, 1)
cv = model_selection.cross_val_score(lr, X, y, cv=len(X))  # cv=len(X) -> LOOCV
Here is the data:
mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino
... ... ... ... ... ... ... ... ... ...
387 27.0 4 140.0 86 2790 15.6 82 1 ford mustang gl
388 44.0 4 97.0 52 2130 24.6 82 2 vw pickup
389 32.0 4 135.0 84 2295 11.6 82 1 dodge rampage
390 28.0 4 120.0 79 2625 18.6 82 1 ford ranger
391 31.0 4 119.0 82 2720 19.4 82 1 chevy s-10
392 rows × 9 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 9 columns):
mpg 392 non-null float64
cylinders 392 non-null int64
displacement 392 non-null float64
horsepower 392 non-null int64
weight 392 non-null int64
acceleration 392 non-null float64
year 392 non-null int64
origin 392 non-null int64
name 392 non-null object
dtypes: float64(3), int64(5), object(1)
memory usage: 27.7+ KB
If you read the documentation of cross_val_score:
scoring: string, callable, list/tuple, dict or None, default: None
[....] If None, the estimator’s score method is used.
For LinearRegression(), that is the R^2 of the prediction. But R^2 is not meaningful when n=1, and LOOCV scores each fold on a single held-out sample. Try something like mean squared error instead; below I use 'neg_mean_squared_error', which is the negated MSE, available from sklearn.metrics.SCORERS.keys().
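To see why each fold comes out as nan: R^2 compares the model's squared error against the variance of the true values, and the variance of a single held-out sample is zero, so the score is undefined. A minimal check (assuming a recent scikit-learn, which warns and returns nan for fewer than two samples rather than raising):

```python
import math
import warnings
from sklearn.metrics import r2_score

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the UndefinedMetricWarning
    score = r2_score([3.0], [2.5])   # one true value, one prediction

print(score)  # nan: R^2 is undefined for a single sample
```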
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
auto = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
                   delimiter=r"\s+", header=None,
                   names=["mpg","cylinders","displacement","horsepower","weight",
                          "acceleration","model year","origin","car name"],
                   na_values=['?'])
auto = auto.dropna()  # the raw file marks missing horsepower values with '?'
lr = LinearRegression()
X = np.array(auto['horsepower']).reshape(-1, 1)
y = np.array(auto['mpg']).reshape(-1, 1)
model_selection.cross_val_score(lr, X, y, cv=len(X), scoring='neg_mean_squared_error')
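Since the scorer returns the negated per-fold errors, the LOOCV estimate of the MSE is the negated mean of the scores. A self-contained sketch with synthetic data standing in for the horsepower/mpg columns (the shapes and fitted model mirror the code above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the horsepower (X) and mpg (y) columns
rng = np.random.default_rng(0)
X = rng.uniform(50, 200, size=100).reshape(-1, 1)
y = 40 - 0.15 * X.ravel() + rng.normal(0, 2, size=100)

scores = cross_val_score(LinearRegression(), X, y,
                         cv=len(X), scoring='neg_mean_squared_error')
loocv_mse = -scores.mean()  # negate to recover the MSE estimate
print(loocv_mse)
```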