why is my cross validation matrix returning nan sklearn
I am trying to cross-validate a simple linear regression model (specifically LOOCV), but for some reason I get nan for every entry when the scores are computed. Does anyone know why?
Here is the code:
# cross-validate a linear regression with sklearn
import numpy as np
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
X = np.array(auto['horsepower']).reshape(-1, 1)
y = np.array(auto['mpg']).reshape(-1, 1)
cv = model_selection.cross_val_score(lr, X, y, cv=len(X))  # cv=len(X) -> LOOCV
Here is the data:
mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino
... ... ... ... ... ... ... ... ... ...
387 27.0 4 140.0 86 2790 15.6 82 1 ford mustang gl
388 44.0 4 97.0 52 2130 24.6 82 2 vw pickup
389 32.0 4 135.0 84 2295 11.6 82 1 dodge rampage
390 28.0 4 120.0 79 2625 18.6 82 1 ford ranger
391 31.0 4 119.0 82 2720 19.4 82 1 chevy s-10
392 rows × 9 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 9 columns):
mpg 392 non-null float64
cylinders 392 non-null int64
displacement 392 non-null float64
horsepower 392 non-null int64
weight 392 non-null int64
acceleration 392 non-null float64
year 392 non-null int64
origin 392 non-null int64
name 392 non-null object
dtypes: float64(3), int64(5), object(1)
memory usage: 27.7+ KB
If you read the documentation of cross_val_score:
scoring: string, callable, list/tuple, dict or None, default: None
[....] If None, the estimator’s score method is used.
For LinearRegression(), that is the R^2 of the prediction. But R^2 is not meaningful when n=1, and LOOCV scores each fold on a single held-out sample. Try something like mean squared error instead; below I use 'neg_mean_squared_error', which is the negated MSE, available from sklearn.metrics.SCORERS.keys().
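To see why each fold comes out as nan: R^2 compares the model's squared error against the variance of the true values, and the variance of a single held-out sample is zero, so the score is undefined. A minimal check (assuming a recent scikit-learn, which warns and returns nan for fewer than two samples rather than raising):

```python
import math
import warnings
from sklearn.metrics import r2_score

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the UndefinedMetricWarning
    score = r2_score([3.0], [2.5])   # one true value, one prediction

print(score)  # nan: R^2 is undefined for a single sample
```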
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
auto = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
                   delimiter=r"\s+", header=None,
                   names=["mpg","cylinders","displacement","horsepower","weight",
                          "acceleration","model year","origin","car name"],
                   na_values=['?'])
auto = auto.dropna()  # the raw file marks missing horsepower values with '?'
lr = LinearRegression()
X = np.array(auto['horsepower']).reshape(-1, 1)
y = np.array(auto['mpg']).reshape(-1, 1)
model_selection.cross_val_score(lr, X, y, cv=len(X), scoring='neg_mean_squared_error')
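Since the scorer returns the negated per-fold errors, the LOOCV estimate of the MSE is the negated mean of the scores. A self-contained sketch with synthetic data standing in for the horsepower/mpg columns (the shapes and fitted model mirror the code above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the horsepower (X) and mpg (y) columns
rng = np.random.default_rng(0)
X = rng.uniform(50, 200, size=100).reshape(-1, 1)
y = 40 - 0.15 * X.ravel() + rng.normal(0, 2, size=100)

scores = cross_val_score(LinearRegression(), X, y,
                         cv=len(X), scoring='neg_mean_squared_error')
loocv_mse = -scores.mean()  # negate to recover the MSE estimate
print(loocv_mse)
```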