Logistic regression using GridSearchCV
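I am trying to figure out how to use logistic regression with GridSearchCV, but I am getting a nasty error, and I cannot tell whether the estimator I pass to GridSearchCV is wrong or whether my LogisticRegression is set up incorrectly. I got this working for random forest and knn, but I am stuck on this implementation.
I am working with a small dataset, which is why I want to use liblinear (even though it is the default, as stated in the docs).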
tuned_parameters = {'C': [0.1, 0.5, 1, 5, 10, 50, 100]}
clf = GridSearchCV(LogisticRegression(solver='liblinear'), tuned_parameters, cv=5, scoring="accuracy")
clf.fit(X_train, y_train)
And the error:
StratifiedShuffleSplit(n_splits=1, random_state=0, test_size=0.4,
train_size=None)
Traceback (most recent call last):
File "linearRegression.py", line 105, in <module>
clf.fit(X_train, y_train)
File "/usr/local/lib/python2.7/dist-packages/sklearn/model_selection/_search.py", line 945, in fit
return self._fit(X, y, groups, ParameterGrid(self.param_grid))
File "/usr/local/lib/python2.7/dist-packages/sklearn/model_selection/_search.py", line 564, in _fit
for parameters in parameter_iterable
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
while self.dispatch_one_batch(iterator):
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 608, in dispatch_one_batch
self._dispatch(tasks)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 571, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/_parallel_backends.py", line 109, in apply_async
result = ImmediateResult(func)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/_parallel_backends.py", line 326, in __init__
self.results = batch()
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/usr/local/lib/python2.7/dist-packages/sklearn/model_selection/_validation.py", line 260, in _fit_and_score
test_score = _score(estimator, X_test, y_test, scorer)
File "/usr/local/lib/python2.7/dist-packages/sklearn/model_selection/_validation.py", line 288, in _score
score = scorer(estimator, X_test, y_test)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/scorer.py", line 91, in __call__
y_pred = estimator.predict(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/base.py", line 336, in predict
scores = self.decision_function(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/base.py", line 320, in decision_function
dense_output=True) + self.intercept_
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.py", line 189, in safe_sparse_dot
return fast_dot(a, b)
TypeError: Cannot cast array data from dtype([('f0', 'f8'), ('f1','f8')]) to dtype('float64') according to the rule 'safe'
I have read the documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Thanks for your help.
Edit:
How X and y are built:
X = np.array(Xlist,np.dtype('float,float')) #-> two floats as features
y = np.array(ylist,np.dtype('int')) #-> label 0 or 1
Example:
X_train is
[[(0.0, 0.0) (3.85, 0.0)] [(3.6, 0.0) (2.45, 0.0)] [(1.1, 0.0)
(1.35, 0.0)] [(3.7, 0.0) (1.85, 0.0)]]
y_train is
[1 0 0 0 1 0 1 1]
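Could it be that you are feeding in the X dataset as a list of tuples, (A, B), rather than a list of arrays, [A, B]?
I was able to run the following code with scikit-learn==0.18.1: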
## Libraries
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
X = [[0.0, 0.0], [3.85, 0.0], [3.6, 0.0], [2.45, 0.0], [1.1, 0.0], [1.35, 0.0], [3.7, 0.0], [1.85, 0.0]]
y = [1, 0, 0, 0, 1, 0, 1, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)
tuned_parameters = {'C': [0.1, 0.5, 1, 5, 10, 50, 100]}
clf = GridSearchCV(LogisticRegression(solver='liblinear'), tuned_parameters, cv=3, scoring="accuracy")
clf.fit(X_train, y_train)
Note: I had to reduce the cv argument of GridSearchCV because the dataset is not large enough to be split into 5 parts.
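To illustrate the note about cv: when cv is an integer and the estimator is a classifier, GridSearchCV splits with stratified folds, so each class needs enough samples to go around. A minimal sketch of one way to pick a fold count for a tiny dataset is below. The helper name max_safe_cv and the sample labels are assumptions made up for illustration; they are not part of scikit-learn or of the original code.

```python
import numpy as np

def max_safe_cv(y, desired=5):
    """Hypothetical helper: cap the fold count at the smallest class size.

    Never returns fewer than 2, since cross-validation needs at least 2 folds.
    """
    _, counts = np.unique(np.asarray(y), return_counts=True)
    return max(2, min(desired, int(counts.min())))

# Illustrative training labels: three samples of class 1, two of class 0.
y_train = [1, 0, 0, 1, 1]
cv = max_safe_cv(y_train)
print(cv)
```

The resulting value could then be passed as the cv argument of GridSearchCV instead of a hard-coded 5.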
OK, a friend of mine solved it.
I was using:
X = np.array(Xlist,np.dtype('float,float'))
y = np.array(ylist,np.dtype('int'))
This does not work with this estimator, even though it does work with these classifiers:
SVC(kernel='rbf'), SVC(kernel='linear'), SVC(kernel='poly'), KNeighborsClassifier(), DecisionTreeClassifier(), RandomForestClassifier()
So I replaced those two lines with:
X = np.asarray(Xlist)
y = np.asarray(ylist)
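The difference between the two versions can be seen by inspecting the arrays. The sketch below uses made-up sample values for Xlist and ylist (an assumption, since the original lists are not shown) and spells the dtype as 'f8,f8', which is the same structured dtype that appears in the error message. The structured_to_unstructured call is an extra option from numpy.lib.recfunctions (available in newer NumPy versions) for converting an existing structured array; it was not part of the original fix.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical sample data standing in for Xlist / ylist from the question.
Xlist = [[0.0, 0.0], [3.85, 0.0], [3.6, 0.0], [2.45, 0.0]]
ylist = [1, 0, 0, 0]

# Structured dtype: every row becomes ONE record with two float fields (f0, f1),
# so the array is 1-D and its dtype is not a plain float64.
X_structured = np.array([tuple(row) for row in Xlist], dtype=np.dtype('f8,f8'))
print(X_structured.shape, X_structured.dtype.names)

# Plain conversion: an ordinary 2-D float64 matrix of shape (n_samples, n_features).
X_plain = np.asarray(Xlist)
y = np.asarray(ylist)
print(X_plain.shape, X_plain.dtype)

# If the data is already stored as a structured array, it can be flattened
# back into a regular 2-D array instead of being rebuilt from the list.
X_fixed = rfn.structured_to_unstructured(X_structured)
print(X_fixed.shape, X_fixed.dtype)
```

The linear model computes a dot product between X and its coefficient vector, which needs a plain numeric matrix; that matches the last frame of the traceback, where the cast from the two-field record dtype to float64 is rejected.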