Parallel GridSearchCV with custom Transformers hangs in IPython Notebooks
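I have code in an IPython Notebook that uses sklearn's GridSearchCV with n_jobs = 4 to select model parameters in parallel.
It worked fine until I added a custom transformer to the pipeline. As soon as I added the custom transformer, it started "hanging": the process never finishes, even though CPU usage drops to zero.
When I set n_jobs = 1, it works fine even with the custom transformer.
Here is code that reproduces the problem (copy and paste it into an IPython Notebook cell):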
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline

iris = load_iris()
X = iris["data"]
y = iris["target"]

class DummyTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

cv = GridSearchCV(estimator=Pipeline(steps=[('dummy', DummyTransformer()),
                                            ('rf', RandomForestClassifier())]),
                  param_grid={"rf__n_estimators": [10, 100]},
                  scoring="f1_weighted",
                  cv=10,
                  n_jobs=2)  # n_jobs = 1 works fine, but setting n_jobs = 2 makes the script run forever... :-(
cv.fit(X, y)
cv.grid_scores_
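With n_jobs=1 it works; with n_jobs > 1 it never finishes.
I am using the IPython Notebook that ships with the Anaconda distribution: IPython Notebook v3.2, Python v3.4, Windows 8 x64.
PS: Here is the whole notebook as a gist: https://gist.github.com/anonymous/95b65991e96f5361404c
PPS: I just noticed that the "ipython notebook" process prints the following error to its console window while the code hangs: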
Process SpawnPoolWorker-12:
Traceback (most recent call last):
  File "C:\Anaconda3\lib\multiprocessing\process.py", line 254, in _bootstrap
    self.run()
  File "C:\Anaconda3\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Anaconda3\lib\multiprocessing\pool.py", line 108, in worker
    task = get()
  File "C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\pool.py", line 363, in get
    return recv()
  File "C:\Anaconda3\lib\multiprocessing\connection.py", line 251, in recv
    return ForkingPickler.loads(buf.getbuffer())
AttributeError: Can't get attribute 'DummyTransformer' on <module '__main__' (built-in)>
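The worker name SpawnPoolWorker suggests the workers were started with the spawn method, which is the only start method available on Windows. A spawned worker begins with a fresh interpreter, and pickle stores an instance as a reference to its class (module name plus class name) rather than the class body. The worker therefore has to find DummyTransformer in its own __main__, where the notebook cell was never run. The following is a minimal single-process sketch of that mechanism, not the joblib code path itself. The plain DummyTransformer stand-in (no sklearn base classes) and the deletion of the class to mimic a fresh worker are assumptions made for illustration.

```python
import pickle
import sys


class DummyTransformer:
    """Stand-in for the notebook-defined transformer (no sklearn needed)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


# Pickle records a reference (module name + class name), not the class body.
payload = pickle.dumps(DummyTransformer())
module_name = DummyTransformer.__module__
print(module_name.encode() in payload, b"DummyTransformer" in payload)

# Assumption for illustration: mimic a freshly spawned worker by removing
# the class from the module it was defined in before unpickling.
delattr(sys.modules[module_name], "DummyTransformer")

try:
    pickle.loads(payload)
    load_error = None
except AttributeError as exc:
    load_error = exc

# Same kind of error as in the traceback above.
print(type(load_error).__name__, load_error)
```

If this reading is right, the hang would be a consequence of the worker dying while the parent keeps waiting for a result that never arrives.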
After some googling, I found the following sklearn issue:
https://github.com/scikit-learn/scikit-learn/issues/2889
amueller says:
"Try not defining the metric in the notebook, but in a separate file
and import it. I'd think that would fix it."
Putting DummyTransformer into utils.py and using "from utils import *" in the notebook really did "fix" it. I would rather call it a workaround, though.
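A minimal sketch of that workaround, under some assumptions: the module name notebook_transformers is hypothetical (I used utils.py), the file is written to a temporary directory only so the snippet is self-contained, and the class is a plain stand-in. In the real module it would subclass BaseEstimator and TransformerMixin as in the code above. The point it tries to show is that a class imported from a real module is pickled with that module's name, which any process can resolve by importing the module.

```python
import importlib
import pathlib
import pickle
import sys
import tempfile

# Source of the helper module. In practice this would simply be saved as a
# .py file next to the notebook; the module name here is hypothetical.
MODULE_SOURCE = '''\
class DummyTransformer:
    """Pass-through transformer (stand-in without sklearn base classes)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X
'''

module_dir = pathlib.Path(tempfile.mkdtemp())
(module_dir / "notebook_transformers.py").write_text(MODULE_SOURCE)
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

from notebook_transformers import DummyTransformer

# The pickled reference now names an importable module instead of __main__.
payload = pickle.dumps(DummyTransformer())
restored = pickle.loads(payload)

print(DummyTransformer.__module__)             # notebook_transformers
print(b"notebook_transformers" in payload)     # True
print(isinstance(restored, DummyTransformer))  # True
```

An explicit "from notebook_transformers import DummyTransformer" keeps the notebook namespace easier to follow than "from utils import *", though either form should work for this purpose. The module also has to be importable from the worker processes, for example by keeping the file in the notebook's working directory.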
If anyone has a better/real solution, please add an answer!