Parallel GridSearchCV with custom Transformers hangs in IPython Notebooks

Question

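I have code in an IPython Notebook that uses sklearn's GridSearchCV with n_jobs = 4 to parallelize model parameter selection.

It worked fine until I added a custom transformer to the pipeline. As soon as the custom transformer is in the pipeline, it starts "hanging", i.e. the process never finishes, even though CPU usage drops to zero.

When I set n_jobs = 1, it works fine even with the custom transformer.

Here is code that reproduces the problem (copy and paste it into an IPython Notebook cell):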

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline

iris = load_iris()

X = iris["data"]
y = iris["target"]

class DummyTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X

cv = GridSearchCV(estimator=Pipeline(steps=[('dummy', DummyTransformer()),
                                            ('rf', RandomForestClassifier())]),
                  param_grid={"rf__n_estimators": [10, 100]},
                  scoring="f1_weighted",
                  cv=10,
                  n_jobs=2) # n_jobs = 1 works fine, but setting n_jobs = 2 makes the script run forever... :-(
cv.fit(X, y)

cv.grid_scores_

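With n_jobs=1 it works; with n_jobs set to anything greater than 1 it never finishes.

I am using the IPython Notebook that ships with the Anaconda distribution: IPython Notebook v3.2, Python v3.4, Windows 8 x64.

PS: Here is a gist of the whole notebook: https://gist.github.com/anonymous/95b65991e96f5361404c

PPS: I just noticed that the "ipython notebook" process prints the following error to its console window while the code hangs: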

Process SpawnPoolWorker-12:
Traceback (most recent call last):
  File "C:\Anaconda3\lib\multiprocessing\process.py", line 254, in _bootstrap
    self.run()
  File "C:\Anaconda3\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Anaconda3\lib\multiprocessing\pool.py", line 108, in worker
    task = get()
  File "C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\pool.py", line 363, in get
    return recv()
  File "C:\Anaconda3\lib\multiprocessing\connection.py", line 251, in recv
    return ForkingPickler.loads(buf.getbuffer())
AttributeError: Can't get attribute 'DummyTransformer' on <module '__main__' (built-in)>

Answer 1

After some googling I found the following sklearn issue: https://github.com/scikit-learn/scikit-learn/issues/2889

amueller says:

"Try not defining the metric in the notebook, but in a separate file and import it. I'd think that would fix it."

Putting DummyTransformer into utils.py and using "from utils import *" in the notebook really "fixed" it. I would rather call it a workaround, though.

If anyone has a better/real solution, please add an answer!
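
As to why moving the class into a module helps: the traceback in the question comes from a SpawnPoolWorker failing to unpickle DummyTransformer. pickle stores a class by reference (module name plus class name), so the worker has to be able to import that module itself. A class defined in a notebook cell lives only in the kernel's __main__, which a freshly spawned worker on Windows does not have. The following is a minimal sketch of that mechanism using only the standard library. It is not the joblib code path, and the module name utils_demo and the plain-Python stand-in class are assumptions made up for the illustration.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the transformer; plain Python, no scikit-learn.
MODULE_SOURCE = """
class DummyTransformer:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X
"""

# Create an importable module on disk ("utils_demo" is a made-up name).
module_dir = tempfile.mkdtemp()
Path(module_dir, "utils_demo.py").write_text(MODULE_SOURCE)
sys.path.insert(0, module_dir)
importlib.invalidate_caches()
utils_demo = importlib.import_module("utils_demo")

# Pickling an instance stores the module and class names, not the class body.
payload = pickle.dumps(utils_demo.DummyTransformer())
assert b"utils_demo" in payload and b"DummyTransformer" in payload

# A process that can import utils_demo can rebuild the object.
restored = pickle.loads(payload)

# Simulate a worker whose copy of the module lacks the class, which is the
# situation for a class that exists only in the notebook's __main__.
del utils_demo.DummyTransformer
try:
    pickle.loads(payload)
    lookup_error = None
except AttributeError as exc:
    lookup_error = exc
```

The second pickle.loads call fails with the same kind of AttributeError shown in the question. With the class in utils.py, each worker can import utils on its own and resolve the reference.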


python-3.x

scikit-learn

ipython-notebook