Why does GridSearchCV in scikit-learn spawn so many threads?

Question

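Here is the pstree output of my currently running GridSearch. I am curious about what is going on in there, and there are a few things I cannot explain yet.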

 ├─bash─┬─perl───20*[bash───python─┬─5*[python───31*[{python}]]]
 │      │                          └─11*[{python}]]
 │      └─tee
 └─bash───pstree

I removed unrelated content. Curly braces denote threads.

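perl appears because I started my Python jobs with parallel -j 20. As you can see, 20* indeed shows 20 processes.
The bash process in front of each python process comes from using source activate venv.
Inside each python process, 5 more python processes (5*) are spawned. This is because I passed n_jobs=5 to GridSearchCV.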

That is where my understanding ends.

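Question: can anyone explain why there are another 11 python threads (11*[{python}]) alongside the grid search, and why 31 python threads (31*[{python}]) are spawned inside each of the 5 grid-search jobs?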
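To put the pstree line in perspective, the multipliers can simply be multiplied out. The sketch below uses only the counts visible in the output above; the variable names are made up for illustration.

```python
# Counts read directly from the pstree output above.
parallel_jobs = 20        # 20*[bash───python ...] started by parallel -j 20
workers_per_job = 5       # 5*[python ...] from n_jobs=5
threads_per_worker = 31   # 31*[{python}] inside each worker
threads_per_parent = 11   # 11*[{python}] inside each parent python

# Each job has one parent python process plus its workers.
python_processes = parallel_jobs * (1 + workers_per_job)

# Additional threads on top of each process's main thread.
extra_threads = parallel_jobs * (
    workers_per_job * threads_per_worker + threads_per_parent
)

print(python_processes)  # 120
print(extra_threads)     # 3320
```

So this single machine carries 120 python processes and more than three thousand additional threads, which is worth keeping in mind when reading the resource error further down.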

Update: added the code that calls GridSearchCV:

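import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Xs, ys are the feature matrix and the labels, defined elsewhere
Cs = 10 ** np.arange(-2, 2, 0.1)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = LogisticRegression()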
gs = GridSearchCV(
    clf,
    param_grid={'C': Cs, 'penalty': ['l1'],
                'tol': [1e-10], 'solver': ['liblinear']},
    cv=skf,
    scoring='neg_log_loss',
    n_jobs=5,
    verbose=1,
    refit=True)
gs.fit(Xs, ys)
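For reference, the amount of work this call schedules can be worked out from the parameters alone. This is a back-of-the-envelope sketch, not scikit-learn code; it relies on np.arange producing ceil((stop - start) / step) values and on refit=True adding one final fit on the full data.

```python
import math

# np.arange(-2, 2, 0.1) yields ceil((stop - start) / step) values of C.
n_C = math.ceil((2 - (-2)) / 0.1)

# The other grid entries ('penalty', 'tol', 'solver') hold one value each.
n_candidates = n_C * 1 * 1 * 1

n_splits = 10                        # StratifiedKFold(n_splits=10)
n_cv_fits = n_candidates * n_splits  # one fit per candidate and fold
n_total_fits = n_cv_fits + 1         # refit=True adds a final fit

print(n_candidates, n_cv_fits, n_total_fits)  # 40 400 401
```

The 400 cross-validation fits are what get distributed over the n_jobs=5 worker processes.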

Update (2017-09-27):

I put the test code into a gist so that you can easily reproduce this if you are interested.

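I tested the same code on a Mac Pro and on several Linux machines, and I could reproduce @igrinis's result, but only on the Mac Pro. On the Linux machines I get numbers that differ from before, but they are consistent. So the number of threads spawned may depend on the particular data fed to GridSearchCV.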

python─┬─5*[python───31*[{python}]]
       └─3*[{python}]
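On Linux, the thread counts that pstree reports can be cross-checked from inside Python. The helper below is a minimal sketch (the function name is hypothetical); it assumes a Linux-style /proc filesystem, where /proc/<pid>/task contains one entry per thread, and returns None anywhere else, so it does not help on macOS.

```python
import os


def native_thread_count(pid=None):
    """Return the number of OS threads of a process, or None without /proc."""
    pid = os.getpid() if pid is None else pid
    try:
        # On Linux, every thread of the process has a directory here.
        return len(os.listdir(f"/proc/{pid}/task"))
    except OSError:
        # No /proc (for example on macOS) or the process is gone.
        return None


print(native_thread_count())
```

Unlike threading.active_count(), this also counts threads created by native libraries, which is the kind pstree shows in curly braces.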

Note that the pstree installed by homebrew/linuxbrew differs between the Mac Pro and the Linux machines. Here are the exact versions I used:

Mac:

pstree $Revision: 2.39 $ by Fred Hucht (C) 1993-2015
EMail: fred AT thp.uni-due.de

Linux:

pstree (PSmisc) 22.20
Copyright (C) 1993-2009 Werner Almesberger and Craig Small

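The Mac version does not seem to have an option for showing threads, which I think may be why they do not appear in the result. I have not yet found an easy way to inspect threads on the Mac Pro. If you happen to know one, please comment.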

Update (2017-10-12):

In another set of experiments, I confirmed that setting the environment variable OMP_NUM_THREADS makes a difference.

Before export OMP_NUM_THREADS=1, each worker has many threads (63 in this case) with no clear purpose, as described above:

bash───python─┬─23*[python───63*[{python}]]
              └─3*[{python}]

Linux parallel is not used here. n_jobs=23.

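After export OMP_NUM_THREADS=1, the worker processes no longer spawn any threads, but the 3 threads in the parent Python process (3*[{python}]) are still there, and I still do not know what they are for.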

bash───python─┬─23*[python]
              └─3*[{python}]
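The same setting can also be applied from inside the script instead of the shell. This is a sketch under two assumptions: the extra threads come from an OpenMP runtime that reads OMP_NUM_THREADS when it is loaded (so the variable has to be set before numpy or scikit-learn are imported), and the worker processes inherit the parent's environment. Other threaded backends read their own variables, such as OPENBLAS_NUM_THREADS or MKL_NUM_THREADS.

```python
import os
import subprocess
import sys

# Set this before importing numpy / scikit-learn, so that the native
# runtime sees it when it initializes.
os.environ["OMP_NUM_THREADS"] = "1"

# Child processes inherit the environment by default; verify that with
# a throwaway child that only prints the variable.
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ.get('OMP_NUM_THREADS'))"],
    capture_output=True,
    text=True,
    check=True,
)
seen_by_child = child.stdout.strip()
print(seen_by_child)  # 1
```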

I originally came across OMP_NUM_THREADS because it was causing errors in some of my GridSearchCV jobs, with error messages like this:

OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.
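System error #11 on Linux is EAGAIN, which thread creation returns when the system cannot provide resources for another thread. One possible cause (an assumption here, not something the log confirms) is the per-user limit on processes and threads, RLIMIT_NPROC. The sketch below only inspects that limit; it needs a Unix-like system because it uses the resource module.

```python
import errno
import os
import resource

# Symbolic name and message for error number 11 on this platform
# (EAGAIN on Linux; other systems may number their errors differently).
print(errno.errorcode.get(11), "-", os.strerror(11))

# Per-user limit on processes; on Linux threads count against it too.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)


def show(limit):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


print("RLIMIT_NPROC soft:", show(soft), "hard:", show(hard))
```

If the soft limit is in the low thousands, a few dozen workers with dozens of threads each can plausibly reach it.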

Answer 1

From the sklearn GridSearchCV documentation:

n_jobs : int, default=1 Number of jobs to run in parallel.

pre_dispatch : int, or string, optional
    Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
    - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
    - An int, giving the exact number of total jobs that are spawned
    - A string, giving an expression as a function of n_jobs, as in '2*n_jobs'

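If I understand the documentation correctly, GridSearchCV spawns a batch of threads, one per grid point, and runs only n_jobs of them at a time. I believe the number 31 is some kind of cap on your 40 possible values. Try experimenting with the value of the pre_dispatch parameter.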
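To make the three forms of pre_dispatch from the quoted documentation concrete, here is a small stand-alone sketch. It only mirrors the documented meaning and is not the actual scikit-learn or joblib implementation; the function name and the total of 400 tasks (40 values of C times 10 folds) are illustrative assumptions.

```python
def dispatched_upfront(pre_dispatch, n_jobs, total_tasks):
    """Number of tasks queued right away, per the documented semantics."""
    if pre_dispatch is None:
        # None: all jobs are created immediately.
        return total_tasks
    if isinstance(pre_dispatch, str):
        # A string is an expression in terms of n_jobs, e.g. '2*n_jobs'.
        pre_dispatch = eval(
            pre_dispatch, {"__builtins__": {}}, {"n_jobs": n_jobs}
        )
    # An int gives the number of jobs directly; it cannot exceed the total.
    return min(int(pre_dispatch), total_tasks)


print(dispatched_upfront("2*n_jobs", n_jobs=5, total_tasks=400))  # 10
print(dispatched_upfront(None, n_jobs=5, total_tasks=400))        # 400
print(dispatched_upfront(8, n_jobs=5, total_tasks=400))           # 8
```

Note that, under this reading, pre_dispatch governs how many tasks are queued, which is a different quantity from the number of threads inside each worker process.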

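As for the other 11 threads, I think they have nothing to do with GridSearchCV itself, since they are shown at the same level. I think they are left over from other commands.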

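By the way, I do not observe this behavior on a Mac (I see only the 5 processes spawned by GridSearchCV, as one would expect), so it may come from incompatible libraries. Try updating sklearn and numpy manually.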

Here is my pstree output (parts of the paths removed for privacy reasons):

 └─┬= 00396 *** -fish
   └─┬= 21743 *** python /Users/***/scratch_5.py
     ├─── 21775 *** python /Users/***/scratch_5.py
     ├─── 21776 *** python /Users/***/scratch_5.py
     ├─── 21777 *** python /Users/***/scratch_5.py
     ├─── 21778 *** python /Users/***/scratch_5.py
     └─── 21779 *** python /Users/***/scratch_5.py

Answer to the second comment:

It is actually your code. I just generated a separable 1D two-class problem:

import numpy as np

N = 50000
# Class 0 lies in [0, 1), class 1 in [3, 4), so the classes are separable
Xs = np.concatenate((np.random.random(N), 3 + np.random.random(N))).reshape(-1, 1)
ys = np.concatenate((np.zeros(N), np.ones(N)))

100,000 samples are enough to keep the CPU busy for about a minute.

Tags: python, multithreading, scikit-learn, grid-search