Sklearn's clone shows unexpected behavior with multiprocessing in Python on Red Hat

Question

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn import clone
import multiprocessing
import functools
import numpy as np

def train_model(n_estimators, base_model, X, y):
    model = clone(base_model)
    model.set_params(n_estimators=n_estimators)
    model.fit(X,y)
    return model


class A():
    def __init__(self, random_state, jobs, **kwargs):
        self.model = RandomForestClassifier(oob_score=True, random_state=random_state, **kwargs)
        self.jobs = jobs


    def fit(self, X, y):
        job_pool = multiprocessing.Pool(self.jobs)
        n_estimators = [100]
        for output in job_pool.imap_unordered(functools.partial(train_model,
                                                                base_model=self.model,
                                                                X=X,
                                                                y=y),n_estimators):
            model = output
        job_pool.terminate()
        self.model = model


if __name__ == '__main__':

    np.random.seed(42)
    X, y = make_classification(n_samples=500,n_informative=6,n_redundant=6, flip_y=0.1)

    print "Class A"
    for i in range(5):
        base_model = A(random_state=None, jobs=1)
        base_model.fit(X,y)
        print base_model.model.oob_score_

    print "Bare RF"
    base_model = RandomForestClassifier(n_estimators=500, max_features=2, oob_score=True, random_state=None)
    for i in range(5):
        model = clone(base_model)
        model.fit(X,y)
        print model.oob_score_

Output on a Windows 7 machine (Python 2.7.13):
(pip freeze: numpy==1.11.0, scikit-image==0.12.3, scikit-learn==0.17, scipy==0.17.0)

Class A
0.82
0.826
0.832
0.822
0.816

Bare RF
0.814
0.81
0.818
0.818
0.818

Output on a Red Hat 4.8.3-9 Linux machine (Python 2.7.5):
(pip freeze: numpy==1.11.0, scikit-learn==0.17, scipy==0.17.0, sklearn==0.0)

Class A
0.818
0.818
0.818
0.818
0.818

Bare RF
0.814
0.81
0.818
0.818
0.818

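So, to summarize:
On Linux, "Class A" (which uses multiprocessing) appears to train the exact same model on every run, hence the identical scores. The behavior I expected is that of the "Bare RF" section, where the scores do not coincide (it is a randomized algorithm). On Windows (PyCharm), the problem cannot be reproduced.

Can you help?

Big edit: created a reproducible code example.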

Answer 1

The solution is to add a reseed inside train_model, the function that is executed in parallel.

def train_model(n_estimators, base_model, X, y):
    np.random.seed()
    model = clone(base_model)
    model.set_params(n_estimators=n_estimators)
    model.fit(X,y)
    return model

Reasoning:

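What happens is that on Unix every worker process is forked from the parent process and inherits the same state of the random number generator. This is why the workers generate identical pseudo-random sequences. On Windows, worker processes are started as fresh interpreters instead of being forked, which is why the problem does not show up there.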
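The inheritance described above can be reproduced without scikit-learn. The following is a minimal sketch, assuming Python 3, NumPy, and a Unix system where the `fork` start method is available; the helper names `draw`, `draw_in_fresh_worker` and `demo` are made up for illustration. Like `A.fit` in the question, it starts a new single-worker pool for every draw, so each worker is forked from a parent whose global NumPy generator has not advanced.

```python
import multiprocessing

import numpy as np


def draw(reseed):
    """Runs in the worker: draw one integer from NumPy's global generator."""
    if reseed:
        # No argument: reseed from fresh OS entropy, as in the fix above.
        np.random.seed()
    return int(np.random.randint(0, 2**31 - 1))


def draw_in_fresh_worker(ctx, reseed):
    """Start a new one-process pool, run a single draw in it, then stop it."""
    pool = ctx.Pool(1)
    try:
        return pool.apply(draw, (reseed,))
    finally:
        pool.terminate()


def demo():
    # Assumption: "fork" is available (Unix only). The forked worker starts
    # with a copy of the parent's memory, including the RNG state.
    ctx = multiprocessing.get_context("fork")
    np.random.seed(42)  # the parent never draws, so its state stays fixed
    inherited = [draw_in_fresh_worker(ctx, False) for _ in range(3)]
    reseeded = [draw_in_fresh_worker(ctx, True) for _ in range(3)]
    return inherited, reseeded


if __name__ == "__main__":
    inherited, reseeded = demo()
    print("inherited state:", inherited)  # three identical numbers
    print("reseeded:       ", reseeded)   # three different numbers
```

With `random_state=None`, scikit-learn estimators draw from this same global NumPy generator, which is consistent with the identical `oob_score_` values seen in the "Class A" runs on Linux.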

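It is multiprocessing that actually launches the worker processes, which is why it is the relevant piece here. So this is not a scikit-learn clone problem.

I found the answer here and here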
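Reseeding with `np.random.seed()` makes the workers differ, but the runs are then no longer reproducible. A possible alternative, not part of the original answer, is to draw one seed per task in the parent and hand it to the worker explicitly. The sketch below assumes Python 3, NumPy and the `fork` start method; `draw_with_seed` and `run` are hypothetical names used only for illustration.

```python
import multiprocessing

import numpy as np


def draw_with_seed(seed):
    """Runs in the worker: use a private generator built from the given seed."""
    rng = np.random.RandomState(seed)
    return int(rng.randint(0, 2**31 - 1))


def run(master_seed, n_tasks):
    # Derive one seed per task in the parent, so the result depends only on
    # master_seed and not on which worker happens to pick up which task.
    seeds = np.random.RandomState(master_seed).randint(
        0, 2**31 - 1, size=n_tasks
    ).tolist()
    # Assumption: "fork" is available (Unix only); it is the start method
    # under which the inherited-state problem occurs.
    ctx = multiprocessing.get_context("fork")
    pool = ctx.Pool(2)
    try:
        return pool.map(draw_with_seed, seeds)
    finally:
        pool.terminate()


if __name__ == "__main__":
    print(run(0, 3))
    print(run(0, 3))  # same master seed, same list
```

In the question's `train_model`, the analogous change would presumably be to pass the per-task seed through `model.set_params(random_state=seed)` instead of relying on the global generator.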

clone

multiprocessing

scikit-learn

python-multiprocessing