Sklearn's clone shows unexpected behavior with multiprocessing in Python on Red Hat
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn import clone
import multiprocessing
import functools
import numpy as np

def train_model(n_estimators, base_model, X, y):
    # Runs inside a worker process: fit a fresh copy of the base model.
    model = clone(base_model)
    model.set_params(n_estimators=n_estimators)
    model.fit(X, y)
    return model

class A():
    def __init__(self, random_state, jobs, **kwargs):
        self.model = RandomForestClassifier(oob_score=True, random_state=random_state, **kwargs)
        self.jobs = jobs

    def fit(self, X, y):
        job_pool = multiprocessing.Pool(self.jobs)
        n_estimators = [100]
        for output in job_pool.imap_unordered(functools.partial(train_model,
                                                                base_model=self.model,
                                                                X=X,
                                                                y=y), n_estimators):
            model = output
        job_pool.terminate()
        self.model = model

if __name__ == '__main__':
    np.random.seed(42)
    X, y = make_classification(n_samples=500, n_informative=6, n_redundant=6, flip_y=0.1)

    # Forest trained through the multiprocessing wrapper.
    print("Class A")
    for i in range(5):
        base_model = A(random_state=None, jobs=1)
        base_model.fit(X, y)
        print(base_model.model.oob_score_)

    # Forest trained directly in the parent process.
    print("Bare RF")
    base_model = RandomForestClassifier(n_estimators=500, max_features=2, oob_score=True, random_state=None)
    for i in range(5):
        model = clone(base_model)
        model.fit(X, y)
        print(model.oob_score_)
Output on a Windows 7 machine (Python 2.7.13):
(pip freeze: numpy==1.11.0, scikit-image==0.12.3, scikit-learn==0.17, scipy==0.17.0)
Class A
0.82
0.826
0.832
0.822
0.816
Bare RF
0.814
0.81
0.818
0.818
0.818
Output on a Red Hat 4.8.3-9 Linux machine (Python 2.7.5):
(pip freeze: numpy==1.11.0, scikit-learn==0.17, scipy==0.17.0, sklearn==0.0)
Class A
0.818
0.818
0.818
0.818
0.818
Bare RF
0.814
0.81
0.818
0.818
0.818
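So, to summarize:
On Linux, "Class A" (which uses multiprocessing) appears to train exactly the same model every time, so the scores are identical. The behavior I expect is that of the "Bare RF" section, where the scores do not all coincide (this is a randomized algorithm). On Windows (PyCharm), the problem cannot be reproduced.
Can you help?
Big edit: created a reproducible code example.
The solution is to reseed inside train_model, the function that is executed in the worker processes.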
def train_model(n_estimators, base_model, X, y):
    # Reseed NumPy's global RNG in the worker so it does not reuse the parent's state.
    np.random.seed()
    model = clone(base_model)
    model.set_params(n_estimators=n_estimators)
    model.fit(X, y)
    return model
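Calling np.random.seed() with no argument makes the runs differ, but it also makes them non-reproducible. A possible alternative, sketched here under assumptions rather than taken from the original post, is to draw one integer seed per task from a single master generator and hand it to the worker. The names make_task_seeds and noisy_task are hypothetical; noisy_task only stands in for train_model, where the seed would presumably be applied with model.set_params(random_state=seed).

```python
import numpy as np

def make_task_seeds(master_seed, n_tasks):
    """Draw one integer seed per task from a single master generator (hypothetical helper)."""
    master = np.random.RandomState(master_seed)
    return [int(s) for s in master.randint(0, 2**31 - 1, size=n_tasks)]

def noisy_task(seed):
    """Placeholder for train_model: all randomness comes from the seed it is given."""
    rng = np.random.RandomState(seed)
    return rng.rand()

# Each task gets its own seed, so the tasks differ from each other,
# yet the whole run can be repeated by reusing master_seed.
seeds = make_task_seeds(master_seed=42, n_tasks=5)
results = [noisy_task(s) for s in seeds]
print(seeds)
print(results)
```

Because each seed travels with its task as an argument, the outcome no longer depends on what RNG state the worker process happened to inherit.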
Reasoning:
What happens is that on Unix every worker process inherits the same state of the random number generator from the parent process. This is why they generate identical pseudo-random sequences.
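It is multiprocessing that actually launches the worker processes, which is why it is the relevant factor here. So this is not a scikit-learn clone issue.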
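The effect can be illustrated without starting any processes. The sketch below is only a simulation: it copies a generator's state by hand to imitate what a forked worker receives, and simulated_worker is a hypothetical name, not part of multiprocessing or scikit-learn.

```python
import numpy as np

# Stand-in for the parent process: a seeded generator and a snapshot of its state.
parent = np.random.RandomState(42)
parent_state = parent.get_state()

def simulated_worker(inherited_state, reseed):
    """Imitate a forked worker that starts from the parent's RNG state (hypothetical)."""
    rng = np.random.RandomState()
    rng.set_state(inherited_state)  # every worker begins with the same copied state
    if reseed:
        rng.seed()                  # fresh OS entropy, analogous to np.random.seed()
    return rng.randint(0, 2**31 - 1, size=3)

same_a = simulated_worker(parent_state, reseed=False)
same_b = simulated_worker(parent_state, reseed=False)
fresh_a = simulated_worker(parent_state, reseed=True)
fresh_b = simulated_worker(parent_state, reseed=True)

print(np.array_equal(same_a, same_b))    # identical draws when nothing is reseeded
print(np.array_equal(fresh_a, fresh_b))  # reseeded workers are expected to differ
```

This would also be consistent with the Windows result above, since worker processes there are started fresh rather than forked and so do not begin with a copy of the parent's generator state.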
I found the answer here and here.