Why does sklearn Random Forest take the same time to predict one sample as n samples?

Question

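I am using sklearn on Python 3.6, and I noticed that a Random Forest takes the same run time to predict one sample passed as a 1D numpy array as it does to predict n samples passed as a 2D numpy array (~0.1 seconds in both cases). It looks like sklearn spends a fixed amount of time setting up the trees on every predict call and then makes the predictions themselves almost instantly. Would that explain why predicting on a large 2D array takes the same time as predicting on a 1D array?

Here is the code I use to train the model: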

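import pickle  # Python 3 replacement for cPickle

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1,  # or > 1
                             n_jobs=-1,
                             random_state=2,
                             max_depth=15,
                             min_samples_leaf=1,
                             verbose=0,
                             max_features='sqrt'  # what 'auto' meant for classifiers
                             )

clf.fit(X_train, y_train)

# Persist the fitted model for later use
with open('classifier.pkl', 'wb') as fid:
    pickle.dump(clf, fid)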

In my case, I have to predict in real time, one sample at a time, in a loop like this:

import pickle

with open('classifier.pkl', 'rb') as fid:
    clf = pickle.load(fid)

for s in samples:
    # my feature extraction method
    # feature is a 1D np array containing the features computed for sample s;
    # predict expects a 2D array, so it is reshaped into a single row
    pred = clf.predict(feature.reshape(1, -1))

Am I using it wrong? Or is sklearn simply not optimized for predicting one item at a time?

Answer 1

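You are right: sklearn is heavily optimized for vectorized operations, and you are using it correctly. If you can collect the feature vectors into one 2D array and call predict once, you should see a significant speedup: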

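import numpy as np

# Build one 2D array with a row per sample, then predict in a single call.
# n_features is the length of each feature vector.
features = np.zeros((len(samples), n_features))
for i, s in enumerate(samples):
    features[i] = feature_extraction(s)
preds = clf.predict(features)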
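
To check this on your own machine, here is a minimal timing sketch. It rests on several assumptions: the data is synthetic (make_classification), the sizes and the helper names predict_one_by_one and timed are made up for illustration, and max_depth is left at its default. It also tries setting n_jobs=1 before predicting, since with n_jobs=-1 every predict call dispatches work to parallel workers, which is a plausible (but here unverified) source of the fixed per-call cost when batching is not possible. Absolute timings will differ between machines and sklearn versions.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real training data (assumption for illustration).
X, y = make_classification(n_samples=550, n_features=20, random_state=2)
X_train, y_train, X_new = X[:500], y[:500], X[500:]

clf = RandomForestClassifier(n_estimators=10, n_jobs=-1, random_state=2)
clf.fit(X_train, y_train)


def predict_one_by_one(model, rows):
    """Call predict once per sample, paying the fixed per-call cost each time."""
    return np.array([model.predict(row.reshape(1, -1))[0] for row in rows])


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call of fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# One predict call per sample, with n_jobs=-1 as in the question.
loop_preds, loop_time = timed(predict_one_by_one, clf, X_new)

# One predict call for the whole 2D array.
batch_preds, batch_time = timed(clf.predict, X_new)

# Same per-sample loop, but without parallel dispatch at predict time.
clf.set_params(n_jobs=1)
serial_preds, serial_time = timed(predict_one_by_one, clf, X_new)

print(f"one by one, n_jobs=-1: {loop_time:.4f} s")
print(f"single batch call:     {batch_time:.4f} s")
print(f"one by one, n_jobs=1:  {serial_time:.4f} s")
```

All three approaches return the same predictions; only the elapsed time differs. If you really must predict one sample at a time, comparing the first and third timings shows whether n_jobs=1 helps in your setup.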

python

runtime

machine-learning

random-forest

scikit-learn