Is it possible to fit() a scikit-learn model in a loop or with an iterator

Question

People usually train a model with scikit-learn like this:

from sklearn.ensemble import GradientBoostingClassifier as gbc
clf = gbc()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)

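This works fine as long as the user's memory is large enough to hold the whole dataset. My dilemma is exactly that: the dataset is too big for my memory. My current workaround is to enlarge my machine's virtual memory, and relying on that much virtual memory has already made the system extremely slow. So I started wondering whether I could feed the fit() method with samples in batches, like this (the answer is NO, please keep reading and stop reminding me that the answer is NO):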

clf = gbc()
for i in range(X_train.shape[0]):
    clf.fit(X_train[i], y_train[i])

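That way I could read the training set from the hard drive only when it is needed. I read sklearn's manual, and it seems to me that it does not support this: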

Calling fit() more than once will overwrite what was learned by any previous fit()

So, is this possible?

Answer 1

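As stated in the comments and in the documentation, this does not work in scikit-learn. However, you can use river, a Python package for online/streaming machine learning. It should be a good fit for your problem.

Below is an example of training a LinearRegression with river.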

from river import datasets
from river import linear_model
from river import metrics
from river import preprocessing

dataset = datasets.TrumpApproval()

model = (
    preprocessing.StandardScaler() |
    linear_model.LinearRegression(intercept_lr=.1)
)
metric = metrics.MAE()

for x, y in dataset:
    y_pred = model.predict_one(x)

    # Update the running metric with the prediction and ground truth value
    metric.update(y, y_pred)

    # Train the model with the new sample
    model.learn_one(x, y)

Answer 2

It is not clear from your question which step of the machine learning workflow is slow for you. As noted in the manual from riverml and in this post, sklearn has an option to do a partial fit. You will be restricted in the models you can use for this kind of incremental learning.

So, using your example, suppose we use a stochastic gradient descent classifier:

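import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(100000)
# Logistic regression loss; it was spelled 'log' before scikit-learn 1.1
clf = SGDClassifier(loss='log_loss')
# partial_fit needs the full list of classes on the first call
all_classes = np.unique(y)

# Feed the model 100 batches of row indices, one batch at a time
for ix in np.split(np.arange(0, X.shape[0]), 100):
    clf.partial_fit(X[ix, :], y[ix], classes=all_classes)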
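Since the question is about data that does not fit in memory, here is a minimal sketch of the same partial_fit idea with the batches read from disk. It assumes the training set is a CSV file with a column named "label"; the file name, the column name and the chunk size are made up for illustration, and the demo file is generated on the spot so the snippet runs on its own.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Build a demo CSV on disk; in practice this file would already exist.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
df = pd.DataFrame(X)
df["label"] = y  # hypothetical column name
path = os.path.join(tempfile.mkdtemp(), "train.csv")  # hypothetical path
df.to_csv(path, index=False)

clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])  # every class must be known before the first call

# pandas yields one DataFrame of 1,000 rows at a time,
# so only one chunk is held in memory.
n_chunks = 0
for chunk in pd.read_csv(path, chunksize=1_000):
    y_chunk = chunk.pop("label").to_numpy()
    clf.partial_fit(chunk.to_numpy(), y_chunk, classes=all_classes)
    n_chunks += 1

print(n_chunks, clf.score(X, y))
```

The list of classes has to be supplied up front because a single chunk is not guaranteed to contain every label.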

Answer 3

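After reading section 6, "Strategies to scale computationally: bigger data", of the official manual that @StupidWolf mentioned in this post, I realized that there is more to this question than it first appears.

The real difficulty lies in the design of many models.

Take Random Forest as an example. Compared with a simpler decision tree, one of the most important techniques used to improve its performance is bagging: the algorithm has to draw random samples from the whole dataset to build several weak learners that form the basis of the forest. That means feeding the model one sample after another does not work with this design.

scikit-learn could still define an interface for end users to implement, so that it could pick random samples by calling that interface, and end users would decide how their implementation returns the required data by scanning the dataset on the hard drive. However, this is far more complicated than I initially thought, and the performance gain may not be significant, given that IO-heavy "full table scans" (in database terms) would frequently be needed.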
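To make the idea of such an interface concrete, here is a rough sketch of hand-rolled bagging over data that stays on disk. It is not a scikit-learn API: fetch_rows is a hypothetical user-side callback, the data is assumed to be stored as .npy files opened with NumPy memory mapping, and the number of trees and the sample size are arbitrary.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Demo data written to disk; in practice the .npy files would already exist.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
tmp = tempfile.mkdtemp()
np.save(os.path.join(tmp, "X.npy"), X)
np.save(os.path.join(tmp, "y.npy"), y)

# Memory-mapped arrays: rows are read from disk only when they are indexed.
X_disk = np.load(os.path.join(tmp, "X.npy"), mmap_mode="r")
y_disk = np.load(os.path.join(tmp, "y.npy"), mmap_mode="r")


def fetch_rows(indices):
    """Hypothetical user-side callback: return the requested rows from disk."""
    idx = np.sort(indices)  # sorted access keeps the disk reads more sequential
    return np.asarray(X_disk[idx]), np.asarray(y_disk[idx])


rng = np.random.default_rng(0)
trees = []
for i in range(10):
    # Bootstrap sample: random row indices drawn with replacement.
    sample = rng.integers(0, X_disk.shape[0], size=1_000)
    X_s, y_s = fetch_rows(sample)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_s, y_s))

# Majority vote over the trees (labels are 0/1 in this demo).
# The in-memory X is reused here only to keep the demo short.
votes = np.mean([t.predict(X) for t in trees], axis=0)
y_pred = (votes >= 0.5).astype(int)
print((y_pred == y).mean())
```

Every bootstrap sample touches rows scattered across the whole file, which is the IO cost described above.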


python

scikit-learn