If I am using Dask-Jobqueue on an HPC, do I still need to use Dask-ML to run scikit-learn code?

If I am using Dask-Jobqueue on a high-performance computer (HPC), do I still need Dask-ML (i.e., joblib.parallel_backend('dask')) to run scikit-learn code?

Suppose I have the following code:

from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=36,   
                     memory='100GB',   
                     project='P48500028',   
                     queue='premium',   
                     interface='ib0',
                     walltime='02:00:00')

cluster.scale(100)  
                   
from dask.distributed import Client
client = Client(cluster)   


from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, n_classes=2, random_state=0)

param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}

grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           iid=True,
                           cv=3,
                           n_jobs=-1)


import joblib

with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)

Since I am using Dask-Jobqueue on the HPC (i.e., I am connected to an instance on the HPC), when I run my code, is everything already distributed across a cluster (because I specified cluster.scale(100))? If so, do I still need the last three lines above that use Dask-ML? Or could my code simply be:

from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=36,   
                     memory='100GB',   
                     project='P48500028',   
                     queue='premium',   
                     interface='ib0',
                     walltime='02:00:00')

cluster.scale(100)  
                   
from dask.distributed import Client
client = Client(cluster)   


from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, n_classes=2, random_state=0)

param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}

grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           iid=True,
                           cv=3,
                           n_jobs=-1)

grid_search.fit(X, y)

Since I have removed joblib.parallel_backend('dask'), will the last line, grid_search.fit(X, y), no longer run on any Dask cluster? Or will it still run on the cluster because I declared cluster.scale(100) earlier?

Many thanks.

Will the last line of code above, grid_search.fit(X, y), not run on any Dask cluster since I have removed joblib.parallel_backend('dask')?

Correct. Scikit-Learn needs to be told to use Dask.
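To make the mechanics concrete, here is a small sketch using joblib's built-in 'threading' backend in place of 'dask' (the principle is identical, and nothing declared earlier — cluster.scale(100) included — changes which backend joblib uses). Note that get_active_backend is joblib's internal helper for inspecting the current backend, used here only for illustration:

```python
import joblib
from joblib.parallel import get_active_backend

# Outside any parallel_backend context, joblib uses its default backend
# (local processes), regardless of any Dask cluster that may exist.
default_backend, _ = get_active_backend()

with joblib.parallel_backend("threading"):
    # Only inside the context manager does the active backend change,
    # so only Parallel calls made here use the requested backend.
    inner_backend, _ = get_active_backend()

# Once the context exits, joblib reverts to the default backend.
after_backend, _ = get_active_backend()

print(type(inner_backend).__name__)  # ThreadingBackend
```

The same scoping applies to 'dask': grid_search.fit(X, y) only dispatches to the cluster while it executes inside the with joblib.parallel_backend('dask'): block.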

Or will it still run on a cluster since I have earlier on declared cluster.scale(100)?

No. Dask will not parallelize your code automatically. You need to tell Scikit-Learn to use Dask, either with the joblib context manager or by using the equivalent dask_ml GridSearchCV object.