If I am using Dask-Jobqueue on an HPC, do I still need to use Dask-ML to run scikit-learn code?

If I am using Dask-Jobqueue on an HPC (a high-performance computing cluster), do I still need to use Dask-ML (i.e. joblib.parallel_backend('dask')) to run my scikit-learn code?

Suppose I have the following code:
from dask_jobqueue import PBSCluster
cluster = PBSCluster(cores=36,
                     memory='100GB',
                     project='P48500028',
                     queue='premium',
                     interface='ib0',
                     walltime='02:00:00')
cluster.scale(100)
from dask.distributed import Client
client = Client(cluster)
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
X, y = make_classification(n_samples=2000, n_features=20, n_classes=2, random_state=0)
param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}
grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           iid=True,
                           cv=3,
                           n_jobs=-1)
import joblib
with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)
Since I am using Dask-Jobqueue on the HPC (i.e. I am connected to an instance of the HPC), is all of my code already distributed across a cluster when I run it (because I specified cluster.scale(100))? If so, do I still need the last 3 lines above that use Dask-ML? Or could my code instead be this:
from dask_jobqueue import PBSCluster
cluster = PBSCluster(cores=36,
                     memory='100GB',
                     project='P48500028',
                     queue='premium',
                     interface='ib0',
                     walltime='02:00:00')
cluster.scale(100)
from dask.distributed import Client
client = Client(cluster)
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
X, y = make_classification(n_samples=2000, n_features=20, n_classes=2, random_state=0)
param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}
grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           iid=True,
                           cv=3,
                           n_jobs=-1)
grid_search.fit(X, y)
Will the last line of code above, grid_search.fit(X, y), no longer run on any Dask cluster now that I have removed joblib.parallel_backend('dask')? Or will it still run on the cluster because I declared cluster.scale(100) earlier?

Many thanks.
Will the last line of code above grid_search.fit(X, y) not run on any Dask cluster since I have removed joblib.parallel_backend('dask')?
Correct. Scikit-Learn needs to be told to use Dask.
Or will it still run on a cluster since I have earlier on declared cluster.scale(100)?
No. Dask does not parallelize your code automatically. You need to tell Scikit-Learn to use Dask, either with the joblib context manager (with joblib.parallel_backend('dask'):) or by using the equivalent dask_ml GridSearchCV object.
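For reference, here is a minimal sketch of the second option, assuming dask_ml is installed and reusing the PBSCluster from the question. dask_ml.model_selection.GridSearchCV schedules the cross-validation fits on the cluster attached to the active Client, so no joblib context manager is needed:

from dask.distributed import Client
from dask_ml.model_selection import GridSearchCV  # Dask-backed counterpart of sklearn's GridSearchCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

client = Client(cluster)  # `cluster` is the PBSCluster created earlier

X, y = make_classification(n_samples=2000, n_features=20, n_classes=2, random_state=0)
param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}

# The search is scheduled on the Dask cluster via the active Client;
# no joblib.parallel_backend('dask') block is required here.
grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           cv=3)
grid_search.fit(X, y)

Either route should behave the same way here; the key point is that something must hand the work to Dask explicitly, because cluster.scale(100) only provisions workers, it does not change how scikit-learn runs.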