How to perform simple grid search with Apache Spark
I am trying to use Scikit Learn's GridSearch class to tune the hyperparameters of a logistic regression algorithm.
However, GridSearch takes days to run even with multiple jobs in parallel, unless I tune only a single parameter. I thought about using Apache Spark to speed this up, but I have two questions.
Do you really need multiple machines to distribute the workload when using Apache Spark? For example, if you only have a laptop, is using Apache Spark pointless?
Is there a simple way to use Scikit Learn's GridSearch with Apache Spark?
I have read the documentation, but it talks about running parallel workers across an entire machine learning pipeline, whereas I only want to use it for parameter tuning.
Imports
import datetime
%matplotlib inline
import pylab
import pandas as pd
import math
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.pylab as pylab
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn import datasets, tree, metrics, model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression, LinearRegression, Perceptron
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold, RFE
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()
from datetime import datetime as dt
import scipy
import itertools
ucb_w_reindex = pd.read_csv('clean_airbnb.csv')
ucb = pd.read_csv('clean_airbnb.csv')
pylab.rcParams['figure.figsize'] = 15, 10
plt.style.use("fivethirtyeight")
new_style = {'grid': False}
plt.rc('axes', **new_style)
Algorithm hyperparameter tuning
X = ucb.drop('country_destination', axis=1)
y = ucb['country_destination'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state=42, stratify=y)
knn = KNeighborsClassifier()
parameters = {'leaf_size': range(1, 100), 'n_neighbors': range(1, 10), 'weights': ['uniform', 'distance'],
'algorithm': ['kd_tree', 'ball_tree', 'brute', 'auto']}
# ======== What I want to do in Apache Spark ========= #
%%time
parameters = {'n_neighbors': range(1, 100)}
clf1 = GridSearchCV(estimator=knn, param_grid=parameters, n_jobs=5).fit(X_train, y_train)
best = clf1.best_estimator_
# ==================================================== #
You can use a library called spark-sklearn to run a distributed parameter sweep. You are correct that you need either a cluster of machines or a single machine with multiple CPUs to get a parallel speedup.
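As a rough sketch (assuming spark-sklearn is installed, e.g. via pip install spark-sklearn, and reusing the knn estimator, the sc SparkContext, and the training data from your code above), the grid search could be rewritten along these lines:

# Sketch only: spark-sklearn's GridSearchCV mirrors the scikit-learn API,
# but takes a SparkContext and evaluates each parameter combination on a Spark worker.
# Aliased to avoid clashing with sklearn.model_selection.GridSearchCV imported above.
from spark_sklearn import GridSearchCV as SparkGridSearchCV

parameters = {'n_neighbors': range(1, 100)}
clf1 = SparkGridSearchCV(sc, knn, param_grid=parameters).fit(X_train, y_train)
best = clf1.best_estimator_

Note that on a single laptop this only helps if the SparkContext runs in local mode with several cores (for example local[*]), which gives you roughly the same parallelism as n_jobs in plain scikit-learn; the larger speedups come from pointing sc at a multi-machine cluster.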
Hope this helps,
Roope - Microsoft MMLSpark team