How to perform simple grid search with Apache Spark

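I tried to tune the hyperparameters of a logistic regression algorithm using Scikit Learn's GridSearch class.

However, GridSearch, even when running multiple jobs in parallel, takes literally days to process unless you are tuning only one parameter. I thought of using Apache Spark to speed up the process, but I have two questions.

I have read the documentation, but it talks about running parallel workers over an entire machine learning pipeline, whereas I only want to use it for parameter tuning.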

Imports

import datetime
%matplotlib inline

import pylab
import pandas as pd
import math
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.pylab as pylab

import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

from sklearn import datasets, tree, metrics, model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression, LinearRegression, Perceptron
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold, RFE
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB

import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()

from datetime import datetime as dt
import scipy
import itertools

ucb_w_reindex = pd.read_csv('clean_airbnb.csv')
ucb = pd.read_csv('clean_airbnb.csv')

pylab.rcParams[ 'figure.figsize' ] = 15 , 10
plt.style.use("fivethirtyeight")

new_style = {'grid': False}
plt.rc('axes', **new_style)

Algorithm hyperparameter tuning

X = ucb.drop('country_destination', axis=1)
y = ucb['country_destination'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state=42, stratify=y)

knn = KNeighborsClassifier()

parameters = {'leaf_size': range(1, 100), 'n_neighbors': range(1, 10), 'weights': ['uniform', 'distance'], 
              'algorithm': ['kd_tree', 'ball_tree', 'brute', 'auto']}


# ======== What I want to do in Apache Spark ========= #

%%time
parameters = {'n_neighbors': range(1, 100)}
clf1 = GridSearchCV(estimator=knn, param_grid=parameters, n_jobs=5).fit(X_train, y_train)
best = clf1.best_estimator_

# ==================================================== #
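For a sense of scale, the size of the first `parameters` grid above (the one that is later overwritten by the `n_neighbors`-only grid) can be counted directly. This is a small sketch rather than part of the original notebook, and the fold count of 5 is an assumption: the default `cv` of `GridSearchCV` depends on the scikit-learn version.

```python
from math import prod

# The full grid from the question, before it is replaced by the smaller one.
parameters = {
    'leaf_size': range(1, 100),
    'n_neighbors': range(1, 10),
    'weights': ['uniform', 'distance'],
    'algorithm': ['kd_tree', 'ball_tree', 'brute', 'auto'],
}

# Number of distinct combinations = product of the number of options per parameter.
n_combinations = prod(len(values) for values in parameters.values())

# Assumption: 5-fold cross-validation, so each combination is fitted once per fold
# (the final refit on the whole training set is not counted here).
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)
```

Every extra parameter multiplies the number of fits, which is why a search over one parameter finishes quickly while the full grid does not.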

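You can use a library called spark-sklearn to run distributed parameter sweeps. You are correct that you will need either a cluster of machines or a single multi-CPU machine to get a parallel speedup.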
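The idea behind a distributed parameter sweep can be sketched by hand with plain PySpark: expand the grid into a list of parameter dicts, score each one on a worker, and keep the best. The helper names `expand_grid` and `sweep` and the `score_fn` callback are hypothetical and only for illustration; this is not a description of how spark-sklearn is implemented. When no SparkContext is passed, the sketch falls back to a local loop.

```python
from itertools import product


def expand_grid(param_grid):
    """Turn {'a': [1, 2], 'b': ['x']} into [{'a': 1, 'b': 'x'}, {'a': 2, 'b': 'x'}]."""
    names = sorted(param_grid)
    return [dict(zip(names, combo))
            for combo in product(*(param_grid[name] for name in names))]


def sweep(score_fn, param_grid, sc=None):
    """Score every combination and return (best_params, best_score).

    score_fn(params) -> float, higher is better (hypothetical callback).
    sc: optional pyspark.SparkContext; if None, evaluation runs locally.
    """
    candidates = expand_grid(param_grid)
    if sc is not None:
        # Spark splits the candidates across partitions and scores them on the
        # workers; score_fn and everything it closes over must be picklable.
        scores = sc.parallelize(candidates).map(score_fn).collect()
    else:
        scores = [score_fn(params) for params in candidates]
    best_score, best_params = max(zip(scores, candidates), key=lambda pair: pair[0])
    return best_params, best_score
```

With scikit-learn, `score_fn` could for instance build a `KNeighborsClassifier(**params)` and return the mean of `cross_val_score` on the training data; the data then travels to the workers inside the closure, which is only reasonable when it fits in memory. With spark-sklearn itself, the project README shows its own `GridSearchCV` taking the SparkContext as the first argument, along the lines of `GridSearchCV(sc, knn, parameters)`; check the version you install, since the library targets older scikit-learn releases.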

Hope this helps,

Roope - Microsoft MMLSpark Team