Custom k-means clustering GridSearchCV
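I am trying to find the 'best' value of k for k-means clustering by using a pipeline in which I apply a standard scaler, followed by a custom k-means transformer, and finally a decision tree classifier. I then run a grid search over this pipeline to get the best value of k. I am using Python 3.7 and sklearn.
My code is as follows: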
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


class KMeansTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, **kwargs):
        # The purpose of 'self.model' is to contain the
        # underlying cluster model-
        self.model = KMeans(**kwargs)

    def fit(self, X):
        self.X = X
        self.model.fit(X)

    def transform(self, X):
        pred = self.model.predict(X)
        return np.hstack([self.X, pred.reshape(-1, 1)])

    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)


# Create features and target-
X, y = make_blobs(n_samples=100, n_features=2, centers=3)

# Get shape/dimension-
X.shape, y.shape
# ((100, 2), (100,))

# Create another pipeline using Decision Tree as classifier-
pipe_dt = Pipeline(
    [
        ('sc', StandardScaler()),
        ('kmt', KMeansTransformer()),
        ('dt_clf', DecisionTreeClassifier())
    ]
)

# Train defined pipeline-
pipe_dt.fit(X, y)

# Get accuracy score of pipeline-
pipe_dt.score(X, y)
# 1.0

# Make predictions using pipeline defined above-
y_pred_dt = pipe_dt.predict(X)

# Perform hyperparameter search/optimization using 'GridSearchCV'-
# Specify parameters to be hyper-tuned-
params = {
    'n_clusters': [2, 3, 5, 7]
}

# Initialize GridSearchCV() object using 3-fold CV-
grid_kmt = GridSearchCV(param_grid=params, estimator=pipe_dt, cv=3)

# Perform GridSearchCV on training data-
grid_kmt.fit(X, y)
When I call 'grid_kmt.fit(X, y)', I get the following error:
ValueError: Invalid parameter n_clusters for estimator
Pipeline(memory=None,
         steps=[('sc',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('kmt', KMeansTransformer()),
                ('dt_clf',
                 DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features=None, max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        presort='deprecated', random_state=None,
                                        splitter='best'))],
         verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
However, when I initialize an object of my custom k-means transformer:
# Initialize a new clustering object-
km = KMeansTransformer(n_clusters=3, init='k-means++')
# Get the list of available parameters-
km.get_params().keys()
# dict_keys([])
So why am I getting the 'ValueError'? n_clusters is right there in the list of available parameters of the custom clustering object.
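Look closely at the error message:
ValueError: Invalid parameter n_clusters for estimator Pipeline [...]
Clearly, your GridSearchCV is looking for a parameter n_clusters on the pipeline itself (rather than on one of its components), does not find one, and raises the error. To address the n_clusters parameter of the ('kmt', KMeansTransformer()) step correctly, you should use
params = {
    'kmt__n_clusters': [2, 3, 5, 7]  # two underscores
}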
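The naming rule can be checked directly by listing the parameter names a pipeline exposes, as the error message suggests. The following is a minimal sketch that uses only stock scikit-learn estimators; the two-step pipeline `pipe` is an illustrative stand-in, not the pipeline from the question.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Illustrative two-step pipeline (an assumption, not the asker's pipeline).
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('dt_clf', DecisionTreeClassifier()),
])

# Every name returned here is a valid key for GridSearchCV's param_grid.
param_names = sorted(pipe.get_params().keys())

# Parameters of a step are exposed as '<step name>__<parameter name>'.
nested = [name for name in param_names if '__' in name]
print(nested)

# A bare parameter name is not known to the pipeline; the prefixed one is.
print('max_depth' in param_names)          # False
print('dt_clf__max_depth' in param_names)  # True
```

Any key passed in `param_grid` has to appear in this list, which is why a bare `n_clusters` is rejected by the pipeline.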
This assumes, of course, that your own KMeansTransformer actually accepts an n_clusters parameter.
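That last condition matters here: the `dict_keys([])` output in the question indicates that a transformer whose `__init__` only takes `**kwargs` exposes no parameters through `get_params()`, so `kmt__n_clusters` would most likely still be rejected. One possible way around this, sketched below under that assumption, is to declare `n_clusters` as an explicit constructor argument and build the `KMeans` model inside `fit`. The `random_state` argument, the fixed `n_init=10`, and the choice to stack the cluster label onto the `X` passed to `transform` (instead of the stored training data) are changes made for this sketch, not part of the original code.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


class KMeansTransformer(BaseEstimator, TransformerMixin):
    """Appends the k-means cluster label as an extra feature column."""

    def __init__(self, n_clusters=8, random_state=None):
        # Explicit arguments stored under the same names are what
        # BaseEstimator.get_params() / set_params() can discover.
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X, y=None):
        # Build the model here so that values set by the grid search are used.
        self.model_ = KMeans(n_clusters=self.n_clusters, n_init=10,
                             random_state=self.random_state)
        self.model_.fit(X)
        return self

    def transform(self, X):
        # Use the X passed in, so held-out folds of any size work.
        labels = self.model_.predict(X)
        return np.hstack([X, labels.reshape(-1, 1)])


X, y = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)

pipe_dt = Pipeline([
    ('sc', StandardScaler()),
    ('kmt', KMeansTransformer(random_state=0)),
    ('dt_clf', DecisionTreeClassifier(random_state=0)),
])

# Step name, two underscores, parameter name.
params = {'kmt__n_clusters': [2, 3, 5, 7]}

grid_kmt = GridSearchCV(estimator=pipe_dt, param_grid=params, cv=3)
grid_kmt.fit(X, y)
print(grid_kmt.best_params_)
```

With this version, `KMeansTransformer().get_params()` lists `n_clusters`, and the pipeline exposes it as `kmt__n_clusters`. `fit_transform` is inherited from `TransformerMixin`, so it does not need to be written by hand.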