Estimate number of clusters through EM in scikit-learn
I am trying to implement the cluster-number estimation method used by EM in Weka, more precisely the following description:
The cross validation performed to determine the number of clusters is done in the following steps:
1. the number of clusters is set to 1
2. the training set is split randomly into 10 folds.
3. EM is performed 10 times using the 10 folds the usual CV way.
4. the loglikelihood is averaged over all 10 results.
5. if loglikelihood has increased the number of clusters is increased by 1 and the program continues at step 2.
My current implementation looks as follows:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.mixture import GaussianMixture

def estimate_n_clusters(X):
    """Find the best number of clusters through maximization of the log-likelihood from EM."""
    last_log_likelihood = None
    kf = KFold(n_splits=10, shuffle=True)
    components = range(1, 50)
    for n_components in components:
        gm = GaussianMixture(n_components=n_components)
        log_likelihood_list = []
        for train, test in kf.split(X):
            gm.fit(X[train, :])
            if not gm.converged_:
                raise Warning("GM not converged")
            # takes the log of the negated score_samples() output
            log_likelihood = np.log(-gm.score_samples(X[test, :]))
            log_likelihood_list += log_likelihood.tolist()
        avg_log_likelihood = np.average(log_likelihood_list)
        if last_log_likelihood is None:
            last_log_likelihood = avg_log_likelihood
        elif avg_log_likelihood + 10E-6 <= last_log_likelihood:
            return n_components
        last_log_likelihood = avg_log_likelihood
I get a similar number of clusters from Weka and from my function. However, using the estimated number of clusters n_clusters in

gm = GaussianMixture(n_components=n_clusters).fit(X)
print(np.log(-gm.score(X)))

results in NaN, because -gm.score(X) yields a negative value (around -2500), whereas Weka reports Log likelihood: 347.16447.
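A minimal reproduction of the NaN, using synthetic data from sklearn.datasets.make_blobs (not part of the original post): with tightly packed clusters the fitted density is very peaked, so the average log-likelihood is positive and np.log(-gm.score(X)) ends up taking the log of a negative number.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Very tight clusters -> peaked density -> positive average log-likelihood,
# mirroring the situation described above.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.01, random_state=0)
gm = GaussianMixture(n_components=3).fit(X)

print(gm.score(X))           # positive average log-likelihood
print(np.log(-gm.score(X)))  # log of a negative number -> nan (RuntimeWarning)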
My guess is that the likelihood mentioned in step 4 of the Weka description is not the same one computed by the function score_samples().
Can anyone tell me where I am going wrong?
Thanks
Judging by the docs, score returns the average log-likelihood. Obviously, you don't want to take the log of a log-likelihood.
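To make the relationship concrete (a quick check, not part of the original answer): score(X) is simply the mean of score_samples(X), so both already return log-likelihoods and neither needs a further np.log.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=2, random_state=0)
gm = GaussianMixture(n_components=2).fit(X)

# score() is the per-sample log-likelihood from score_samples(), averaged over X
assert np.isclose(gm.score(X), gm.score_samples(X).mean())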
For future reference, the fixed function looks like this:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.mixture import GaussianMixture

def estimate_n_clusters(X):
    """Find the best number of clusters through maximization of the log-likelihood from EM."""
    last_log_likelihood = None
    kf = KFold(n_splits=10, shuffle=True)
    components = range(1, 50)
    for n_components in components:
        gm = GaussianMixture(n_components=n_components)
        log_likelihood_list = []
        for train, test in kf.split(X):
            gm.fit(X[train, :])
            if not gm.converged_:
                raise Warning("GM not converged")
            # score_samples() already returns per-sample log-likelihoods;
            # no negation and no further np.log is needed.
            log_likelihood = gm.score_samples(X[test, :])
            log_likelihood_list += log_likelihood.tolist()
        avg_log_likelihood = np.average(log_likelihood_list)
        print(avg_log_likelihood)
        if last_log_likelihood is None:
            last_log_likelihood = avg_log_likelihood
        elif avg_log_likelihood + 10E-6 <= last_log_likelihood:
            # log-likelihood stopped increasing: the previous count was best
            return n_components - 1
        last_log_likelihood = avg_log_likelihood
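A hypothetical usage sketch (the make_blobs data is an illustrative assumption, not from the original answer):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

n_clusters = estimate_n_clusters(X)
print("estimated clusters:", n_clusters)

# Report the average log-likelihood directly; no extra np.log around score()
gm = GaussianMixture(n_components=n_clusters).fit(X)
print("average log-likelihood:", gm.score(X))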