Estimate number of clusters through EM in scikit-learn

I am trying to implement the method Weka's EM uses to estimate the number of clusters, more precisely the following description:

The cross validation performed to determine the number of clusters is done in the following steps:

  1. the number of clusters is set to 1
  2. the training set is split randomly into 10 folds.
  3. EM is performed 10 times using the 10 folds the usual CV way.
  4. the loglikelihood is averaged over all 10 results.
  5. if loglikelihood has increased the number of clusters is increased by 1 and the program continues at step 2.

My current implementation looks like this:

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def estimate_n_clusters(X):
    "Find the best number of clusters through maximization of the log-likelihood from EM."
    last_log_likelihood = None
    kf = KFold(n_splits=10, shuffle=True)
    for n_components in range(1, 50):
        gm = GaussianMixture(n_components=n_components)

        log_likelihood_list = []
        for train, test in kf.split(X):
            gm.fit(X[train, :])
            if not gm.converged_:
                raise Warning("GM not converged")
            log_likelihood = np.log(-gm.score_samples(X[test, :]))

            log_likelihood_list += log_likelihood.tolist()

        # average the held-out likelihood over all 10 folds
        avg_log_likelihood = np.average(log_likelihood_list)

        if last_log_likelihood is None:
            last_log_likelihood = avg_log_likelihood
        elif avg_log_likelihood + 10E-6 <= last_log_likelihood:
            return n_components
        last_log_likelihood = avg_log_likelihood
I get a similar number of clusters from Weka and from my function. However, using the estimated number of clusters n_clusters,

gm = GaussianMixture(n_components=n_clusters).fit(X)
print(np.log(-gm.score(X)))

prints NaN, because -gm.score(X) yields a negative result (around -2500), whereas Weka reports Log likelihood: 347.16447.
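The NaN itself simply comes from np.log being applied to a negative number; NumPy emits a RuntimeWarning and returns nan:

import numpy as np

# the real-valued log of a negative number is undefined:
# NumPy warns ("invalid value encountered in log") and returns nan
print(np.log(-2500.0))   # nan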

My guess is that the likelihood referred to in step 4 of Weka's procedure is not the same quantity as the one returned by score_samples().

Can anyone tell me where I went wrong?

Thanks

According to the documentation, score returns the average log-likelihood. Obviously, you don't want to take the log of a log-likelihood.
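Concretely (per the scikit-learn documentation, score_samples returns the per-sample log-likelihood and score its average over the samples):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = rng.randn(200, 2)          # toy data, purely for illustration
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# score(X) is the mean of score_samples(X): both are already
# log-likelihoods, so taking another log (or negating) is meaningless
assert np.isclose(gm.score(X), gm.score_samples(X).mean())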

For future reference, the fixed function looks like this:

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def estimate_n_clusters(X):
    "Find the best number of clusters through maximization of the log-likelihood from EM."
    last_log_likelihood = None
    kf = KFold(n_splits=10, shuffle=True)
    for n_components in range(1, 50):
        gm = GaussianMixture(n_components=n_components)

        log_likelihood_list = []
        for train, test in kf.split(X):
            gm.fit(X[train, :])
            if not gm.converged_:
                raise Warning("GM not converged")
            # score_samples already returns the per-sample log-likelihood,
            # so no further transformation (log or negation) is needed
            log_likelihood = gm.score_samples(X[test, :])

            log_likelihood_list += log_likelihood.tolist()

        avg_log_likelihood = np.average(log_likelihood_list)
        print(avg_log_likelihood)

        if last_log_likelihood is None:
            last_log_likelihood = avg_log_likelihood
        # stop once the averaged log-likelihood no longer increases and
        # return the previous (last improving) number of components
        elif avg_log_likelihood + 10E-6 <= last_log_likelihood:
            return n_components - 1
        last_log_likelihood = avg_log_likelihood
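As a quick sanity check (a sketch only; make_blobs and its parameters below are illustrative assumptions, not part of the original question):

from sklearn.datasets import make_blobs

# three well-separated Gaussian blobs; the estimate should come out around 3
X, _ = make_blobs(n_samples=600, centers=3, random_state=42)
print(estimate_n_clusters(X))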