Clustering data with Python based on their correlation

Question

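I would like to cluster the following data set into two clusters, one for each stroke of the "X" (the "\" and the "/"). I was thinking this could be done using the Pearson correlation coefficient as the distance metric in Scikit-learn agglomerative clustering, as shown here (How to use Pearson Correlation as distance metric in Scikit-learn Agglomerative clustering). But it does not seem to work.

Plot of the raw data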

Data:
-6.5955882 11.344538
-6.1911765 12.027311
-5.4191176 10.346639
-4.7573529 7.5105042
-2.9191176 7.7205882
-1.5955882 6.6176471
-2.9558824 6.039916
-1.1544118 3.9915966
-0.088235294 4.7794118
-0.088235294 2.8361345
0.53676471 -1.2079832
2.7794118 0
3.4044118 -4.3592437
5.2794118 -3.9915966
6.75 -8.5609244
7.4485294 -6.8802521
5.1691176 -5.7247899
-7.1470588 -2.8361345
-6.7058824 -1.2605042
-4.4264706 -1.1554622
-3.5073529 0.78781513
-0.86029412 0.31512605
-1.0808824 2.1533613
-2.8823529 -0.42016807
1.0514706 2.2584034
1.9338235 4.4117647
4.6544118 5.5147059
3.7352941 7.0378151
6.0147059 8.2457983
7.0808824 7.7205882

The code I tried:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.stats import pearsonr

nc=2
data = np.loadtxt("cross-data_2.dat")
plt.scatter(data[:,0], data[:,1], s=100, cmap='viridis')

def pearson_affinity(M):
   return 1 - np.array([[pearsonr(a,b)[0] for a in M] for b in M])

hc = AgglomerativeClustering(n_clusters=nc, affinity = pearson_affinity, linkage = 'average')
y_hc = hc.fit_predict(data)

plt.figure()
plt.scatter(data[y_hc ==0,0], data[y_hc == 0,1], s=100, c='red')
plt.scatter(data[y_hc==1,0], data[y_hc == 1,1], s=100, c='black')

plt.show()

Clustering result:

Is there something wrong with the code, or should I use a different method?
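
One possible reason the attempt misbehaves, offered as a sketch rather than a definitive diagnosis: pearson_affinity correlates pairs of rows, and each row of data is a single point with only two coordinates. The Pearson correlation of two 2-element vectors can only be +1 or -1 (or undefined when a point has x equal to y), so the distance 1 - r carries no information about which stroke a point lies on. The helper pearson below is a hypothetical stand-in for scipy.stats.pearsonr, written out in NumPy for illustration; the three points are taken from the data above.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two 1-D vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Three points copied from the data set in the question
p1 = (-6.5955882, 11.344538)  # upper-left end of the "\" stroke
p2 = (6.75, -8.5609244)       # lower-right end of the same "\" stroke
p3 = (7.0808824, 7.7205882)   # upper-right end of the "/" stroke

r12 = pearson(p1, p2)
r13 = pearson(p1, p3)

print("same stroke:      r =", round(r12, 6), " distance =", round(1 - r12, 6))
print("different stroke: r =", round(r13, 6), " distance =", round(1 - r13, 6))
```

In this sketch p1 and p2 lie on the same stroke yet end up at distance 2, while p1 and p3 lie on different strokes and end up at distance 0. With only two coordinates per point, the sign of r depends only on whether x is larger or smaller than y for each point.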

Answer 1

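I can suggest an alternative way of doing this. Since you are trying to cluster points that lie along the same angle, we can first transform the data to polar coordinates (r-theta) and then use a simple KMeans clustering.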

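import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

x = np.loadtxt("cross-data_2.dat")  # assumed: same file as in the question

# Polar coordinates; arctan (not arctan2) keeps theta in (-pi/2, pi/2),
# so the two opposite arms of each stroke get the same angle
r = np.sqrt(x[:, 0]**2 + x[:, 1]**2)
theta = np.arctan(x[:, 1]/x[:, 0])
xr = np.vstack((r*np.sin(theta), r*np.cos(theta))).T

km = KMeans(n_clusters=2)
xx = km.fit_predict(xr)

plt.scatter(x[:, 0], x[:, 1], c=xx)
plt.show()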
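
To see why the transformation in Answer 1 can help, here is a minimal sketch on synthetic data. The noisy X-shaped points, the seed, and the variable names are assumptions made for illustration and are not the question's data. Because np.arctan returns angles in (-pi/2, pi/2), a point and its mirror image through the origin receive the same theta, so each stroke of the X collapses onto a single ray, and the new coordinates reduce to (y*sign(x), |x|).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic X shape: two noisy strokes, leaving out the ambiguous centre
t = np.concatenate([np.linspace(-8, -2, 13), np.linspace(2, 8, 13)])
slash = np.column_stack((t, t))        # "/" stroke, y = x
backslash = np.column_stack((t, -t))   # "\" stroke, y = -x
pts = np.vstack((slash, backslash))
pts = pts + rng.normal(scale=0.3, size=pts.shape)
truth = np.repeat([0, 1], len(t))      # 0 = "/", 1 = "\"

# Same transformation as in the answer
r = np.hypot(pts[:, 0], pts[:, 1])
theta = np.arctan(pts[:, 1] / pts[:, 0])
folded = np.column_stack((r * np.sin(theta), r * np.cos(theta)))

# The transformation is equivalent to folding the left half-plane onto the right
same_as_fold = np.allclose(
    folded,
    np.column_stack((pts[:, 1] * np.sign(pts[:, 0]), np.abs(pts[:, 0]))),
)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(folded)

# Cluster labels are arbitrary, so compare up to swapping 0 and 1
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
print("equivalent to fold:", same_as_fold)
print("agreement with true strokes:", agreement)
```

After folding, the two strokes differ in the sign of the first coordinate, which is a separation KMeans handles well. Note that the division by x fails for points with x exactly 0, so real data may need a guard for that case.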

Answer 2

I suggest another approach for this: Gaussian Mixture Models.

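import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

X = np.loadtxt("cross-data_2.dat")  # your data, shape (n_samples, 2); file name taken from the question
gmm = GaussianMixture(n_components=2,
                      init_params='random',
                      n_init=5,
                      random_state=123)
y_pred = gmm.fit_predict(X)
plt.scatter(*X.T, c=y_pred)
plt.show()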
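
Answer 2 does not say why a mixture model suits this data. One plausible reading is that each component has its own full covariance matrix and can therefore stretch along one stroke of the X. The sketch below illustrates that idea on synthetic data; the points, the seed, and the eigenvector inspection are assumptions added for illustration and are not part of the original answer.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic X shape: two noisy strokes crossing at the origin
t = np.concatenate([np.linspace(-8, -2, 13), np.linspace(2, 8, 13)])
X = np.vstack((np.column_stack((t, t)), np.column_stack((t, -t))))
X = X + rng.normal(scale=0.3, size=X.shape)

gmm = GaussianMixture(n_components=2,
                      covariance_type="full",  # the default, stated explicitly
                      init_params="random",
                      n_init=5,
                      random_state=123)
y_pred = gmm.fit_predict(X)

# Direction of largest variance of each fitted component, in degrees
for k, cov in enumerate(gmm.covariances_):
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, -1]  # eigh returns eigenvalues in ascending order
    angle = np.degrees(np.arctan2(major[1], major[0])) % 180
    print(f"component {k}: major axis at about {angle:.0f} degrees")
```

If the fit recovers the two strokes, the printed angles would come out near 45 and 135 degrees. EM can settle in a local optimum, so this is not guaranteed, which is presumably why the answer sets n_init=5.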

python

cluster-analysis

correlation

scikit-learn

data-science