K Means based on mixed type dataframe

Question

I have the following dataset, and I would like to apply clustering (specifically k-means) to it.

     id      category     value
0    122         A          3
1    122         B          4
2    122         C          9
3    145         A          19
4    145         B          22
5    145         C          90
.
.
. 
197    225         A          16 
198    225         B          17
199    225         C          12

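What I want to do is create clusters of ids. Each cluster should contain ids that are similar to one another according to some similarity measure computed from their category values.

For example: C1 {122, 145, 148}, C2 {225, 222, 221}, ...

Any idea how to approach this kind of problem?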

Answer 1

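I am assuming there are categories from A to Z and that many rows belong to the same category. The K-means algorithm works as described below. From your question it is not clear what the similarity measure is. I will update my answer once the clustering objective is clearer to me.

Update: After looking at the data again and noting @Anony-Mousse's comment, I am assuming the problem is: given three categories A, B, C with their respective values and a label (Id), cluster the Ids according to some similarity measure (it could be Euclidean distance, cosine distance, or something else). I am updating my previous answer to match this assumption.

Parse the data and generate three numeric or one-hot encoded features representing the values of categories A, B and C for each Id.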

K: input

Initialize 3-dimensional cluster centroids U1 to Uk randomly.

Repeat until convergence:

For each Id, compute the Euclidean distance between its (A, B, C) feature vector and every cluster centroid, and assign the Id to the cluster whose centroid is nearest.

For each cluster, recompute its centroid by averaging the features of all the samples (Ids) assigned to it.

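Convergence can be declared when the cluster centroids stop changing, or when every centroid moves by less than a small tolerance supplied as input.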
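The steps above could be sketched in NumPy roughly as follows. This is only an illustrative sketch, not a reference implementation: the function name `kmeans`, the parameters `tol`, `max_iter` and `seed`, and the choice to initialize centroids from randomly picked samples are all assumptions made here, and two of the four sample rows are made up for the demo.

```python
import numpy as np

def kmeans(X, k, tol=1e-6, max_iter=100, seed=0):
    """Plain k-means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids once, before the loop, from k distinct samples.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Euclidean distance from every sample to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each sample to its nearest centroid.
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its members;
        # keep the old centroid if a cluster ends up empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Largest distance any centroid moved in this iteration.
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Rows 0 and 1 are Ids 122 and 145 from the question; rows 2 and 3 are made up.
X = np.array([[3, 4, 9], [19, 22, 90], [4, 5, 8], [20, 25, 88]], dtype=float)
labels, centroids = kmeans(X, k=2)
```

Here `labels[i]` is the cluster index of row `i`, and `centroids` holds one (A, B, C) mean per cluster. In practice a library implementation is usually preferable, since it also handles repeated restarts and smarter initialization.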

Answer 2

Pivot your data into a suitable shape:

Your categories should be columns, not separate rows.

     id          A          B         C
1    122         3          4         9
2    145         19         22        90
..

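Don't forget to exclude the ID column from the analysis! Never include IDs when clustering. For the analysis, your data should contain only the columns A, B, C, with one row per ID. You then have an n x 3 matrix, and you can use k-means on it just fine.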
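A minimal sketch of this reshaping with pandas, followed by scikit-learn's `KMeans`, might look like the following. The rows for id 148 are made up for illustration, and `n_clusters=2` is an arbitrary choice for this toy data rather than something the question specifies.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Long-format data shaped like the question; the values for id 148 are made up.
df = pd.DataFrame({
    "id":       [122, 122, 122, 145, 145, 145, 148, 148, 148, 225, 225, 225],
    "category": ["A", "B", "C"] * 4,
    "value":    [3, 4, 9, 19, 22, 90, 18, 25, 85, 16, 17, 12],
})

# One row per id, one column per category.
# The id becomes the index, so it is not used as a feature.
wide = df.pivot(index="id", columns="category", values="value")

# Cluster the n x 3 matrix of A, B, C values only.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = pd.Series(km.fit_predict(wide.to_numpy()), index=wide.index, name="cluster")

# Collect the ids that fall into each cluster.
clusters = labels.groupby(labels).groups
```

`labels` maps each id to its cluster, and `clusters` groups the ids per cluster, which is the C1 {...}, C2 {...} shape asked for in the question. If an id can have several rows for the same category, `pivot_table` with an aggregation function would be needed instead of `pivot`.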


Tags: python, cluster-analysis, k-means, pandas