Is there a way to partially cluster data which contains clusters but does not wholly consist of clusters?

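I have some 2D data (x, y), and I need to identify where many data points lie close to each other in the x direction. There are 3 obvious clusters in which all the x values are close together, and the rest of the data does not fall into them. I was going to use the k-means clustering algorithm, but that seems designed to assign every point to a cluster, whereas I only want to label the data in the 3 obvious clusters and mark the rest as normal data.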

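The data is in separate csv files, which I process and then read into one large dataframe. So far, while processing the data, I have filtered out files whose processed data exceeds a certain length, but this obviously means that sometimes part of a cluster is left out of the files or normal data is missed.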

You could try DBSCAN, which allows points to be classified as "noise" and seems to be what you're after. There's also a hierarchical version of it, affiliated with the scikit-learn project, known as hdbscan.

Google turns up various documents describing alternatives to k-means clustering. The hdbscan documentation also has a good description comparing the alternatives.
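As a rough illustration of the DBSCAN suggestion, here is a minimal sketch that clusters on the x column only and leaves everything else labelled as noise. The data is synthetic and purely hypothetical (three tight groups in x plus evenly spread "normal" points), and the `eps` and `min_samples` values are guesses chosen to suit that fake data; they would need tuning for the real dataframe.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Hypothetical stand-in data: three tight groups in x (50 points each)...
clustered_x = np.concatenate([
    rng.normal(loc=center, scale=0.05, size=50) for center in (10.0, 50.0, 90.0)
])
# ...plus 25 sparse "normal" points spaced 4 units apart in x.
normal_x = np.arange(0.0, 100.0, 4.0)

x = np.concatenate([clustered_x, normal_x])
df = pd.DataFrame({"x": x, "y": rng.uniform(0.0, 1.0, size=x.size)})

# Cluster on x only, since only closeness in the x direction matters here.
# eps (neighbourhood radius) and min_samples are assumptions to be tuned.
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(df[["x"]].to_numpy())

# DBSCAN marks points that belong to no cluster with the label -1.
df["cluster"] = labels
df["is_normal"] = df["cluster"] == -1

n_clusters = len(set(labels) - {-1})
print(n_clusters, int(df["is_normal"].sum()))
```

Unlike k-means, the number of clusters is not specified up front; it falls out of the density parameters, so it is worth checking that the real data yields the expected 3 clusters before relying on the labels.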

python

cluster-analysis

machine-learning

data-analysis