SKLearn multi-class classification in Python without knowing the classes in advance

Question

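I have recently started using SKLearn, classification models in particular, and my question is more about use cases than about being stuck on a specific piece of code, so apologies in advance if this is not the right place to ask.

So far I have been working with sample data where the model is trained on data that has already been classified. Take the 'Iris' dataset: every row is labelled as one of three species. But what if I want to group/classify data whose classes are not known to begin with?

Let's take this made-up data: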

  Name  Feat_1  Feat_2  Feat_3  Feat_4
0    A      12    0.10       0    9734
1    B      76    0.03       1   10024
2    C      97    0.07       1    8188
3    D      32    0.21       1    6420
4    E      45    0.15       0    7723
5    F      61    0.02       1   14987
6    G      25    0.22       0    5290
7    H      49    0.30       0    7107

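If I wanted to split the names into 4 separate classifications using the different features, is that possible? Which SKLearn model would I need? I am not asking for any code; if someone can point me in the right direction, I am happy to research it myself. So far I have only been able to find examples where the classifications are already known.

In the example above, if I wanted to split the data into 4 classes, I would want my result to look something like this (note the new column, which indicates the class):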

  Name  Feat_1  Feat_2  Feat_3  Feat_4  Class
0    A      12    0.10       0    9734      4
1    B      76    0.03       1   10024      1
2    C      97    0.07       1    8188      3
3    D      32    0.21       1    6420      3
4    E      45    0.15       0    7723      2
5    F      61    0.02       1   14987      1
6    G      25    0.22       0    5290      4
7    H      49    0.30       0    7107      4

Many thanks for your help.

Answer 1

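You can use k-means clustering. With k-means you choose the number of clusters up front, so to get 4 classes you set n_clusters=4 and every row is assigned to one of the four groups. The step-by-step scheme, where groups are split or merged at each iteration and you either stop early once you reach the number of classes you want or step back through the already-built hierarchy to the level that has 4 groups, is hierarchical clustering rather than k-means.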

sklearn.cluster.KMeans doc
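
As a rough sketch of what this could look like on the made-up data from the question: the snippet below scales the four features (an assumption on my part, since Feat_4 would otherwise dominate the distances) and asks KMeans for 4 clusters. The column name "Class" and the +1 shift to get labels 1 to 4 just mirror the desired output in the question. The cluster numbers are arbitrary and will not necessarily match the ones shown there.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# The made-up data from the question.
df = pd.DataFrame({
    "Name": list("ABCDEFGH"),
    "Feat_1": [12, 76, 97, 32, 45, 61, 25, 49],
    "Feat_2": [0.10, 0.03, 0.07, 0.21, 0.15, 0.02, 0.22, 0.30],
    "Feat_3": [0, 1, 1, 1, 0, 1, 0, 0],
    "Feat_4": [9734, 10024, 8188, 6420, 7723, 14987, 5290, 7107],
})
features = ["Feat_1", "Feat_2", "Feat_3", "Feat_4"]

# Assumption: put all features on a comparable scale before clustering.
X = StandardScaler().fit_transform(df[features])

# Ask for 4 clusters; random_state only makes the run repeatable.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)

# fit_predict returns labels 0..3; shift to 1..4 to match the question.
df["Class"] = kmeans.fit_predict(X) + 1
print(df)
```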

Answer 2

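Classification is a supervised method, which means the training data comes with both features and labels. If you want to group the data based on the features alone, you can use a clustering algorithm (unsupervised), for example sklearn.cluster.KMeans with k = 4 (the n_clusters=4 parameter).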
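
To make the supervised/unsupervised difference concrete, here is a small sketch on the Iris data mentioned in the question. The choice of LogisticRegression as the classifier is just an illustration; the point is that the classifier needs the labels y in fit, while KMeans only sees the features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the known labels y are required for training.
clf = LogisticRegression(max_iter=1000).fit(X, y)
predicted_species = clf.predict(X[:5])

# Unsupervised: only the features are passed; groups are discovered.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_[:5]

print(predicted_species)  # values come from the known species labels
print(cluster_ids)        # arbitrary cluster numbers, not species
```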

Answer 3

Start with an unsupervised method to determine the clusters, then use those clusters as labels.

I would suggest using sklearn's GMM instead of k-means.

https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html

K-means assumes roughly circular (spherical) clusters.
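
A minimal sketch of that two-step idea, under my own assumptions: synthetic, deliberately stretched blobs stand in for real data, GaussianMixture finds the clusters, and the cluster ids are then used as labels to train an ordinary classifier (RandomForestClassifier here is only an example choice).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data, stretched so the clusters are elongated
# rather than circular (illustrative data, not from the question).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

# Step 1: unsupervised. A full covariance matrix per component
# lets each cluster take an elliptical shape.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
pseudo_labels = gmm.fit_predict(X)

# Step 2: supervised. Treat the cluster ids as labels and train a classifier.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, pseudo_labels)

# The classifier can now assign new points to the discovered groups.
new_points = X[:3] + 0.01
new_classes = clf.predict(new_points)
print(new_classes)
```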

Answer 4

This kind of problem is called unsupervised learning.

One definition is:

Unsupervised learning is a type of self-organized Hebbian learning that helps find previously unknown patterns in data set without pre-existing labels. It is also known as self-organization and allows modeling probability densities of given inputs.[1] It is one of the main three categories of machine learning, along with supervised and reinforcement learning. Semi-supervised learning has also been described, and is a hybridization of supervised and unsupervised techniques.

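There are many algorithms, and you will need to try them to see which one works best for you. Some examples are:

Hierarchical clustering (implemented in SciPy: https://en.wikipedia.org/wiki/Single-linkage_clustering)
k-means (implemented in sklearn: https://en.wikipedia.org/wiki/K-means_clustering)
DBSCAN (implemented in sklearn: https://en.wikipedia.org/wiki/DBSCAN)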
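
For the two algorithms in this list that have no example yet, a hedged sketch on the question's made-up data follows. The scaling step, the Ward linkage, and the DBSCAN settings (eps=1.5, min_samples=2) are illustrative guesses of mine, not tuned values. Note that DBSCAN does not take a number of clusters at all: it decides that itself and marks outliers with -1.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Feature columns Feat_1..Feat_4 from the question, rows A..H.
raw = np.array([
    [12, 0.10, 0, 9734],
    [76, 0.03, 1, 10024],
    [97, 0.07, 1, 8188],
    [32, 0.21, 1, 6420],
    [45, 0.15, 0, 7723],
    [61, 0.02, 1, 14987],
    [25, 0.22, 0, 5290],
    [49, 0.30, 0, 7107],
], dtype=float)
X = StandardScaler().fit_transform(raw)

# Hierarchical clustering: build the full merge tree, then cut it
# so that at most 4 clusters remain (labels start at 1).
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=4, criterion="maxclust")

# DBSCAN: density based, no cluster count given; -1 means noise.
db_labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(X)

print(hier_labels)
print(db_labels)
```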

python

classification

scikit-learn

multilabel-classification