How do you make a KMeans prediction more accurate?

Question

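I'm learning about clustering, KMeans and so on, so my knowledge of the topic is very basic. Below is a bit of self-study I did to see how it works. Basically, if 'a' appears in any of the columns, 'Binary' equals 1. Essentially, I'm trying to teach it a pattern. I learned the following from a tutorial that uses the Titanic dataset, but I've adapted it to my own data.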

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt

The data I built:

dataset = [
    [0,'x','f','g'],[1,'a','c','b'],[1,'d','k','a'],[0,'y','v','w'],
    [0,'q','w','e'],[1,'c','a','l'],[0,'t','x','j'],[1,'w','o','a'],
    [0,'z','m','n'],[1,'z','x','a'],[0,'f','g','h'],[1,'h','a','c'],
    [1,'a','r','e'],[0,'g','c','c']     
]

df = pd.DataFrame(dataset, columns=['Binary','Col1','Col2','Col3'])
df.head()

df:

Binary  Col1  Col2  Col3
------------------------
  0       x    f     g
  1       a    c     b
  1       d    k     a
  0       y    v     w
  0       q    w     e

Encode the non-numeric columns as numbers:

labelEncoder = LabelEncoder()
# Refit the encoder on each letter column and replace it with integer codes
for col in ['Col1', 'Col2', 'Col3']:
    df[col] = labelEncoder.fit_transform(df[col])

Set the number of clusters to two, since the value is either 1 or 0?

# Pass the columns by keyword; the positional axis argument was removed in pandas 2.0
X = np.array(df.drop(columns=['Binary']).astype(float))
y = np.array(df['Binary'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

Test it:

correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1

Result:

print(f'{round(correct/len(X) * 100)}% Accuracy')
>>> 71%

How can I make it more accurate, to the point where it is 99.99% sure that 'a' means the Binary column is 1? More data?

Answer 1

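K-means doesn't even try to predict this value, because it is an unsupervised method: it never sees the 'Binary' column. It is not a prediction algorithm; clustering is a structure-discovery task. Don't mistake clustering for classification.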
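To make the clustering/classification distinction concrete, here is a minimal sketch of a supervised alternative on the question's data. The choice of `DecisionTreeClassifier`, the hand-made "is this cell 'a'?" indicator features, and the two `new_rows` are assumptions made for illustration, not something the question or the tutorial prescribes.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

dataset = [
    [0, 'x', 'f', 'g'], [1, 'a', 'c', 'b'], [1, 'd', 'k', 'a'], [0, 'y', 'v', 'w'],
    [0, 'q', 'w', 'e'], [1, 'c', 'a', 'l'], [0, 't', 'x', 'j'], [1, 'w', 'o', 'a'],
    [0, 'z', 'm', 'n'], [1, 'z', 'x', 'a'], [0, 'f', 'g', 'h'], [1, 'h', 'a', 'c'],
    [1, 'a', 'r', 'e'], [0, 'g', 'c', 'c'],
]
cols = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(dataset, columns=['Binary'] + cols)

# Assumed feature engineering: one 0/1 indicator per column, "is this cell 'a'?"
X = df[cols].eq('a').astype(int)
y = df['Binary']  # the classifier is shown the labels, unlike k-means

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
train_accuracy = clf.score(X, y)

# Hypothetical unseen rows: one containing 'a', one without
new_rows = pd.DataFrame([['p', 'a', 'q'], ['p', 'q', 'r']], columns=cols)
new_pred = clf.predict(new_rows.eq('a').astype(int))

print(f'{train_accuracy:.0%} training accuracy, new rows -> {list(new_pred)}')
```

The point is only that a classifier is fitted against `y`, so it can learn the "'a' means 1" rule; with this little data the training accuracy says nothing about how well any model would generalise.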

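The cluster numbers have no meaning. They are 0 and 1 simply because those are the first two integers. K-means is also randomized: run it a few times and sometimes you will score only 29%, which is the same two clusters with the numbers swapped (100% minus 71%).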
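A small sketch of why the "accuracy" of raw cluster numbers is arbitrary, using the question's data. The loop over `random_state` seeds and the `n_init=10` setting are assumptions chosen for the demonstration; the numbers you see may differ between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder

dataset = [
    [0, 'x', 'f', 'g'], [1, 'a', 'c', 'b'], [1, 'd', 'k', 'a'], [0, 'y', 'v', 'w'],
    [0, 'q', 'w', 'e'], [1, 'c', 'a', 'l'], [0, 't', 'x', 'j'], [1, 'w', 'o', 'a'],
    [0, 'z', 'm', 'n'], [1, 'z', 'x', 'a'], [0, 'f', 'g', 'h'], [1, 'h', 'a', 'c'],
    [1, 'a', 'r', 'e'], [0, 'g', 'c', 'c'],
]
cols = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(dataset, columns=['Binary'] + cols)
for col in cols:
    df[col] = LabelEncoder().fit_transform(df[col])  # same encoding as the question

X = df[cols].to_numpy(dtype=float)
y = df['Binary'].to_numpy()

results = []
for seed in range(5):
    labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X)
    acc = np.mean(labels == y)                # cluster number read as the label
    acc_flipped = np.mean((1 - labels) == y)  # same clusters, numbers swapped
    results.append((acc, acc_flipped))
    print(f'seed={seed}: {acc:.0%} vs {acc_flipped:.0%} after swapping 0 and 1')
```

For any single run the two figures add up to 100%, so which one you observe depends only on which cluster happened to be numbered 0.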

Also, k-means is designed for continuous inputs. You can apply it to binary-encoded data, but the results will tend to be poor.
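As a rough illustration of the encoding problem: k-means relies on Euclidean distances, and integer codes for letters create distances that mean nothing. The three-letter toy vocabulary below is an assumption made up for this sketch.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

letters = np.array(['a', 'b', 'z'])  # toy categories, assumed for illustration

# LabelEncoder assigns integers in sorted order: a -> 0, b -> 1, z -> 2
codes = LabelEncoder().fit_transform(letters)
dist_a_b = int(abs(codes[0] - codes[1]))
dist_a_z = int(abs(codes[0] - codes[2]))
print('label codes:', dist_a_b, dist_a_z)  # 'z' looks twice as far from 'a' as 'b' does

# One-hot encoding removes the fake ordering: all distinct letters are equally far apart
one_hot = OneHotEncoder().fit_transform(letters.reshape(-1, 1)).toarray()
onehot_a_b = np.linalg.norm(one_hot[0] - one_hot[1])
onehot_a_z = np.linalg.norm(one_hot[0] - one_hot[2])
print('one-hot:', onehot_a_b, onehot_a_z)
```

One-hot columns avoid the artificial ordering, but they are still 0/1 values rather than continuous measurements, so the caveat above about k-means on such data still applies.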

cluster-analysis

machine-learning

k-means