Associating region index with true labels

Question

The documentation is a bit vague on this, even though I would have expected it to be fairly simple to implement.

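The k_mean algorithm applied to the MNIST digit dataset outputs 10 regions, each with a specific number attached to it, but that number is not the digit that most of the samples in the region actually represent.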

I do have my ground_truth label table.

How do I make each region generated by the k_mean algorithm end up labeled with the digit it has the highest probability of covering?
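
For comparison, the majority vote being asked about can be sketched in a few lines with collections.Counter. The function name majority_label_per_cluster and the toy data are made up for illustration; this is a minimal sketch, not something scikit-learn provides.

```python
from collections import Counter


def majority_label_per_cluster(cluster_ids, true_labels, n_clusters):
    """Return a list where entry c is the most frequent true label in cluster c."""
    # One counter of true labels per cluster.
    votes = [Counter() for _ in range(n_clusters)]
    for cluster, label in zip(cluster_ids, true_labels):
        votes[int(cluster)][int(label)] += 1

    mapping = []
    for counter in votes:
        if counter:
            # most_common(1) returns [(label, count)] for the top label.
            mapping.append(counter.most_common(1)[0][0])
        else:
            mapping.append(None)  # empty cluster: nothing to vote on
    return mapping


# Toy data (hypothetical): 3 clusters, labels chosen by hand.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
true_labels = [7, 7, 1, 3, 3, 7, 9, 9]
mapping = majority_label_per_cluster(cluster_ids, true_labels, n_clusters=3)
print(mapping)
```

The returned list has the same layout as switches_to_make: the index is the region number and the value is the digit assigned to it.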

I spent a few hours yesterday writing this code to do it, but it is still incomplete:

# TODO: for centroid-average method, see   
def most_probable_digit(indices, data):
    """
    Given an array of indices (for one specific label assigned by scikit, obtained with "get_list_of_indices")
    pointing to where the true labels are located in 'data', this function counts how many times each true label
    appears and returns the one that appeared most often (and which therefore has the highest probability
    of being the ground_truth_label designated by the region delimited by scikit).
    :param indices: array of the indices in 'data' that belong to one k_mean region
    :param data: all the data distributed across the k_mean regions
    :return: the most probable value (digit) associated with this region
    """
    actual_labels = []
    for i in indices:
        actual_labels.append(data[i])
    if verbose: print("The actual labels for each of those digits are:", actual_labels)
    counts = count_labels("actual labels", actual_labels)
    probable = counts.index(max(counts))
    if verbose: print("Most probable digit:", probable)
    return probable


def get_list_of_indices(data, label):
    """
    Return a list of indices for every position
    in 'data' where the specified 'label' can be found.
    :param data:
    :param label: the number associated with a region generated by k_mean
    :return:
    """
    return (np.where(data == label))[0].tolist()


# TODO: reassign in case of doubles
def obtain_corresponding_labels(data, real_labels):
    """
    Assign the most probable label to each region.
    :param data: list of regions associated with x_train or x_test (the order is preserved!)
    :param real_labels: actual labels to assign to the region numbers
    :return: the list of corresponding actual labels to region numbers
    """
    switches_to_make = []

    for i in range(10):
        list_of_indices = get_list_of_indices(data, i)  # indices in 'data' which are associated with region "i"
        probable_label = most_probable_digit(list_of_indices, real_labels)
        print("The assigned region", i, "should be considered as representing the digit ", probable_label)
        switches_to_make.append(probable_label)

    return switches_to_make


def rearrange_labels(switches_to_make, to_change):
    """
    Takes region numbers and assigns the most probable digit (label) to it.
    For example, if switches_to_make[3] = 5, it means that the 4th region (index 3 of the list)
    should be considered as representing the digit "5".
    :param switches_to_make: list of changes to make
    :param to_change: this table will be changed according to 'switches_to_make'
    :return: nothing, the change is made in-situ
    """
    for region in range(len(to_change)):
        for label in range(len(switches_to_make)):
            if to_change[region] == label:                    # if it corresponds to the "wrong" label given by scikit
                to_change[region] = switches_to_make[label]   # assign the "most probable" label
                break


def count_error_rate(found, truth):
    wrong = 0
    for i in range(len(found)):
        if found[i] != truth[i]:
            wrong += 1
    print("Error rate =     ", wrong / len(found) * 100, "%\n\n")


def treat_data(switches_to_make, predictions, truth):
    rearrange_labels(switches_to_make, predictions)    # Rearranging the training labels
    count_error_rate(predictions, truth)               # Counting error rate

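Currently, the problem with my code is that it can produce duplicates: if two regions have the same most probable digit, that digit is associated with both regions.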
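
One possible way to avoid such duplicates is to hand out digits greedily, starting with the most frequent (region, digit) pairs and never reusing a digit. The sketch below assumes that a one-to-one mapping is actually wanted; greedy_one_to_one is a hypothetical helper and a heuristic, not an optimal assignment. An optimal one-to-one matching could be computed with scipy.optimize.linear_sum_assignment on the same counts.

```python
from collections import Counter


def greedy_one_to_one(cluster_ids, true_labels):
    """Greedily pair clusters with labels so that no label is used twice.

    Returns a dict {cluster: label}. A cluster is left out of the dict if
    every label it contains has already been given to another cluster.
    """
    # Count how often each (cluster, label) pair occurs.
    pair_counts = Counter(zip(cluster_ids, true_labels))

    mapping = {}
    used_labels = set()
    # Visit pairs from most to least frequent; ties are broken by pair order.
    for (cluster, label), _count in sorted(
        pair_counts.items(), key=lambda item: (-item[1], item[0])
    ):
        if cluster not in mapping and label not in used_labels:
            mapping[cluster] = label
            used_labels.add(label)
    return mapping


# Toy data (hypothetical): both clusters are dominated by the digit 4.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 1]
true_labels = [4, 4, 4, 4, 4, 4, 9, 9]
mapping = greedy_one_to_one(cluster_ids, true_labels)
print(mapping)
```

Whether duplicates should be removed at all is a design choice: if two regions really do contain mostly the same digit, forcing them onto different digits may raise the error rate rather than lower it.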

Here is how I use the code:

kmeans = KMeans(n_clusters=10)  # TODO: eventually use "init=ndarray" to be able to use custom centroids for init ?
kmeans.fit(x_train)
training_labels = kmeans.labels_
print("Done with calculating the k-mean.\n")

switches_to_make = utils.obtain_corresponding_labels(training_labels, y_train)  # Obtaining the most probable labels

utils.treat_data(switches_to_make, training_labels, y_train)
print("Assigned labels:   ", training_labels)
print("Real labels:       ", y_train)


print("\n####################################################\nMoving on to predictions")
predictions = kmeans.predict(x_test)
utils.treat_data(switches_to_make, predictions, y_test)

My code has an error rate of roughly 50%.
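
When checking a figure like this, it may help to keep the raw region numbers intact: rearrange_labels rewrites its argument in place, and in the usage above training_labels is the same array as kmeans.labels_. The sketch below applies the mapping to a copy and computes the error rate separately; apply_mapping and error_rate are hypothetical names and the data is made up.

```python
def apply_mapping(cluster_ids, mapping):
    """Return a new list of digits; the input cluster ids are left untouched."""
    return [mapping[int(c)] for c in cluster_ids]


def error_rate(predicted, truth):
    """Percentage of positions where predicted and truth disagree."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    wrong = sum(1 for p, t in zip(predicted, truth) if p != t)
    return 100.0 * wrong / len(truth)


# Toy example (hypothetical) with 3 clusters.
mapping = [7, 3, 9]            # cluster 0 -> digit 7, cluster 1 -> 3, cluster 2 -> 9
cluster_ids = [0, 0, 1, 2, 2]
truth = [7, 1, 3, 9, 7]

predicted = apply_mapping(cluster_ids, mapping)
rate = error_rate(predicted, truth)
print(predicted, rate)
```

With NumPy arrays the same remapping can be written as np.asarray(mapping)[cluster_ids], which also returns a new array instead of modifying the original.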

Answer 1

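If I understand correctly, you want to assign the actual digit value as the label of the cluster that matches it, right? If so, I don't think that is possible.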

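K-Means is an unsupervised learning algorithm. It does not understand what it is looking at, and the labels it assigns are arbitrary. Instead of 0, 1, 2, ... it could just as well call them 'apple', 'orange', 'grape', ... . All K-Means can do is tell you that a bunch of data points are similar to each other according to some metric, nothing more. It is great for data exploration or pattern finding, but not for telling you "what" something actually is.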
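
To illustrate the point about arbitrary names, the sketch below checks whether two labelings describe the same grouping of samples once the names are ignored. same_partition is a hypothetical helper and the two labelings are made up.

```python
def same_partition(labels_a, labels_b):
    """True if both labelings group the samples identically, up to renaming."""
    if len(labels_a) != len(labels_b):
        return False
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each name in A must always map to the same name in B, and vice versa.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


run_1 = [0, 0, 1, 1, 2]
run_2 = ["grape", "grape", "apple", "apple", "orange"]
print(same_partition(run_1, run_2))  # the groups match, only the names differ
```

Two runs of K-Means on the same data can differ in exactly this way, which is why the cluster number by itself carries no information about the digit.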

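No matter what post-processing you do, the computer can never know programmatically what the true labels are unless you, the human, tell it. In that case you might as well use a supervised learning algorithm.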

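If you want to train a model that assigns the correct label when it sees a digit, you have to use a supervised learning method (where labels are a thing). Take a look at Random Forest instead, for instance. Here is a similar endeavor.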

Answer 2

Here is the code that uses my solution:

from sklearn.cluster import KMeans

import utils

# Load the dataset
x_train, y_train = utils.get_train_data()
x_test,  y_test  = utils.get_test_data()

kmeans = KMeans(n_clusters=10)
kmeans.fit(x_train)
training_labels = kmeans.labels_

switches_to_make = utils.find_closest_digit_to_centroids(kmeans, x_train, y_train)  # Obtaining the most probable labels (digits) for each region

utils.treat_data(switches_to_make, training_labels, y_train)

predictions = kmeans.predict(x_test)
utils.treat_data(switches_to_make, predictions, y_test)

And utils.py:

import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances_argmin_min


use_reduced = True  # Flag variable to use the reduced datasets (generated by 'pre_process.py')
verbose = False  # Should debugging prints be shown


def get_data(reduced_path, path):
    """
    Load the requested dataset.
    :param reduced_path: path to the reduced version (generated by 'pre_process.py')
    :param path: path to the full version
    :return: numpy arrays (data, labels)
    """
    if use_reduced:
        data = open(reduced_path)
    else:
        data = open(path)
    csv_file = csv.reader(data)
    data_points = []
    for row in csv_file:
        data_points.append(row)
    data_points.pop(0)  # Drop the first row, i.e. the column headers
    data.close()

    # Convert from string to int
    for i in range(len(data_points)):  # for each image
        for j in range(len(data_points[0])):  # for each pixel
            data_points[i][j] = int(data_points[i][j])
            # # To get FLOAT values normalized between 0 and 1:
            # data_points[i][j] =  np.divide(float(data_points[i][j]), 255)

    # Separate the labels from the data
    y_train = []  # labels
    for row in data_points:
        y_train.append(row[0])  # first column is the label
    x_train = []  # data
    for row in data_points:
        x_train.append(row[1:785])  # other columns are the pixels

    x_train = np.array(x_train)
    y_train = np.array(y_train)
    print("Done with loading the dataset.")

    return x_train, y_train


def get_test_data():
    """
    Return the requested test dataset.
    :return: numpy arrays (data, labels)
    """
    return get_data('../data/reduced_mnist_test.csv', '../data/mnist_test.csv')


def get_train_data():
    """
    Return the requested training dataset.
    :return: numpy arrays (data, labels)
    """
    return get_data('../data/reduced_mnist_train.csv', '../data/mnist_train.csv')


def display_data(x_train, y_train):
    """
    Display the given digit.
    :param x_train: the data (784D)
    :param y_train: the associated label
    :return:
    """
    # Display example: convert our one-dimensional vector into two dimensions
    matrix = np.reshape(x_train, (28, 28))
    plt.imshow(matrix, cmap='gray')
    plt.title("Voici un " + str(y_train))
    plt.show()


def generate_mean_images(x_train, y_train):
    """
    Return the array of mean images for each class.
    :param x_train:
    :param y_train:
    :return:
    """
    counts = np.zeros(10).astype(int)

    for label in y_train:
        counts[label] += 1

    sum_pixel_values = np.zeros((10, 784)).astype(int)

    for img in range(len(y_train)):
        for pixel in range(len(x_train[0])):
            sum_pixel_values[y_train[img]][pixel] += x_train[img][pixel]

    pixel_probability = np.zeros((len(counts), len(x_train[0])))  # (10, 784)

    for classe in range(len(counts)):
        for pixel in range(len(x_train[0])):
            pixel_probability[classe][pixel] = np.divide(sum_pixel_values[classe][pixel] + 1, counts[classe] + 2)

    # Build the mean images unconditionally so the return value does not depend on 'verbose'
    mean_images = [np.reshape(pixel_probability[classe], (28, 28)) for classe in range(len(counts))]

    if verbose:
        plt.figure(figsize=(20, 4))  # size of the plot: (x, y) in inches
        plt.suptitle("Such wow, much impress !")

        for classe, class_mean in enumerate(mean_images):
            # Aesthetics
            plt.subplot(1, 10, classe + 1)
            plt.title(str(classe))
            plt.imshow(class_mean, cmap='gray')
            plt.xticks([])
            plt.yticks([])

        plt.show()

    return mean_images


#########
# used for "k_mean" (for now)


def count_labels(name, data):
    """
    Count how many data points are associated with each label.
    :param name: name of what is being counted
    :param data: must be 1D
    :return: counts = the count for each label
    """
    header = "-- " + str(name) + " -- "  # making sure it's a String
    counts = [0]*10  # initializing the counting array

    for label in data:
        counts[label] += 1
    if verbose: print(header, "Amounts for each label:", counts)

    return counts


def get_list_of_indices(data, label):
    """
    Return a list of indices for every position
    in 'data' where the specified 'label' can be found.
    :param data:
    :param label: the number associated with a region generated by k_mean
    :return:
    """
    return (np.where(data == label))[0].tolist()


def rearrange_labels(switches_to_make, to_change):
    """
    Takes region numbers and assigns the most probable digit (label) to it.
    For example, if switches_to_make[3] = 5, it means that the 4th region (index 3 of the list)
    should be considered as representing the digit "5".
    :param switches_to_make: list of changes to make
    :param to_change: this table will be changed according to 'switches_to_make'
    :return: nothing, the change is made in-situ
    """
    for region in range(len(to_change)):
        for label in range(len(switches_to_make)):
            if to_change[region] == label:                    # if it corresponds to the "wrong" label given by scikit
                to_change[region] = switches_to_make[label]   # assign the "most probable" label
                break


def count_error_rate(found, truth):
    wrong = 0
    for i in range(len(found)):
        if found[i] != truth[i]:
            wrong += 1
    percent = wrong / len(found) * 100

    print("Error rate =     ", percent, "%")
    return percent


def treat_data(switches_to_make, predictions, truth):
    rearrange_labels(switches_to_make, predictions)    # Rearranging the training labels
    count_error_rate(predictions, truth)               # Counting error rate


# TODO: reassign in case of doubles
# adapted from  
def find_closest_digit_to_centroids(kmean, data, labels):
    """
    The array 'closest' will contain the index of the point in 'data' that is closest to each centroid.
    Let's say the 'closest' gave output as array([0,8,5]) for the three clusters. So data[0] is the
    closest point in 'data' to centroid 0, and data[8] is the closest to centroid 1 and so on.
    If the returned list is [9,4,2,1,3] it would mean that the region #0 (index 0) represents the digit 9 the best.
    :param kmean: the variable where the 'fit' data has been stored
    :param data: the actual data that was used with 'fit' (x_train)
    :param labels: the true labels associated with 'data' (y_train)
    :return: list where each region is at its index and the value at that index is the digit it represents
    """
    closest, _ = pairwise_distances_argmin_min(kmean.cluster_centers_,
                                               data,
                                               metric="euclidean")

    switches_to_make = []
    for region in range(len(closest)):
        truth = labels[closest[region]]
        print("The assigned region", region, "should be considered as representing the digit ", truth)
        switches_to_make.append(truth)

    print("Digits associated to each region (switches_to_make):", switches_to_make)
    return switches_to_make

Basically, this is the function that solved my problem:

# adapted from  
def find_closest_digit_to_centroids(kmean, data, labels):
    """
    The array 'closest' will contain the index of the point in 'data' that is closest to each centroid.
    Let's say the 'closest' gave output as array([0,8,5]) for the three clusters. So data[0] is the
    closest point in 'data' to centroid 0, and data[8] is the closest to centroid 1 and so on.
    If the returned list is [9,4,2,1,3] it would mean that the region #0 (index 0) represents the digit 9 the best.
    :param kmean: the variable where the 'fit' data has been stored
    :param data: the actual data that was used with 'fit' (x_train)
    :param labels: the true labels associated with 'data' (y_train)
    :return: list where each region is at its index and the value at that index is the digit it represents
    """
    closest, _ = pairwise_distances_argmin_min(kmean.cluster_centers_,
                                               data,
                                               metric="euclidean")

    switches_to_make = []
    for region in range(len(closest)):
        truth = labels[closest[region]]
        print("The assigned region", region, "should be considered as representing the digit ", truth)
        switches_to_make.append(truth)

    print("Digits associated to each region (switches_to_make):", switches_to_make)
    return switches_to_make
