
Where can I find practical example of KNN in java using weka

I've been looking for a practical example of implementing KNN with Weka, but everything I've found is too generic for me to understand the data it needs in order to work (or maybe how to build the objects it requires) and the results it shows. Maybe someone who has used it before has a better example, something realistic (products, movies, books, etc.), not the typical letters you see in algebra.

So I can figure out how to implement it in my case (which is recommending dishes to an active user using KNN). It would be much appreciated, thanks.

I tried to understand it with this link https://www.ibm.com/developerworks/library/os-weka3/index.html but I don't even understand how they got that result or how they got the formula:

Step 1: Determine the distance formula

Distance = SQRT( ((58 - Age)/(69-35))^2 + ((51000 - Income)/(150000-38000))^2 )

Why is it always /(69-35) and /(150000-38000)?

EDIT:

This is the code I tried without success. I'd appreciate it if someone could help me clear it up. I wrote it by combining these 2 answers:

This answer shows how to get the knn:

How to get the nearest neighbor in weka using java

And this one tells me how to create instances (I don't really know what they are for in Weka): Adding a new Instance in weka

So I came up with this:

import java.util.ArrayList;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.neighboursearch.LinearNNSearch;

public class Wekatest {

    public static void main(String[] args) {

        ArrayList<Attribute> atts = new ArrayList<>();
        ArrayList<String> classVal = new ArrayList<>();
        // I don't really understand what's happening here
        classVal.add("A");
        classVal.add("B");
        classVal.add("C");
        classVal.add("D");
        classVal.add("E");
        classVal.add("F");

        atts.add(new Attribute("content", (ArrayList<String>) null));
        atts.add(new Attribute("@@class@@", classVal));

        // Here in my case the data to evaluate are dishes (plato means dish in Spanish)
        Instances dataRaw = new Instances("TestInstancesPlatos", atts, 0);

        // I imagine that every instance is like an object that will be compared with the other instances to get its nearest neighbours (so an instance is like a dish for me)..

        double[] instanceValue1 = new double[dataRaw.numAttributes()];

        instanceValue1[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue1[1] = 0;

        dataRaw.add(new DenseInstance(1.0, instanceValue1));

        double[] instanceValue2 = new double[dataRaw.numAttributes()];

        instanceValue2[0] = dataRaw.attribute(0).addStringValue("Tunas");
        instanceValue2[1] = 1;

        dataRaw.add(new DenseInstance(1.0, instanceValue2));

        double[] instanceValue3 = new double[dataRaw.numAttributes()];

        instanceValue3[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue3[1] = 2;

        dataRaw.add(new DenseInstance(1.0, instanceValue3));

        double[] instanceValue4 = new double[dataRaw.numAttributes()];

        instanceValue4[0] = dataRaw.attribute(0).addStringValue("Hamburguers");
        instanceValue4[1] = 3;

        dataRaw.add(new DenseInstance(1.0, instanceValue4));

        double[] instanceValue5 = new double[dataRaw.numAttributes()];

        instanceValue5[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue5[1] = 4;

        dataRaw.add(new DenseInstance(1.0, instanceValue5));

        System.out.println("---------------------");

        weka.core.neighboursearch.LinearNNSearch knn = new LinearNNSearch(dataRaw);
        try {

            // This method receives the goal instance whose neighbours you want to know, and N (I don't really know what N is, but I imagine it's the number of neighbours I want)
            Instances nearestInstances = knn.kNearestNeighbours(dataRaw.get(0), 1);
            // I expected the output to be the closest neighbour to dataRaw.get(0), which would be Pizzas, but instead I got some data that I don't really understand.


            System.out.println(nearestInstances);

        } catch (Exception e) {

            e.printStackTrace();
        }

    }

}

OUTPUT:

---------------------
@relation TestInstancesPlatos

@attribute content string
@attribute @@class@@ {A,B,C,D,E,F}

@data
Pizzas,A
Tunas,B
Pizzas,C
Hamburguers,D

Weka dependency used:

<dependency>
    <groupId>nz.ac.waikato.cms.weka</groupId>
    <artifactId>weka-stable</artifactId>
    <version>3.8.0</version>
</dependency>

Pretty simple. To understand why it is always /(69-35) and /(150000-38000), you first need to understand what normalization means.

Normalization:
Normalization usually means scaling a variable to have values between 0 and 1.
The formula is as follows:

normalized value = (value - min) / (max - min)

If you look closely at the denominator of the formula above, you'll see that it is the maximum of all the numbers minus the minimum of all the numbers.

Now, coming back to your question... Look at the 5th line of the article, which reads as follows:

The easiest and most common distance calculation is the "Normalized Euclidian Distance."

In your Age column, you can see that the minimum value is 35 and the maximum is 69. Similarly, in your Income column, the minimum is 38k and the maximum is 150k.

That's the exact reason why you always have /(69-35) and /(150000-38000).
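As a quick check of that arithmetic, the article's distance can be sketched in a few lines of Java (the reference point (58, 51000) and the ranges 35..69 and 38000..150000 come from the article's data; the class and method names are made up for illustration):

```java
// Sketch of the article's normalized Euclidean distance. The reference
// point (58, 51000) and the attribute ranges come from the article's table.
public class NormalizedDistance {

    // Difference between two raw values, scaled by the attribute's range
    // (max - min) so every attribute contributes on the same 0..1 scale.
    static double scaledDiff(double a, double b, double min, double max) {
        return (a - b) / (max - min);
    }

    // Distance from the article's reference customer (Age 58, Income 51000).
    static double distance(double age, double income) {
        double dAge = scaledDiff(58, age, 35, 69);                  // age range 35..69
        double dIncome = scaledDiff(51000, income, 38000, 150000);  // income range 38k..150k
        return Math.sqrt(dAge * dAge + dIncome * dIncome);
    }

    public static void main(String[] args) {
        // An identical customer is at distance 0, and after scaling,
        // a full-range difference in either attribute contributes equally.
        System.out.println(distance(58, 51000)); // prints 0.0
        System.out.println(distance(35, 38000));
    }
}
```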

Hope you got it.

Peace.

KNN is a machine-learning technique usually classified as an "Instance-Based predictor". It takes all the instances of classified samples and plots them in an n-dimensional space.

Using algorithms such as Euclidean distance, KNN looks for the closest points in this n-dimensional space and estimates which class a new point belongs to based on these nearest neighbors. Close to blue points, it's blue; close to red points...

But now, how do we apply it to your problem?

Imagine you only have two attributes, price and calories (a two-dimensional space). You want to classify customers into three classes: Fit, Junk Food, Gourmet. With this, you can offer a dish at your restaurant similar to the customer's preferences.

You have the following data:

+-------+----------+-----------+
| Price | Calories | Food Type |
+-------+----------+-----------+
|     |    350   | Junk Food |
+-------+----------+-----------+
|     |    700   | Junk Food |
+-------+----------+-----------+
|    |    200   | Fit       |
+-------+----------+-----------+
|     |    400   | Junk Food |
+-------+----------+-----------+
|     |    150   | Fit       |
+-------+----------+-----------+
|     |    650   | Junk Food |
+-------+----------+-----------+
|     |    120   | Fit       |
+-------+----------+-----------+
|    |    230   | Gourmet   |
+-------+----------+-----------+
|    |    210   | Fit       |
+-------+----------+-----------+
|    |    475   | Gourmet   |
+-------+----------+-----------+
|    |    600   | Gourmet   |
+-------+----------+-----------+

Now, let's see it plotted in 2D space:

What happens next?

For each new entry, the algorithm computes the distance to all points (instances) and finds the k nearest ones. The class of the new entry is defined by the classes of these k nearest neighbors.

Take k = 3 and the values $15 and 165 cal. Let's find the 3 nearest neighbors:

This is where the distance formula comes in. It is actually computed for every point. The distances are then "ranked", and the k closest ones define the final class.

Now, why the values /(69-35) and /(150000-38000)? As mentioned in the other answer, this is due to normalization. Our example uses price and calories. As you can see, calories are on a higher order of magnitude than money (more units per value). To avoid imbalances, such as calories being worth more to the class than price (which would kill the Gourmet class, for example), all attributes need to be made equally important, hence normalization.
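The ranking step above can be sketched from scratch in plain Java. Note that the price values in the table above were lost, so the ones used here are invented for illustration; the query ($15, 165 cal, k = 3) matches the example above:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy kNN over the price/calories example. The price values in the table
// above were lost, so the ones used here are invented for illustration.
public class KnnSketch {

    record Sample(double price, double calories, String label) {}

    static List<Sample> sampleData() {
        return List.of(
            new Sample(12, 350, "Junk Food"), new Sample(15, 700, "Junk Food"),
            new Sample(8, 200, "Fit"),        new Sample(14, 400, "Junk Food"),
            new Sample(7, 150, "Fit"),        new Sample(16, 650, "Junk Food"),
            new Sample(6, 120, "Fit"),        new Sample(25, 230, "Gourmet"),
            new Sample(9, 210, "Fit"),        new Sample(30, 475, "Gourmet"),
            new Sample(35, 600, "Gourmet"));
    }

    static String classify(List<Sample> data, double price, double cal, int k) {
        // Min-max normalization: scale both attributes by their range so
        // that calories (hundreds) don't drown out prices (tens).
        double pMin = data.stream().mapToDouble(Sample::price).min().orElseThrow();
        double pMax = data.stream().mapToDouble(Sample::price).max().orElseThrow();
        double cMin = data.stream().mapToDouble(Sample::calories).min().orElseThrow();
        double cMax = data.stream().mapToDouble(Sample::calories).max().orElseThrow();

        // Rank all samples by normalized Euclidean distance to the query.
        List<Sample> ranked = new ArrayList<>(data);
        ranked.sort(Comparator.comparingDouble(s -> {
            double dp = (s.price() - price) / (pMax - pMin);
            double dc = (s.calories() - cal) / (cMax - cMin);
            return Math.sqrt(dp * dp + dc * dc);
        }));

        // Majority vote among the k nearest neighbors.
        Map<String, Integer> votes = new HashMap<>();
        for (Sample s : ranked.subList(0, k)) {
            votes.merge(s.label(), 1, Integer::sum);
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .orElseThrow().getKey();
    }

    public static void main(String[] args) {
        System.out.println(classify(sampleData(), 15, 165, 3)); // prints Fit
    }
}
```

With these made-up prices, the 3 nearest neighbors of ($15, 165 cal) are all Fit dishes, so the vote is unanimous.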

Weka abstracts that away for you, but you can also visualize it. See a visualization example from a project I made for a Weka ML course:

Note that since there are many more than 2 dimensions, there are many plots, but the idea is similar.

Explaining the code:

public class Wekatest {

    public static void main(String[] args) {
//These two ArrayLists are the inputs of your algorithm.
//atts are the attributes that you're going to pass for training, usually called X.
//classVal is the target class that is to be predicted, usually called y.
        ArrayList<Attribute> atts = new ArrayList<>();
        ArrayList<String> classVal = new ArrayList<>();
//Here you initiate a "dictionary" of all distinct types of restaurants that can be targeted.
        classVal.add("A");
        classVal.add("B");
        classVal.add("C");
        classVal.add("D");
        classVal.add("E");
        classVal.add("F");
// The next two lines initiate the attributes, one made of "content" and other pertaining to the class of the already labeled values.
        atts.add(new Attribute("content", (ArrayList<String>) null));
        atts.add(new Attribute("@@class@@", classVal));

//This creates an empty Weka dataset (relation) named "TestInstancesPlatos", using the attributes and classes defined above.
//dataRaw will hold the set of previously labelled instances that are used to "train the model" (kNN actually doesn't train anything, it uses all the data for its predictions).
        Instances dataRaw = new Instances("TestInstancesPlatos", atts, 0);


//Here you're starting new instances to test your model. This is where you can substitute for new inputs for production.
        double[] instanceValue1 = new double[dataRaw.numAttributes()];

//It looks like you only have 2 attributes, a food product and maybe a rating.
        instanceValue1[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue1[1] = 0;

//You're appending this new instance to the model for evaluation.
        dataRaw.add(new DenseInstance(1.0, instanceValue1));

        double[] instanceValue2 = new double[dataRaw.numAttributes()];

        instanceValue2[0] = dataRaw.attribute(0).addStringValue("Tunas");
        instanceValue2[1] = 1;

        dataRaw.add(new DenseInstance(1.0, instanceValue2));

        double[] instanceValue3 = new double[dataRaw.numAttributes()];

        instanceValue3[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue3[1] = 2;

        dataRaw.add(new DenseInstance(1.0, instanceValue3));

        double[] instanceValue4 = new double[dataRaw.numAttributes()];

        instanceValue4[0] = dataRaw.attribute(0).addStringValue("Hamburguers");
        instanceValue4[1] = 3;

        dataRaw.add(new DenseInstance(1.0, instanceValue4));

        double[] instanceValue5 = new double[dataRaw.numAttributes()];

        instanceValue5[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue5[1] = 4;

        dataRaw.add(new DenseInstance(1.0, instanceValue5));

// After adding 5 instances, time to test:
        System.out.println("---------------------");

//Load the algorithm with data.
        weka.core.neighboursearch.LinearNNSearch knn = new LinearNNSearch(dataRaw);
//You're asking for the nearest neighbours of instance 0 of your raw data. The second argument is k, the number of neighbours to return.
        try {
            Instances nearestInstances = knn.kNearestNeighbours(dataRaw.get(0), 1);
//You get back the nearest instance(s) themselves (an Instances object), not a single label among A and F.
           System.out.println(nearestInstances);

        } catch (Exception e) {

            e.printStackTrace();
        }

    }

}

What should you do?

-> Gather data.
-> Define a set of attributes that help you predict which cuisine you have (e.g. prices, dishes, or ingredients (one attribute per dish or ingredient)).
-> Organize this data.
-> Define a set of labels.
-> Manually label a set of data.
-> Load the labelled data into KNN.
-> Label new instances by passing their attributes to KNN. It will return the label of the k nearest neighbors (good values for k are 3 or 5; you have to test).
-> Have fun!
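Putting those steps together with Weka itself, a minimal sketch could look like this. `IBk` is Weka's KNN classifier; the attribute names, toy values, and labels below are invented for illustration:

```java
import java.util.ArrayList;

import weka.classifiers.lazy.IBk;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

// Minimal sketch of the steps above using Weka's KNN classifier (IBk).
// Attribute names, values, and labels are invented for illustration.
public class DishKnn {

    public static void main(String[] args) throws Exception {
        // Two numeric attributes plus a nominal class (the label set).
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("price"));
        atts.add(new Attribute("calories"));
        ArrayList<String> labels = new ArrayList<>();
        labels.add("Fit");
        labels.add("Junk Food");
        labels.add("Gourmet");
        atts.add(new Attribute("type", labels));

        Instances data = new Instances("dishes", atts, 0);
        data.setClassIndex(data.numAttributes() - 1); // last attribute is the class

        // Manually labelled training data ("Manually label a set of data").
        double[][] rows = {
            {8, 200, labels.indexOf("Fit")},
            {7, 150, labels.indexOf("Fit")},
            {12, 350, labels.indexOf("Junk Food")},
            {15, 700, labels.indexOf("Junk Food")},
            {25, 230, labels.indexOf("Gourmet")},
            {30, 475, labels.indexOf("Gourmet")}
        };
        for (double[] row : rows) {
            data.add(new DenseInstance(1.0, row));
        }

        // Build the classifier with k = 3 neighbours
        // ("Load the labelled data into KNN").
        IBk knn = new IBk(3);
        knn.buildClassifier(data);

        // Classify a new, unlabelled dish ("Label new instances").
        DenseInstance query = new DenseInstance(data.numAttributes());
        query.setDataset(data);
        query.setValue(0, 9);   // price
        query.setValue(1, 180); // calories
        double predicted = knn.classifyInstance(query);
        System.out.println(data.classAttribute().value((int) predicted));
    }
}
```

Unlike `LinearNNSearch.kNearestNeighbours` in your code, which returns the neighbouring instances themselves, `IBk.classifyInstance` does the voting for you and returns the index of the predicted class.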