Training and testing data in machine learning

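I want to train on data using the K-means algorithm and then test it on another, similar dataset that has just one column removed. I'm new to machine learning, so I took the code from https://www.datacamp.com/community/tutorials/k-means-clustering-python to apply to one of my datasets, but where does the prediction part happen in that tutorial? We just feed in the data and test the accuracy. How do we apply the algorithm to the test data (which will obviously be different) to predict the value of the missing attribute?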

Your confusion is very common when you are starting out with machine learning.

From Wikipedia:

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way (see inductive bias).

Also from Wikipedia:

Unsupervised learning is a branch of machine learning that learns from test data that has not been labeled, classified or categorized. Instead of responding to feedback, unsupervised learning identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data.

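The K-means clustering algorithm is an unsupervised learning algorithm. In unsupervised learning you have no labels, because you are not trying to predict anything. Instead, you are trying to find a way to cluster the data so that data points sharing common characteristics end up grouped together.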
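To make that concrete, here is a minimal sketch using scikit-learn's `KMeans`. The toy data (two blobs in 2-D) and the names `X` and `new_points` are made up for illustration and are not taken from the tutorial. Note that `fit` never receives a target column, and `predict` only returns the index of the nearest learned cluster centre for each new point, not the value of a missing attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated blobs in 2-D, with no labels at all.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# fit() only searches for cluster centres; there is no target column involved.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# predict() assigns each new point to the nearest learned centre.
# The output is a cluster index, not a predicted attribute value.
new_points = np.array([[0.2, -0.1], [4.8, 5.3]])
assignments = kmeans.predict(new_points)
print(assignments)
```

Which blob ends up as cluster 0 and which as cluster 1 is arbitrary, which is another sign that these indices are groupings rather than predictions of a known quantity.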

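The reason you use test (and often validation) sets in supervised learning in the first place is to evaluate the generalization properties of your model in order to avoid over-fitting. In unsupervised learning you cannot evaluate this, because you do not know the actual clusters of the data. So there is no point in using a test set.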
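For contrast, here is a rough sketch of what a test set does in the supervised case. It assumes scikit-learn's bundled Iris dataset and a decision tree purely as an example; neither comes from the question. The key point is that the known labels `y` are what make it possible to score the model on data it has not seen.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labeled data: X holds the features, y holds the known target for each row.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unpruned tree can fit the training rows very closely.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Comparing the two scores shows how well the model generalizes:
# a large gap between them would suggest over-fitting.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Without a `y_test` to compare against, as in clustering, there is nothing for `accuracy_score` to measure.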