Is this a classification problem or a regression problem?
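In one of Andrew Ng's lectures, he asks whether the following problem is a classification problem or a regression problem. The answer given is: regression.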
You have a large inventory of identical items. You want to predict how
many of these items will sell over the next 3 months.
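It seems I am missing something. As I understand it, this should be a classification problem, because we have to sort each item into one of two classes, will sell or will not sell, and those are discrete values rather than continuous ones.
I am not sure where the gap in my understanding is.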
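Your reasoning assumes a database of items, each with its own features, where you predict whether each individual item will sell and then simply count the items predicted to sell. Framed that way, it is indeed a classification problem.
However, note the following sentence in your question: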
You have a large inventory of identical items.
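Identical items means that every item has exactly the same features. If you build a binary classifier to decide whether an item will sell, then, because all the feature values are exactly the same, the classifier will put every item in the same class.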
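To make that concrete, here is a minimal sketch. The classifier `will_sell`, its weights, and the two-number feature vector are all made up for illustration; any deterministic classifier behaves the same way on identical inputs.

```python
def will_sell(features, weights=(0.4, -0.2), bias=0.1):
    """Hypothetical linear classifier: True means 'predicted to sell'."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return score > 0

# 1000 identical items: every item has the same (made-up) feature vector.
inventory = [(1.0, 3.0)] * 1000

predictions = [will_sell(item) for item in inventory]
distinct_predictions = set(predictions)  # only one distinct label appears
predicted_sales = sum(predictions)       # therefore either 0 or 1000

print(distinct_predictions, predicted_sales)
```

Whatever weights you pick, the count obtained this way can only be "none" or "all", which is useless as a sales forecast.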
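My guess is that, to solve this problem, you would have access to something like a time series of the number of items sold each month over the past 5 years. You would then have to process that data and extrapolate into the future. You would not classify each item individually; instead you would compute a numerical value representing the number of items sold 1, 2 and 3 months from now.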
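A rough sketch of that idea, assuming a simple straight-line trend fitted by ordinary least squares. The monthly figures are invented, `fit_line` is a hypothetical helper, and a real forecast would probably need something richer (seasonality, for instance):

```python
def fit_line(ys):
    """Least-squares fit of y = intercept + slope * x for x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented history: items sold in each of the last 12 months.
monthly_sales = [120, 125, 131, 128, 140, 138, 145, 150, 149, 158, 160, 166]

slope, intercept = fit_line(monthly_sales)
n = len(monthly_sales)

# Extrapolate the trend to the next 3 months (x = n, n + 1, n + 2).
forecast = [intercept + slope * (n + k) for k in range(3)]
total_next_quarter = sum(forecast)

print(forecast, total_next_quarter)
```

The output is a number on a numeric scale, not a class label, and that is what makes this a regression task.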
According to Pattern Recognition and Machine Learning (Christopher M. Bishop, 2006):
Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression.
On top of that, it is important to understand the difference between categorical, ordinal, and numerical variables, as defined in statistics:
A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories.
(...)
An ordinal variable is similar to a categorical variable. The difference between the two is that there is a clear ordering of the variables. For example, suppose you have a variable, economic status, with three categories (low, medium and high). In addition to being able to classify people into these three categories, you can order the categories as low, medium and high.
(...)
A numerical variable is similar to an ordinal variable, except that the intervals between the values of the numerical variable are equally spaced. For example, suppose you have a variable such as annual income that is measured in dollars, and we have three people who make $10,000, $15,000 and $20,000.
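Although your final result will be an integer (one of a discrete set of numbers), note that it is still a numerical value, not a category. You can manipulate numerical values mathematically (for example, compute the average number of items sold over the next year, or find the peak number of items sold within the next 3 months), but you cannot do the same with discrete categories (for example, what would be the mean of a cellphone and a telephone?).
A classification problem is one whose output is categorical or ordinal (discrete categories, in Bishop's terms). A regression problem outputs a numerical value (continuous variables, in Bishop's terms).
Your system may be restricted to outputting integers rather than real numbers, but that does not change the numerical nature of the variable. So your problem is a regression problem.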
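As a small illustration of the distinction between numbers and categories, with invented values and a hypothetical `average` helper: arithmetic works on integer sales counts but has no meaning for category labels.

```python
def average(values):
    """Plain arithmetic mean; only defined for numeric values."""
    return sum(values) / len(values)

# Integer sales counts are still numbers: mean and peak are well defined.
sales_next_3_months = [170, 174, 179]  # invented counts
mean_sales = average(sales_next_3_months)
peak_sales = max(sales_next_3_months)

# Category labels have no arithmetic: averaging them fails.
categories = ["cellphone", "telephone"]
try:
    average(categories)
    categories_have_mean = True
except TypeError:
    categories_have_mean = False

# A real-valued prediction can be rounded to an integer count
# without turning it into a class label.
rounded_forecast = round(174.6)

print(mean_sales, peak_sales, categories_have_mean, rounded_forecast)
```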