Cannot understand sklearn's PolynomialFeatures

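I need help with sklearn's PolynomialFeatures. It works fine with one feature, but whenever I add multiple features, the output array contains some extra values besides the inputs raised to the power of the degree. For example, for this array,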

X=np.array([[230.1,37.8,69.2]])

when I try

X_poly=poly.fit_transform(X)

the output is

[[ 1.00000000e+00 2.30100000e+02 3.78000000e+01 6.92000000e+01
5.29460100e+04 8.69778000e+03 1.59229200e+04 1.42884000e+03
2.61576000e+03 4.78864000e+03]]

Here, what are 8.69778000e+03, 1.59229200e+04, and 2.61576000e+03?

You have 3-dimensional data, and the following code generates all degree-2 polynomial features:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X=np.array([[230.1,37.8,69.2]])
poly = PolynomialFeatures()
X_poly=poly.fit_transform(X)
X_poly
#array([[  1.00000000e+00,   2.30100000e+02,   3.78000000e+01,
#      6.92000000e+01,   5.29460100e+04,   8.69778000e+03,
#      1.59229200e+04,   1.42884000e+03,   2.61576000e+03,
#      4.78864000e+03]])

The same values can also be generated with the following code:

a, b, c = 230.1, 37.8, 69.2 # 3-dimensional data
np.array([[1,a,b,c,a**2,a*b,c*a,b**2,b*c,c**2]]) # all possible degree-2 polynomial features
# array([[  1.00000000e+00,   2.30100000e+02,   3.78000000e+01,
#       6.92000000e+01,   5.29460100e+04,   8.69778000e+03,
#       1.59229200e+04,   1.42884000e+03,   2.61576000e+03,
#       4.78864000e+03]])

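If you have features [a, b, c], the default polynomial features (the default degree in sklearn is 2) are 1, a, b, c, a^2, b^2, c^2, ab, bc, and ca. In the output array, sklearn orders them as [1, a, b, c, a^2, ab, ac, b^2, bc, c^2].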

For example, 2.61576000e+03 is 37.8 x 69.2 = 2615.76 (2615.76 = 2.61576000 x 10^3), which is the b*c term.
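To see where every column comes from, here is a minimal sketch in plain Python that enumerates the degree-2 terms for the sample row. It assumes the column order follows itertools.combinations_with_replacement, which agrees with the output shown in the question. The names list and the label format are only illustrative.

```python
from itertools import combinations_with_replacement
from math import prod

names = ["a", "b", "c"]          # illustrative labels for the three columns
values = [230.1, 37.8, 69.2]     # the sample row from the question
degree = 2

terms = []
for d in range(degree + 1):
    # Every multiset of d column indices gives one polynomial term.
    for combo in combinations_with_replacement(range(len(values)), d):
        label = "*".join(names[i] for i in combo) or "1"
        value = prod(values[i] for i in combo)  # empty product is 1 (bias term)
        terms.append((label, value))

for label, value in terms:
    print(f"{label:>4} = {value:.8e}")
```

The a*b, a*c, and b*c lines print the three values asked about in the question.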

You can use PolynomialFeatures to create new features in a simple way. There is a good reference here. Of course, there are also disadvantages (such as overfitting) to using PolynomialFeatures (see here).

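Edit:
We have to be careful when using polynomial features. The formula for the number of polynomial features is N(n,d) = C(n+d, d), where n is the number of features, d is the degree of the polynomial, and C is the binomial coefficient (combination). In our case the number is C(3+2, 2) = 5!/((5-2)! 2!) = 10. But when the number of features or the degree is high, the number of polynomial features becomes too large. For example: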

N(100,2)=5151
N(100,5)=96560646
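These counts can be checked with the standard library. The helper name n_poly_features is made up for this sketch. It simply evaluates C(n+d, d), so it counts the bias column as well.

```python
from math import comb

def n_poly_features(n_features, degree):
    """Number of polynomial terms of total degree <= degree, bias term included."""
    return comb(n_features + degree, degree)

for n, d in [(3, 2), (100, 2), (100, 5)]:
    print(f"N({n},{d}) = {n_poly_features(n, d)}")
```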

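So in such cases you may need to apply regularization to penalize some of the weights. It is also quite likely that the algorithm will start to suffer from the curse of dimensionality (here is also a very nice discussion).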
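As a rough illustration of combining polynomial features with regularization, the sketch below chains PolynomialFeatures, StandardScaler, and Ridge in a pipeline. The synthetic data, the choice of Ridge, and alpha=1.0 are assumptions made for the example, not a recommendation from the answer above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumption): the target depends on x0 and the interaction x1*x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=200)

model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # Ridge fits its own intercept
    StandardScaler(),   # put all polynomial terms on a comparable scale
    Ridge(alpha=1.0),   # L2 penalty shrinks the weights of the expanded features
)
model.fit(X, y)

coefs = model.named_steps["ridge"].coef_
print(coefs.shape)          # one weight per polynomial feature
print(model.score(X, y))    # R^2 on the training data
```

A larger alpha shrinks the weights more strongly. In practice it would be chosen by cross-validation.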

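PolynomialFeatures generates a new matrix containing all polynomial combinations of the features up to the given degree.

For example, [a] is converted into [1, a, a^2] for degree 2.

You can see the input being transformed into the matrix generated by PolynomialFeatures: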

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
a = np.array([1,2,3,4,5])
a = a[:,np.newaxis]
poly = PolynomialFeatures(degree=2)
a_poly = poly.fit_transform(a)
print(a_poly)

Output:

 [[ 1.  1.  1.]
 [ 1.  2.  4.]
 [ 1.  3.  9.]
 [ 1.  4. 16.]
 [ 1.  5. 25.]]

You can see that the matrix is generated in the form [1, a, a^2].

To observe the polynomial features on a scatter plot, let's use the numbers 1-100.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

# Make the numbers 1-100 (the stop value of np.arange is exclusive)
a = np.arange(1,101,1)
a = a[:,np.newaxis]

#Scaling data with 0 mean and 1 standard Deviation, so it can be observed easily
scaler = StandardScaler()
a = scaler.fit_transform(a)

#Applying PolynomialFeatures
poly = PolynomialFeatures(degree=2)
a_poly = poly.fit_transform(a)

#Flattening Polynomial feature matrix (Creating 1D array), so it can be plotted. 
a_poly = a_poly.flatten()
#Creating array of size a_poly with number series. (For plotting)
xarr = np.arange(1,a_poly.size+1,1)

#Plotting
plt.scatter(xarr,a_poly)
plt.title("Degree 2 Polynomial")
plt.show()

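Output:

[Figure: scatter plot titled "Degree 2 Polynomial"]

Changing to degree=3, we get:

[Figure: scatter plot for degree 3]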

According to the scikit-learn 0.23 docs (and as far back as 0.15), PolynomialFeatures will

[generate] a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
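The two-feature case from the quoted documentation can be checked directly. This is a small sketch, with a = 2 and b = 3 picked arbitrarily:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

a, b = 2.0, 3.0  # arbitrary sample values
poly = PolynomialFeatures(degree=2)
out = poly.fit_transform(np.array([[a, b]]))

# Expected layout according to the docs: [1, a, b, a^2, ab, b^2]
expected = np.array([[1, a, b, a**2, a * b, b**2]])
print(out)
print(np.allclose(out, expected))
```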

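A general way to check the features is to use poly.get_feature_names() (renamed to poly.get_feature_names_out() in scikit-learn 1.0, with the old name removed in 1.2). In this case, it would be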

In [15]: poly.get_feature_names(['a','b','c'])
Out[15]: ['1', 'a', 'b', 'c', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2']

8.69778000e+03, 1.59229200e+04, and 2.61576000e+03 correspond to the a*b, a*c, and b*c terms, respectively.
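To line up each name with its value, something like the following sketch should work on both old and new scikit-learn releases. The hasattr check is just one way to handle the rename, and the labels a, b, c are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[230.1, 37.8, 69.2]])
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# get_feature_names_out exists in scikit-learn >= 1.0;
# older releases only provide get_feature_names.
if hasattr(poly, "get_feature_names_out"):
    names = list(poly.get_feature_names_out(["a", "b", "c"]))
else:
    names = poly.get_feature_names(["a", "b", "c"])

# Print each generated feature next to its value for the sample row.
for name, value in zip(names, X_poly[0]):
    print(f"{name:>4} -> {value:.2f}")
```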