pyspark OneHotEncoded vectors appear to be missing categories?

Ran into a strange issue while trying to use pyspark's OneHotEncoder (https://spark.apache.org/docs/2.1.0/ml-features.html#onehotencoder) to generate one-hot encoded vectors for categorical features, where the one-hot vectors appear to be missing some categories (or are maybe just formatted oddly when displayed?).

Now that the question has been answered (or an answer has been provided), it appears the details below aren't entirely relevant to understanding the problem.

Have a dataset of the following form

1. Wife's age                     (numerical)
2. Wife's education               (categorical)      1=low, 2, 3, 4=high
3. Husband's education            (categorical)      1=low, 2, 3, 4=high
4. Number of children ever born   (numerical)
5. Wife's religion                (binary)           0=Non-Islam, 1=Islam
6. Wife's now working?            (binary)           0=Yes, 1=No
7. Husband's occupation           (categorical)      1, 2, 3, 4
8. Standard-of-living index       (categorical)      1=low, 2, 3, 4=high
9. Media exposure                 (binary)           0=Good, 1=Not good
10. Contraceptive method used     (class attribute)  1=No-use, 2=Long-term, 3=Short-term  

where the actual data looks like

wife_age,wife_edu,husband_edu,num_children,wife_religion,wife_working,husband_occupation,SoL_index,media_exposure,contraceptive
24,2,3,3,1,1,2,3,0,1
45,1,3,10,1,1,3,4,0,1

taken from here: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice.
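For reference, a minimal sketch of how this data might be loaded (the filename cmc.csv is hypothetical and assumes the header row shown above has been added to the raw UCI file):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# header=True consumes the header row; inferSchema=True brings the
# category columns in as numeric types, which OneHotEncoder expects
dataset = spark.read.csv('cmc.csv', header=True, inferSchema=True)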

After some other preprocessing of the data, I then try to encode the categorical and binary (just for practice) features into 1hot vectors via...

from pyspark.ml.feature import OneHotEncoder

# encode each categorical/binary column into a '<col>_1hot' vector column
for inds in ['wife_edu', 'husband_edu', 'husband_occupation', 'SoL_index', 'wife_religion', 'wife_working', 'media_exposure', 'contraceptive']:
    encoder = OneHotEncoder(inputCol=inds, outputCol='%s_1hot' % inds)
    dataset = encoder.transform(dataset)

which produces rows that look like

Row(
    ...., 
    numeric_features=DenseVector([24.0, 3.0]), numeric_features_normalized=DenseVector([-1.0378, -0.1108]), 
    wife_edu_1hot=SparseVector(4, {2: 1.0}), 
    husband_edu_1hot=SparseVector(4, {3: 1.0}), 
    husband_occupation_1hot=SparseVector(4, {2: 1.0}), 
    SoL_index_1hot=SparseVector(4, {3: 1.0}), 
    wife_religion_1hot=SparseVector(1, {0: 1.0}),
    wife_working_1hot=SparseVector(1, {0: 1.0}),
    media_exposure_1hot=SparseVector(1, {0: 1.0}),
    contraceptive_1hot=SparseVector(2, {0: 1.0})
)
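As an aside, the same per-column encoding can also be expressed as a single Pipeline rather than a loop (a sketch equivalent to the loop above):

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder

cat_cols = ['wife_edu', 'husband_edu', 'husband_occupation', 'SoL_index',
            'wife_religion', 'wife_working', 'media_exposure', 'contraceptive']
# one OneHotEncoder stage per column, applied in a single pass
stages = [OneHotEncoder(inputCol=c, outputCol='%s_1hot' % c) for c in cat_cols]
dataset = Pipeline(stages=stages).fit(dataset).transform(dataset)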

My understanding of the sparse vector format is that SparseVector(S, {i1: v1, i2: v2, ..., in: vn}) represents a vector of length S in which all values are 0 except for the indices i1, ..., in, which hold the corresponding values v1, ..., vn (https://www.cs.umd.edu/Outreach/hsContest99/questions/node3.html).
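That understanding can be quickly sanity-checked (a minimal sketch):

from pyspark.ml.linalg import SparseVector

# two equivalent ways of building the same vector of length 4
v1 = SparseVector(4, {2: 1.0})
v2 = SparseVector(4, [2], [1.0])
print(v1 == v2)      # True
print(v1.toArray())  # [0. 0. 1. 0.]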

Based on this, it seems like the SparseVectors in this case must actually denote the highest index in the vector (rather than its size). Furthermore, combining all the features (via pyspark's VectorAssembler) and checking the array version of the resulting dataset.head(n=1) vector shows

input_features=SparseVector(23, {0: -1.0378, 1: -0.1108, 4: 1.0, 9: 1.0, 12: 1.0, 17: 1.0, 18: 1.0, 19: 1.0, 20: 1.0, 21: 1.0})

which indicates a vector looking like

indices:  0        1       2  3  4...           9        12             17 18 19 20 21
        [-1.0378, -0.1108, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

This doesn't seem right, since I don't think it should be possible to have a sequence of >= 3 consecutive 1s (as seen near the tail of the vector above): that would mean one of the onehot vectors (eg. the one for the middle 1) has size 1, which wouldn't make sense for any of the data's features.
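For reference, which original column each slot of the assembled vector came from can be checked via the metadata VectorAssembler attaches to its output column (a sketch, assuming the assembled column is named input_features as above):

# print the slot index and source name of every attribute in the assembled
# vector ('attrs' is grouped by kind, eg. 'numeric'/'binary'/'nominal')
attrs = dataset.schema['input_features'].metadata['ml_attr']['attrs']
for group in attrs.values():
    for attr in group:
        print('%s: %s' % (attr['idx'], attr['name']))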

Very new to machine learning, so I may be confused about some basic concepts here, but does anyone know what could be going on?

Found this in the pyspark docs (https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder):

...with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
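So the vectors aren't missing categories; the slot for the last category is just dropped by default. A minimal sketch of the difference on the wife_edu column (the sizes below assume Spark infers 5 categories, values 0-4, from the data, and the output column names are made up to avoid clashing with the ones created earlier):

from pyspark.ml.feature import OneHotEncoder

# default dropLast=True omits the slot for the last category
enc_drop = OneHotEncoder(inputCol='wife_edu', outputCol='wife_edu_drop')
# dropLast=False keeps a slot for every category
enc_full = OneHotEncoder(inputCol='wife_edu', outputCol='wife_edu_full', dropLast=False)
encoded = enc_full.transform(enc_drop.transform(dataset))
# for wife_edu=2.0:
#   wife_edu_drop -> SparseVector(4, {2: 1.0})  ie. [0, 0, 1, 0]
#   wife_edu_full -> SparseVector(5, {2: 1.0})  ie. [0, 0, 1, 0, 0]
# for wife_edu=4.0 (the last category):
#   wife_edu_drop -> SparseVector(4, {})        ie. [0, 0, 0, 0]
#   wife_edu_full -> SparseVector(5, {4: 1.0})  ie. [0, 0, 0, 0, 1]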

More discussion of why this last-category dropping is done can be found here (http://www.algosome.com/articles/dummy-variable-trap-regression.html) and here (https://stats.stackexchange.com/q/290526/167299).

I'm pretty new to any kind of machine learning, but it seems that, basically (for regression models), the last categorical value is dropped to avoid something called the dummy variable trap, where "the independent variables are multicollinear - a scenario in which two or more variables are highly correlated; in simple terms one variable can be predicted from the others" (so basically you'd end up with a redundant feature, which I assume is not good for weighting an ML model).

Eg. there's no need for a 1hot encoding of [isBoy, isGirl, unspecified] when an encoding of [isBoy, isGirl] conveys the same information about someone's gender, with [1,0]=isBoy, [0,1]=isGirl, and [0,0]=unspecified.
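To see the multicollinearity concretely: with the full [isBoy, isGirl, unspecified] encoding, the three columns always sum to the intercept's column of 1s, so the design matrix loses full column rank (a small numpy sketch):

import numpy as np

# full one-hot rows always sum to 1, duplicating the intercept column
X_full = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
intercept = np.ones((4, 1))
design = np.hstack([intercept, X_full])
print(np.linalg.matrix_rank(design))  # 3, despite having 4 columns -> collinear

# dropping one category ([isBoy, isGirl] with [0,0]=unspecified) fixes it
design_dropped = np.hstack([intercept, X_full[:, :2]])
print(np.linalg.matrix_rank(design_dropped))  # 3 == number of columns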

This link (http://www.algosome.com/articles/dummy-variable-trap-regression.html) provides a good example, with the conclusion being

The solution to the dummy variable trap is to drop one of the categorical variables (or alternatively, drop the intercept constant) - if there are m number of categories, use m-1 in the model, the value left out can be thought of as the reference value and the fit values of the remaining categories represent the change from this reference.

** Note: While looking for an answer to the original question, I found this similar SO post (Why does Spark's OneHotEncoder drop the last category by default?). However, I think the current post warrants existing, since the mentioned post is about why this behavior happens, while this post is about being confused as to what was happening in the first place, and because the mentioned post did not come up when pasting the current post's title into google.