Wrong vector size of OneHotEncoder in pyspark

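I have been inspecting the output of OneHotEncoder in pyspark. From forum posts and the encoder's documentation, I understood that the size of the encoded vector equals the number of distinct values in the column being encoded.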

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = sqlContext.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")

model = stringIndexer.fit(df)

indexed = model.transform(df)

encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")

encoded = encoder.transform(indexed)
encoded.show()

Here is the output of the code above:

+---+--------+--------------+-------------+
| id|category|categoryIndex|  categoryVec|
+---+--------+--------------+-------------+
|  0|       a|           0.0|(2,[0],[1.0])|
|  1|       b|           2.0|    (2,[],[])|
|  2|       c|           1.0|(2,[1],[1.0])|
|  3|       a|           0.0|(2,[0],[1.0])|
|  4|       a|           0.0|(2,[0],[1.0])|
|  5|       c|           1.0|(2,[1],[1.0])|
+---+--------+--------------+-------------+

Going by the categoryVec column, the vector has size 2, yet the "category" column contains 3 distinct values: a, b, and c. What am I missing here?

From the documentation of pyspark.ml.feature.OneHotEncoder:

class pyspark.ml.feature.OneHotEncoder(dropLast=True, inputCol=None, outputCol=None)

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].

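So for n categories you get an output vector of size n-1 unless you set dropLast to False. There is nothing wrong or strange about this: n-1 indices are enough to map every category uniquely, because the last category is represented by the all-zero vector. In the output above that is "b" (index 2.0), which is encoded as (2,[],[]), a sparse vector of size 2 with no non-zero entries.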
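To make the dropLast behaviour concrete, here is a minimal pure-Python sketch of the mapping described in the quoted documentation. The helper `one_hot` and its parameters are hypothetical names chosen for illustration; this is not Spark's implementation and it returns plain dense lists rather than Spark sparse vectors.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    Hypothetical helper that mimics the documented dropLast behaviour;
    it is an illustration, not Spark's actual code.
    """
    # With drop_last, the vector has one slot fewer than there are categories.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    # The last category has no slot when drop_last is set, so it stays all zeros.
    if index < size:
        vec[int(index)] = 1.0
    return vec


# categoryIndex values taken from the question's output: a -> 0.0, c -> 1.0, b -> 2.0
for label, idx in [("a", 0.0), ("c", 1.0), ("b", 2.0)]:
    print(label, one_hot(idx, 3), one_hot(idx, 3, drop_last=False))
```

With the default, "b" comes out as [0.0, 0.0], which matches the empty sparse vector in the question. With drop_last=False the same sketch yields three-element vectors, which is the effect the answer describes for passing dropLast=False to the encoder's constructor.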