Wrong vector size of OneHotEncoder in pyspark
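I have been checking the output of OneHotEncoder in pyspark. I read in forums and in the encoder's documentation that the size of the encoded vector equals the number of distinct values in the column being encoded.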
from pyspark.ml.feature import OneHotEncoder, StringIndexer
df = sqlContext.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
], ["id", "category"])
stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()
Here is the output of the code above:
+---+--------+-------------+-------------+
| id|category|categoryIndex|  categoryVec|
+---+--------+-------------+-------------+
|  0|       a|          0.0|(2,[0],[1.0])|
|  1|       b|          2.0|    (2,[],[])|
|  2|       c|          1.0|(2,[1],[1.0])|
|  3|       a|          0.0|(2,[0],[1.0])|
|  4|       a|          0.0|(2,[0],[1.0])|
|  5|       c|          1.0|(2,[1],[1.0])|
+---+--------+-------------+-------------+
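Reading the categoryVec column, the vector size is 2, whereas the "category" column has 3 distinct values: a, b and c. Please help me understand what I am missing here.
From the documentation for pyspark.ml.feature.OneHotEncoder: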
class pyspark.ml.feature.OneHotEncoder(dropLast=True, inputCol=None, outputCol=None)
A one-hot encoder that maps a column of category indices to a column
of binary vectors, with at most a single one-value per row that
indicates the input category index. For example with 5 categories, an
input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0,
0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence
linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0,
0.0].
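So for n categories you get an output vector of size n-1 unless you set dropLast to False. There is nothing wrong or strange about this: you only need n-1 indices to uniquely map all categories, because the last category is represented by the all-zeros vector.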
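To make the rule concrete, here is a minimal pure-Python sketch of the mapping the documentation describes. The helper name one_hot and its parameters are hypothetical and are not part of pyspark; it only mimics the documented dropLast behaviour using dense lists, whereas the real encoder produces sparse vectors.

```python
def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot list for a category index.

    Hypothetical helper for illustration only; not a pyspark API.
    With drop_last=True the last category gets no slot of its own
    and is encoded as the all-zeros vector.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


# Indices taken from the categoryIndex column in the question:
# a -> 0, c -> 1, b -> 2
for label, idx in [("a", 0), ("c", 1), ("b", 2)]:
    print(label, one_hot(idx, 3), one_hot(idx, 3, drop_last=False))
```

With 3 categories, "b" (index 2) comes out as [0.0, 0.0] by default, which corresponds to the (2,[],[]) row in the question. Passing drop_last=False in the sketch plays the role of dropLast=False in the encoder and yields vectors of size 3.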