pyspark - Convert sparse vector obtained after one hot encoding into columns
I am using the Apache Spark ML library to handle categorical features with one-hot encoding. After writing the code below, I get a vector c_idx_vec as the output of the one-hot encoding. I understand how to interpret this output vector, but I cannot figure out how to convert it into columns so that I get a new, transformed dataframe. Take this dataset, for example:
>>> from pyspark.ml.feature import OneHotEncoder, StringIndexer
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
By default, the OneHotEncoder will drop the last category:
>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(2,[0],[1.0])|
| 1.5| a| 0.0|(2,[0],[1.0])|
|10.0| b| 1.0|(2,[1],[1.0])|
| 3.2| c| 2.0| (2,[],[])|
+----+---+-----+-------------+
Of course, this behavior can be changed:
>>> oe.setDropLast(False)
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(3,[0],[1.0])|
| 1.5| a| 0.0|(3,[0],[1.0])|
|10.0| b| 1.0|(3,[1],[1.0])|
| 3.2| c| 2.0|(3,[2],[1.0])|
+----+---+-----+-------------+
So, I want to know how to convert my c_idx_vec vector into a new dataframe like the one shown below:
Not sure whether this is the most efficient or simplest way, but you can do it with a udf; starting from your fl dataframe:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
def ith_(v, i):
    try:
        return float(v[i])
    except ValueError:
        return None
ith = udf(ith_, DoubleType())
(fl.withColumn('is_a', ith("c_idx_vec", lit(0)))
   .withColumn('is_b', ith("c_idx_vec", lit(1)))
   .withColumn('is_c', ith("c_idx_vec", lit(2))).show())
The result is:
+----+---+-----+-------------+----+----+----+
| x| c|c_idx| c_idx_vec|is_a|is_b|is_c|
+----+---+-----+-------------+----+----+----+
| 1.0| a| 0.0|(3,[0],[1.0])| 1.0| 0.0| 0.0|
| 1.5| a| 0.0|(3,[0],[1.0])| 1.0| 0.0| 0.0|
|10.0| b| 1.0|(3,[1],[1.0])| 0.0| 1.0| 0.0|
| 3.2| c| 2.0|(3,[2],[1.0])| 0.0| 0.0| 1.0|
+----+---+-----+-------------+----+----+----+
i.e., exactly as requested.
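As a side note, on Spark 3.0 and later you can avoid the udf entirely with vector_to_array; a minimal sketch, assuming Spark 3.0+ and reusing the fl dataframe above (the is_a/is_b/is_c names are just the ones requested):
from pyspark.sql.functions import col
from pyspark.ml.functions import vector_to_array  # available since Spark 3.0

# expand the sparse vector into a plain array column, then pick entries by position
arr = fl.withColumn("v", vector_to_array("c_idx_vec"))
arr.select("x", "c", "c_idx", "c_idx_vec",
           *[col("v")[i].alias(n) for i, n in enumerate(["is_a", "is_b", "is_c"])]).show()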
HT (and +1) to the answer above for providing the udf.
You can do the following:
>>> from pyspark.ml.feature import OneHotEncoder, StringIndexer
>>>
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
>>>
>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> oe.setDropLast(False)
OneHotEncoder_49e58b281387d8dc0c6b
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(3,[0],[1.0])|
| 1.5| a| 0.0|(3,[0],[1.0])|
|10.0| b| 1.0|(3,[1],[1.0])|
| 3.2| c| 2.0|(3,[2],[1.0])|
+----+---+-----+-------------+
# Get c and its respective index. The one-hot encoder will put those at the same index in the vector.
>>> colIdx = fl.select("c","c_idx").distinct().rdd.collectAsMap()
>>> colIdx
{'c': 2.0, 'b': 1.0, 'a': 0.0}
>>>
>>> colIdx = sorted((value, "ls_" + key) for (key, value) in colIdx.items())
>>> colIdx
[(0.0, 'ls_a'), (1.0, 'ls_b'), (2.0, 'ls_c')]
>>>
>>> newCols = list(map(lambda x: x[1], colIdx))
>>> actualCol = fl.columns
>>> actualCol
['x', 'c', 'c_idx', 'c_idx_vec']
>>> allColNames = actualCol + newCols
>>> allColNames
['x', 'c', 'c_idx', 'c_idx_vec', 'ls_a', 'ls_b', 'ls_c']
>>>
>>> def extract(row):
...     return tuple(map(lambda x: row[x], row.__fields__)) + tuple(row.c_idx_vec.toArray().tolist())
...
>>> result = fl.rdd.map(extract).toDF(allColNames)
>>> result.show(20, False)
+----+---+-----+-------------+----+----+----+
|x |c |c_idx|c_idx_vec |ls_a|ls_b|ls_c|
+----+---+-----+-------------+----+----+----+
|1.0 |a |0.0 |(3,[0],[1.0])|1.0 |0.0 |0.0 |
|1.5 |a |0.0 |(3,[0],[1.0])|1.0 |0.0 |0.0 |
|10.0|b |1.0 |(3,[1],[1.0])|0.0 |1.0 |0.0 |
|3.2 |c |2.0 |(3,[2],[1.0])|0.0 |0.0 |1.0 |
+----+---+-----+-------------+----+----+----+
# Typecast the new columns to int
>>> for col in newCols:
...     result = result.withColumn(col, result[col].cast("int"))
...
>>> result.show(20, False)
+----+---+-----+-------------+----+----+----+
|x |c |c_idx|c_idx_vec |ls_a|ls_b|ls_c|
+----+---+-----+-------------+----+----+----+
|1.0 |a |0.0 |(3,[0],[1.0])|1 |0 |0 |
|1.5 |a |0.0 |(3,[0],[1.0])|1 |0 |0 |
|10.0|b |1.0 |(3,[1],[1.0])|0 |1 |0 |
|3.2 |c |2.0 |(3,[2],[1.0])|0 |0 |1 |
+----+---+-----+-------------+----+----+----+
Hope this helps!
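One caveat on the extract function above: row.__fields__ is an internal attribute of Row. A sketch of the same extraction via the public asDict() method instead, reusing the actualCol and allColNames lists defined above:
def extract(row):
    # pull the original columns in order by name, then append the dense vector values
    d = row.asDict()
    return tuple(d[name] for name in actualCol) + tuple(row.c_idx_vec.toArray().tolist())

result = fl.rdd.map(extract).toDF(allColNames)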
I could not find a way to access the sparse vector with the dataframe API, so I converted it to an rdd.
from pyspark.sql import Row
# column names
labels = ['a', 'b', 'c']
extract_f = lambda row: Row(**row.asDict(), **dict(zip(labels, row.c_idx_vec.toArray())))
fe.rdd.map(extract_f).collect()
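If you want a dataframe back rather than a collected list of Rows, it is safer to convert the numpy values to native Python floats first; a minimal sketch under that assumption:
# .tolist() yields plain Python floats, which toDF() handles reliably
extract_f = lambda row: Row(**row.asDict(),
                            **dict(zip(labels, row.c_idx_vec.toArray().tolist())))
fe.rdd.map(extract_f).toDF().show()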
Given that the situation specifies using StringIndexer to generate the index numbers, followed by OneHotEncoderEstimator to generate the one-hot encoding, the whole code from start to finish would look like this:
- Generate the data and index the string values, with the StringIndexerModel object "saved" for later use:
>>> from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>>
>>> # need to save the indexer model object for indexing label info to be used later
>>> ss_fit = ss.fit(fd)
>>> ss_fit.labels # to be used later
['a', 'b', 'c']
>>> ff = ss_fit.transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
- Use the OneHotEncoderEstimator class for the one-hot encoding, since OneHotEncoder is being deprecated:
>>> oe = OneHotEncoderEstimator(inputCols=["c_idx"],outputCols=["c_idx_vec"])
>>> oe_fit = oe.fit(ff)
>>> fe = oe_fit.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(2,[0],[1.0])|
| 1.5| a| 0.0|(2,[0],[1.0])|
|10.0| b| 1.0|(2,[1],[1.0])|
| 3.2| c| 2.0| (2,[],[])|
+----+---+-----+-------------+
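(Note: in Spark 3.0 the estimator was renamed back to OneHotEncoder, keeping the same multi-column fit/transform API; a sketch of the equivalent there:)
from pyspark.ml.feature import OneHotEncoder  # Spark 3.0+ estimator-style API

oe = OneHotEncoder(inputCols=["c_idx"], outputCols=["c_idx_vec"])
fe = oe.fit(ff).transform(ff)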
- Reshape the one-hot binary values. The one-hot values will always be 0.0 or 1.0:
>>> from pyspark.sql.types import FloatType, IntegerType
>>> from pyspark.sql.functions import lit, udf
>>> ith = udf(lambda v, i: float(v[i]), FloatType())
>>> fx = fe
>>> for sidx, oe_col in zip([ss_fit], oe.getOutputCols()):
...
...     # iterate over string values and ignore the last one
...     for ii, val in list(enumerate(sidx.labels))[:-1]:
...         fx = fx.withColumn(
...             sidx.getInputCol() + '_' + val,
...             ith(oe_col, lit(ii)).astype(IntegerType())
...         )
>>> fx.show()
+----+---+-----+-------------+---+---+
| x| c|c_idx| c_idx_vec|c_a|c_b|
+----+---+-----+-------------+---+---+
| 1.0| a| 0.0|(2,[0],[1.0])| 1| 0|
| 1.5| a| 0.0|(2,[0],[1.0])| 1| 0|
|10.0| b| 1.0|(2,[1],[1.0])| 0| 1|
| 3.2| c| 2.0| (2,[],[])| 0| 0|
+----+---+-----+-------------+---+---+
Note that Spark drops the last category by default. Thus, following that behavior, a c_c column is not needed here.
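Finally, since the indexer and encoder are chained anyway, the two stages can also be wrapped in a Pipeline; a minimal sketch, assuming the ss and oe stages defined above:
from pyspark.ml import Pipeline

# fit both stages in one go; the fitted model keeps the stages for label lookup
model = Pipeline(stages=[ss, oe]).fit(fd)
fe = model.transform(fd)
ss_fit = model.stages[0]  # the fitted StringIndexerModel, exposing .labels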