Convert org.apache.spark.mllib.linalg.Matrix to spark dataframe in Scala

I have an input dataframe, input_df:

+---------------+--------------------+
|Main_CustomerID|              Vector|
+---------------+--------------------+
|         725153|[3.0,2.0,6.0,0.0,9.0|
|         873008|[4.0,1.0,0.0,1.0,...|
|         625109|[1.0,0.0,6.0,1.0,...|
|         817171|[0.0,4.0,0.0,7.0,...|
|         611498|[1.0,0.0,4.0,5.0,...|
+---------------+--------------------+

The schema of input_df is:

root
 |-- Main_CustomerID: integer (nullable = true)
 |-- Vector: vector (nullable = true)

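Following a reference, I created the indexed row matrix and then ran:

val lm = irm.toIndexedRowMatrix.toBlockMatrix.toLocalMatrix

to find the cosine similarity between the columns. Now I have a resulting mllib matrix: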

cosineSimilarity: org.apache.spark.mllib.linalg.Matrix =
0.0  0.4199605255658081  0.5744269579035528  0.22075539284417395  0.561434614044346
0.0  0.0                 0.2791452631195413  0.7259079527665503   0.6206918387272496
0.0  0.0                 0.0                 0.31792539222893695  0.6997167152675132
0.0  0.0                 0.0                 0.0                  0.6776404124278828
0.0  0.0                 0.0                 0.0                  0.0

Now I need to convert lm, which is of type org.apache.spark.mllib.linalg.Matrix, into a dataframe. I want my output dataframe to look like this:

+---+------------------+------------------+-------------------+------------------+
| _1|                _2|                _3|                 _4|                _5|
+---+------------------+------------------+-------------------+------------------+
|0.0|0.4199605255658081|0.5744269579035528|0.22075539284417395| 0.561434614044346|
|0.0|               0.0|0.2791452631195413| 0.7259079527665503|0.6206918387272496|
|0.0|               0.0|               0.0|0.31792539222893695|0.6997167152675132|
|0.0|               0.0|               0.0|                0.0|0.6776404124278828|
|0.0|               0.0|               0.0|                0.0|               0.0|
+---+------------------+------------------+-------------------+------------------+

How can I do this in Scala?

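To convert the Matrix into the dataframe shown above, do the following. The code first converts the matrix into a dataframe with a single column that holds one array per matrix row. foldLeft is then used to split that array into separate columns.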

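import spark.implicits._

// Zero-based column indices of the matrix
val cols = (0 until lm.numCols).toSeq

// colIter walks the columns, so transpose first to get one array per original row
val df = lm.transpose
  .colIter.toSeq
  .map(_.toArray)
  .toDF("arr")

// Add one column per array element (_1, _2, ...), then drop the array column
val df2 = cols.foldLeft(df)((df, i) => df.withColumn("_" + (i + 1), $"arr"(i)))
  .drop("arr")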
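
As a possible alternative, the same result can be sketched without the intermediate array column by building Row objects directly and supplying an explicit schema. This is only a sketch under assumptions: the object name MatrixToDataFrame, the helper matrixToDF, the local SparkSession, and the small 2x2 sample matrix are all hypothetical and are not part of the original question or answer. It relies on Matrix.rowIter (available in Spark 2.0 and later), which iterates over the rows of the matrix.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object MatrixToDataFrame {
  // Hypothetical helper: one DataFrame row per matrix row, columns named _1 .. _n.
  def matrixToDF(spark: SparkSession, m: Matrix): DataFrame = {
    // One non-nullable double column per matrix column
    val schema = StructType(
      (1 to m.numCols).map(i => StructField(s"_$i", DoubleType, nullable = false))
    )
    // rowIter yields each matrix row as a Vector; turn each into a Row
    val rows = m.rowIter.map(v => Row.fromSeq(v.toArray.toSeq)).toList
    spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("matrix-to-df")
      .getOrCreate()

    // Matrices.dense takes its values in column-major order,
    // so this sample matrix has the rows (0.0, 0.5) and (0.0, 0.0)
    val lm: Matrix = Matrices.dense(2, 2, Array(0.0, 0.0, 0.5, 0.0))

    matrixToDF(spark, lm).show()
    spark.stop()
  }
}
```

For a small matrix such as the 5x5 one in the question, either approach should be fine. The explicit-schema version avoids calling withColumn once per column, which may matter for matrices with many columns because each call adds another projection to the query plan.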