Convert org.apache.spark.mllib.linalg.Matrix to a Spark DataFrame in Scala
I have an input DataFrame, input_df:
+---------------+--------------------+
|Main_CustomerID|              Vector|
+---------------+--------------------+
|         725153|[3.0,2.0,6.0,0.0,9.0|
|         873008|[4.0,1.0,0.0,1.0,...|
|         625109|[1.0,0.0,6.0,1.0,...|
|         817171|[0.0,4.0,0.0,7.0,...|
|         611498|[1.0,0.0,4.0,5.0,...|
+---------------+--------------------+
The schema of input_df is:
root
|-- Main_CustomerID: integer (nullable = true)
|-- Vector: vector (nullable = true)
Following a reference, I created an indexed row matrix and then ran:
val lm = irm.toIndexedRowMatrix.toBlockMatrix.toLocalMatrix
I did this to find the cosine similarity between columns. I now have the resulting mllib matrix:
cosineSimilarity: org.apache.spark.mllib.linalg.Matrix =
0.0 0.4199605255658081 0.5744269579035528 0.22075539284417395 0.561434614044346
0.0 0.0 0.2791452631195413 0.7259079527665503 0.6206918387272496
0.0 0.0 0.0 0.31792539222893695 0.6997167152675132
0.0 0.0 0.0 0.0 0.6776404124278828
0.0 0.0 0.0 0.0 0.0
Now I need to convert lm, which is of type org.apache.spark.mllib.linalg.Matrix, to a DataFrame. I would like the output DataFrame to look like this:
+---+------------------+------------------+-------------------+------------------+
| _1|                _2|                _3|                 _4|                _5|
+---+------------------+------------------+-------------------+------------------+
|0.0|0.4199605255658081|0.5744269579035528|0.22075539284417395| 0.561434614044346|
|0.0|               0.0|0.2791452631195413| 0.7259079527665503|0.6206918387272496|
|0.0|               0.0|               0.0|0.31792539222893695|0.6997167152675132|
|0.0|               0.0|               0.0|                0.0|0.6776404124278828|
|0.0|               0.0|               0.0|                0.0|               0.0|
+---+------------------+------------------+-------------------+------------------+
How can I do this in Scala?
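To convert the Matrix to the DataFrame you describe, do the following. The matrix is first converted to a DataFrame with a single column that holds one array per matrix row. foldLeft is then used to split that array into separate columns.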
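import spark.implicits._

// Column indices 0 .. numCols-1
val cols = (0 until lm.numCols).toSeq

// The columns of the transpose are the rows of lm;
// each becomes an Array[Double] in a single column named "arr".
val df = lm.transpose
  .colIter.toSeq
  .map(_.toArray)
  .toDF("arr")

// Pull each array element into its own column (_1, _2, ...), then drop the array.
val df2 = cols.foldLeft(df)((acc, i) => acc.withColumn("_" + (i + 1), $"arr"(i)))
  .drop("arr")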
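
As a possible alternative, the same result can be sketched without the intermediate array column by building Rows against an explicit schema. This is only a sketch under some assumptions: the object name MatrixToDataFrame, the helper matrixToDF, the local SparkSession, and the small 3x3 stand-in matrix are all hypothetical and chosen for illustration. It assumes Spark 2.0 or later, where Matrix.rowIter is available.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object MatrixToDataFrame {
  // Hypothetical helper: one Double column per matrix column, named _1, _2, ...
  def matrixToDF(spark: SparkSession, m: Matrix): DataFrame = {
    val schema = StructType(
      (1 to m.numCols).map(i => StructField(s"_$i", DoubleType, nullable = false))
    )
    // rowIter yields each matrix row as a Vector; wrap each one in a Row.
    val rows = m.rowIter.map(v => Row.fromSeq(v.toArray.toSeq)).toList
    spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("matrix-to-df")
      .getOrCreate()

    // Hypothetical stand-in for lm; Matrices.dense takes values in column-major order.
    val lm: Matrix = Matrices.dense(3, 3,
      Array(0.0, 0.0, 0.0, 0.42, 0.0, 0.0, 0.57, 0.28, 0.0))

    matrixToDF(spark, lm).show()
    spark.stop()
  }
}
```

Both approaches collect the whole matrix on the driver, so they are only suitable for matrices that fit in driver memory, which is already the case for anything returned by toLocalMatrix.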