将向量数组转换为 DenseVector
Convert array of vectors to DenseVector
我 运行 使用 Scala 的 Spark 2.1。我正在尝试将向量数组转换为 DenseVector
.
这是我的数据框:
scala> df_transformed.printSchema()
root
|-- id: long (nullable = true)
|-- vals: vector (nullable = true)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
scala> df_transformed.show()
+------------+--------------------+--------------------+
| id| vals| hashValues|
+------------+--------------------+--------------------+
|401310732094|[-0.37154,-0.1159...|[[-949518.0], [47...|
|292125586474|[-0.30407,0.35437...|[[-764013.0], [31...|
|362051108485|[-0.36748,0.05738...|[[-688834.0], [18...|
|222480119030|[-0.2509,0.55574,...|[[-1167047.0], [2...|
|182270925238|[0.32288,-0.60789...|[[-836660.0], [97...|
+------------+--------------------+--------------------+
例如,我需要将 hashValues
列的值提取到一个 DenseVector
for id 401310732094.
这可以通过 UDF
:
import spark.implicits._
val convertToVec = udf((array: Seq[Vector]) =>
Vectors.dense(array.flatMap(_.toArray).toArray)
)
val df = df_transformed.withColumn("hashValues", convertToVec($"hashValues"))
这将用包含 DenseVector
.
的新列覆盖 hashValues
列
使用具有以下架构的数据框进行测试:
root
|-- id: integer (nullable = false)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
结果是:
root
|-- id: integer (nullable = false)
|-- hashValues: vector (nullable = true)
我 运行 使用 Scala 的 Spark 2.1。我正在尝试将向量数组转换为 DenseVector
.
这是我的数据框:
scala> df_transformed.printSchema()
root
|-- id: long (nullable = true)
|-- vals: vector (nullable = true)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
scala> df_transformed.show()
+------------+--------------------+--------------------+
| id| vals| hashValues|
+------------+--------------------+--------------------+
|401310732094|[-0.37154,-0.1159...|[[-949518.0], [47...|
|292125586474|[-0.30407,0.35437...|[[-764013.0], [31...|
|362051108485|[-0.36748,0.05738...|[[-688834.0], [18...|
|222480119030|[-0.2509,0.55574,...|[[-1167047.0], [2...|
|182270925238|[0.32288,-0.60789...|[[-836660.0], [97...|
+------------+--------------------+--------------------+
例如,我需要将 hashValues
列的值提取到一个 DenseVector
for id 401310732094.
这可以通过 UDF
:
import spark.implicits._
val convertToVec = udf((array: Seq[Vector]) =>
Vectors.dense(array.flatMap(_.toArray).toArray)
)
val df = df_transformed.withColumn("hashValues", convertToVec($"hashValues"))
这将用包含 DenseVector
.
hashValues
列
使用具有以下架构的数据框进行测试:
root
|-- id: integer (nullable = false)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
结果是:
root
|-- id: integer (nullable = false)
|-- hashValues: vector (nullable = true)