Create an identity matrix of DenseVectors as a Spark DataFrame

Question

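I need to know how to create an identity matrix of DenseVectors of arbitrary size in Spark. I tried a few things with the mllib.linalg.distributed module, to no avail. What I need is a DataFrame with a single column, "features", whose rows are DenseVectors, where each row is the corresponding row of the identity matrix.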

Answer 1

With pyspark.mllib.linalg.distributed this is straightforward:

from pyspark import SparkContext
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry


def identity(n: int, sc: SparkContext) -> CoordinateMatrix:
    """Build a distributed n x n identity matrix."""
    # Only the n diagonal entries (i, i, 1.0) are stored.
    return CoordinateMatrix(
        sc.range(n).map(lambda i: MatrixEntry(i, i, 1.0)), n, n)
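
If the DataFrame layout from the question is still needed, for example for a small n, one possible sketch follows. It is not part of the original answer: the helper names identity_row and identity_df and the "id" column are hypothetical, and it assumes an active SparkSession is passed in as spark and that the DenseVectors should come from pyspark.ml.linalg.

```python
from typing import List, Tuple


def identity_row(i: int, n: int) -> Tuple[int, List[float]]:
    """Return (row index, values) for row i of the n x n identity matrix."""
    if not 0 <= i < n:
        raise ValueError(f"row index {i} out of range for n={n}")
    return i, [1.0 if j == i else 0.0 for j in range(n)]


def identity_df(spark, n: int):
    """Hypothetical helper: DataFrame with columns "id" and "features".

    Assumes `spark` is an active SparkSession. The "id" column is kept
    because DataFrame rows carry no inherent order.
    """
    # Imported lazily so identity_row stays usable without PySpark installed.
    from pyspark.ml.linalg import Vectors

    def to_row(i: int):
        idx, values = identity_row(i, n)
        return idx, Vectors.dense(values)

    rows = spark.sparkContext.range(n).map(to_row)
    return spark.createDataFrame(rows, ["id", "features"])
```

Calling identity_df(spark, 3).orderBy("id").show(truncate=False) should display the three rows in order. Swapping Vectors.dense(values) for Vectors.sparse(n, [i], [1.0]) would keep the same column type while storing a single value per row.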

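Using DataFrames and DenseVectors for this doesn't make much sense. First of all, DataFrames are unordered and don't support algebraic operations. Furthermore, DenseVectors will cause memory problems for any matrix large enough to justify a distributed data structure in the first place.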
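
The memory point can be made concrete with some rough arithmetic. The sketch below is only an estimate under stated assumptions: it counts raw 8-byte values, ignores object and serialization overhead, and treats each coordinate entry as two 8-byte indices plus one 8-byte value. The function names are hypothetical.

```python
BYTES_PER_VALUE = 8  # assumption: 8-byte doubles and 8-byte long indices


def dense_identity_bytes(n: int) -> int:
    """Raw storage when each of the n rows is a dense vector of n doubles."""
    return n * n * BYTES_PER_VALUE


def coordinate_identity_bytes(n: int) -> int:
    """Raw storage for n diagonal entries, each a (row, col, value) triple."""
    return n * 3 * BYTES_PER_VALUE


for n in (1_000, 100_000):
    # Dense storage grows with n squared, coordinate storage with n.
    print(n, dense_identity_bytes(n), coordinate_identity_bytes(n))
```

For n = 100,000 this gives 80,000,000,000 bytes (about 80 GB) for dense rows versus 2,400,000 bytes (about 2.4 MB) for coordinate entries, which is why a sparse or coordinate representation is the natural fit for an identity matrix.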

apache-spark

apache-spark-sql

pyspark

apache-spark-mllib