Getting old data back after executing PCA using SPARK

Question

I am using PCA to reduce an m*n matrix to an m*2 matrix.

I am using the code snippet from the Apache Spark site in my project, and it works.

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = ...

// Compute the top 2 principal components.
// They are stored in a local dense matrix.
val pc: Matrix = mat.computePrincipalComponents(2)

// Project the rows to the linear space spanned by the top 2 principal components.
val projected: RowMatrix = mat.multiply(pc)

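I have not found anything in the API for getting the old data back, that is, for finding out which columns PCA selected as principal components.

Is there a library function that does this?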

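Update

If the PCA algorithm selected and transformed two columns of my data, how can I verify which columns of the old data are involved in that transformation?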

Example

Multi-dimensional matrix:

After the PCA algorithm reduces it to 2 dimensions, I get:

-1.4 3  
2 -4.0 
3 -2.9  
-0.9 6

In other words, how can I tell which columns of the original data PCA selected as principal components for the reduction?

Thanks in advance.

Answer 1

The matrix pc contains the principal components as its columns. According to the documentation:

Rows correspond to observations and columns correspond to variables. The principal components are stored a local matrix of size n-by-k. Each column corresponds for one principal component, and the columns are in descending order of component variance.

You can therefore inspect the i-th column with:

val pc: Matrix = ...
val i: Int = ...

for(row <- 0 until pc.numRows) {
  println(pc(row, i))
}

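Update

If you have the input matrix mat =

where each row is an example and each column is a variable, then you can compute the PCA. The two principal components with the largest variance are pc =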

0.6072    0.2049
0.3466    0.6626
-0.4674    0.7098
0.4343   -0.1024
0.3225    0.0689

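Each column is a projection direction and yields one dimension of the reduced data. To obtain the reduced data, you compute mat * pc, which gives you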

2.1588    0.0706
-0.2041    9.5523
6.6652    8.9843
12.8425    5.5844

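This is what your data looks like when projected into the low-dimensional vector space. Again, each row represents an example and each column represents a variable.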
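To make the product mat * pc concrete, here is a minimal plain-Scala sketch (no Spark required) that projects a single observation onto the two components listed above. The names ProjectionSketch and project, and the sample row, are hypothetical and only for illustration; the row is not taken from the question's data.

```scala
object ProjectionSketch {
  // Principal components quoted in the answer:
  // 5 original variables (rows) x 2 components (columns).
  val pc: Array[Array[Double]] = Array(
    Array( 0.6072,  0.2049),
    Array( 0.3466,  0.6626),
    Array(-0.4674,  0.7098),
    Array( 0.4343, -0.1024),
    Array( 0.3225,  0.0689)
  )

  // Project one observation: entry j of the result is the dot product
  // of the observation with column j of pc.
  def project(row: Array[Double], pc: Array[Array[Double]]): Array[Double] = {
    require(row.length == pc.length, "row length must equal the number of rows in pc")
    val k = pc.head.length
    Array.tabulate(k) { j =>
      row.indices.map(i => row(i) * pc(i)(j)).sum
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical observation with 5 variables (an assumption for this sketch).
    val row = Array(1.0, 0.0, 2.0, 0.0, 0.0)
    println(project(row, pc).mkString(" "))
  }
}
```

Applying project to every row of the input gives the same kind of m-by-2 result that mat.multiply(pc) produces in the distributed setting.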

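If I understand your question correctly, you are looking for the columns of the matrix pc, which tell you how much each original dimension contributes to each projected dimension. Note that PCA does not select a subset of the original columns: each principal component is a linear combination of all of them. The projection is simply the scalar product of the original data with the projection directions (the columns of pc).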
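One way to read those contributions is to rank the original columns by the absolute value of their weight in each component. The sketch below does this in plain Scala for the pc values quoted above; LoadingsSketch and rankByLoading are hypothetical names, and with a Spark Matrix you would read each entry as pc(row, col), as in the loop earlier in this answer.

```scala
object LoadingsSketch {
  // Principal components quoted in the answer:
  // 5 original variables (rows) x 2 components (columns).
  val pc: Array[Array[Double]] = Array(
    Array( 0.6072,  0.2049),
    Array( 0.3466,  0.6626),
    Array(-0.4674,  0.7098),
    Array( 0.4343, -0.1024),
    Array( 0.3225,  0.0689)
  )

  // For one component (a column of pc), pair each original column index
  // with its weight and sort by absolute weight, largest first.
  def rankByLoading(pc: Array[Array[Double]], component: Int): Seq[(Int, Double)] =
    pc.indices
      .map(i => (i, pc(i)(component)))
      .sortBy { case (_, w) => -math.abs(w) }

  def main(args: Array[String]): Unit = {
    for (c <- pc.head.indices) {
      val ranked = rankByLoading(pc, c)
        .map { case (i, w) => s"column $i ($w)" }
        .mkString(", ")
      println(s"component $c: $ranked")
    }
  }
}
```

A large absolute weight means that original column strongly influences the component, but every column with a non-zero weight still takes part in the transformation.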

algorithm

scala

pca

apache-spark