How to calculate a correlation matrix in Spark using Scala?
In Python pandas, when I have a DataFrame df like this:
c1 | c2 | c3 |
---|---|---|
0.1 | 0.3 | 0.5 |
0.2 | 0.4 | 0.6 |
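I can use df.corr() to calculate the correlation matrix.
How can I do this in Spark using Scala?
I have read the official documentation, but the data structure there is different from the one above, and I don't know how to convert it.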
Update 1:
val df = Seq(
  (0.1, 0.3, 0.5, 0.6, 0.8, 0.1, 0.3, 0.5, 0.6, 0.8),
  (0.2, 0.4, 0.6, 0.7, 0.7, 0.2, 0.4, 0.6, 0.7, 0.7)
).toDF("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10")
val assembler = new VectorAssembler()
  .setInputCols(Array("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"))
  .setOutputCol("vectors")
How can I display the whole result when there are 10 columns?
You can use the following code to solve your problem. It computes the Pearson correlation, which is also the default method of the pandas function.
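import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import spark.implicits._ // needed for toDF; already in scope in spark-shell

val df = Seq(
  (0.1, 0.3, 0.5),
  (0.2, 0.4, 0.6)
).toDF("c1", "c2", "c3")

// Pack the numeric columns into one vector column, which Correlation.corr expects
val assembler = new VectorAssembler()
  .setInputCols(Array("c1", "c2", "c3"))
  .setOutputCol("vectors")
val transformed = assembler.transform(df)

// corr returns a one-row DataFrame whose single cell holds the correlation Matrix
val Row(corr: Matrix) = Correlation.corr(transformed, "vectors").head
println(s"Pearson correlation matrix:\n $corr")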
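
For the ten-column DataFrame from the question's update, the default string form of the matrix may be cut off to fit the terminal. Below is a minimal sketch of two ways to print every entry. It assumes a local SparkSession created just for illustration; the object name FullCorrelationMatrix and the helper formatMatrix are hypothetical names introduced here, not part of Spark.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{Row, SparkSession}

object FullCorrelationMatrix {

  // Hypothetical helper: renders the matrix as a tab-separated table
  // with the column names as both the header row and the row labels.
  def formatMatrix(m: Matrix, names: Array[String]): String = {
    val header = ("" +: names).mkString("\t")
    val rows = names.indices.map { i =>
      val cells = (0 until m.numCols).map(j => f"${m(i, j)}%.3f")
      (names(i) +: cells).mkString("\t")
    }
    (header +: rows).mkString("\n")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("full-correlation-matrix")
      .getOrCreate()
    import spark.implicits._

    val cols = (1 to 10).map(i => s"c$i").toArray

    val df = Seq(
      (0.1, 0.3, 0.5, 0.6, 0.8, 0.1, 0.3, 0.5, 0.6, 0.8),
      (0.2, 0.4, 0.6, 0.7, 0.7, 0.2, 0.4, 0.6, 0.7, 0.7)
    ).toDF(cols: _*)

    val assembler = new VectorAssembler()
      .setInputCols(cols)
      .setOutputCol("vectors")

    val Row(corr: Matrix) =
      Correlation.corr(assembler.transform(df), "vectors").head

    // Option 1: raise the line and width limits of Matrix.toString
    println(corr.toString(Int.MaxValue, Int.MaxValue))

    // Option 2: print a labelled table built entry by entry
    println(formatMatrix(corr, cols))

    spark.stop()
  }
}
```

Option 1 is the shortest change to the answer's code. Option 2 is closer to what pandas prints, since each row and column carries its name.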