How can I calculate a Spearman correlation coefficient with Spark? I am unable to reproduce an example from a statistics book
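To train myself in both Spark and classical statistical analysis, I am trying to reproduce some examples from a book (a general statistics book, not one dedicated to computing or to Spark).
One example in the book computes the Spearman correlation coefficient for the scores two judges gave to ten athletes: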
| Judge 1 | 8.3 | 7.6 | 9.1 | 9.5 | 8.4 | 6.9 | 9.2 | 7.8 | 8.6 | 8.2
| Judge 2 | 7.9 | 7.4 | 9.1 | 9.3 | 8.4 | 7.5 | 9.0 | 7.2 | 8.2 | 8.1
It first builds an intermediate matrix of ranks:
| Judge 1 | 5 | 2 | 8 | 10 | 6 | 1 | 9 | 3 | 7 | 4
| Judge 2 | 4 | 2 | 9 | 10 | 7 | 3 | 8 | 1 | 6 | 5
The book's example ends with the following result:
r = 0.915
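As a sanity check on the book's figure, here is a minimal plain-Java sketch (no Spark involved) that applies the textbook formula r = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) to the rank table above. The class and method names are made up for illustration, and the formula is only valid when there are no tied ranks, which is the case here.

```java
import java.util.Locale;

public class SpearmanByHand {

    // Spearman coefficient from two rank arrays, assuming no tied ranks:
    // r = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the rank difference.
    static double spearmanFromRanks(int[] ranksA, int[] ranksB) {
        if (ranksA.length != ranksB.length) {
            throw new IllegalArgumentException("Both rank arrays must have the same length");
        }
        int n = ranksA.length;
        long sumSquaredDiff = 0;
        for (int i = 0; i < n; i++) {
            long d = ranksA[i] - ranksB[i];
            sumSquaredDiff += d * d;
        }
        return 1.0 - (6.0 * sumSquaredDiff) / ((double) n * (n * n - 1));
    }

    public static void main(String[] args) {
        // Ranks taken from the intermediate matrix in the book's example
        int[] judge1 = {5, 2, 8, 10, 6, 1, 9, 3, 7, 4};
        int[] judge2 = {4, 2, 9, 10, 7, 3, 8, 1, 6, 5};
        // Sum of squared differences is 14, so r = 1 - 84 / 990
        System.out.printf(Locale.ROOT, "r = %.3f%n", spearmanFromRanks(judge1, judge2));
        // prints: r = 0.915
    }
}
```

The squared rank differences sum to 14, giving 1 - 84/990, about 0.91515, which rounds to the book's 0.915.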
I tried to implement it with Spark as follows, according to the API documentation of Correlation:
List<Row> data = Arrays.asList(
RowFactory.create(Vectors.dense(8.3, 7.6, 9.1, 9.5, 8.4, 6.9, 9.2, 7.8, 8.6, 8.2)),
RowFactory.create(Vectors.dense(7.9, 7.4, 9.1, 9.3, 8.4, 7.5, 9.0, 7.2, 8.2, 8.1))
);
StructType schema = new StructType(new StructField[]{
new StructField("features", new VectorUDT(), false, Metadata.empty()),
});
Dataset<Row> df = this.session.createDataFrame(data, schema);
Row r2 = Correlation.corr(df, "features", "spearman").head();
System.out.println("Spearman correlation matrix:\n" + r2.get(0).toString());
But it does not return a single coefficient. Instead, I get a matrix that looks strange to me:
Spearman correlation matrix:
1.0 0.9999999999999998 NaN ... (10 total)
0.9999999999999998 1.0 NaN ...
NaN NaN 1.0 ...
0.9999999999999998 0.9999999999999998 NaN ...
NaN NaN NaN ...
-0.9999999999999998 -0.9999999999999998 NaN ...
0.9999999999999998 0.9999999999999998 NaN ...
0.9999999999999998 0.9999999999999998 NaN ...
0.9999999999999998 0.9999999999999998 NaN ...
0.9999999999999998 0.9999999999999998 NaN ...
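I am new to MLlib and my statistics skills are not very strong. Clearly, I did something wrong.
What am I seeing here instead of what I expected, and how can I get the result I want?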
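The solution to part of the problem is embarrassing...
I had simply laid out the vectors the wrong way round: each row should hold one athlete's pair of scores (Judge 1, Judge 2), not one judge's ten scores. Here is the corrected data: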
List<Row> data = Arrays.asList(
RowFactory.create(Vectors.dense(8.3, 7.9)),
RowFactory.create(Vectors.dense(7.6, 7.4)),
RowFactory.create(Vectors.dense(9.1, 9.1)),
RowFactory.create(Vectors.dense(9.5, 9.3)),
RowFactory.create(Vectors.dense(8.4, 8.4)),
RowFactory.create(Vectors.dense(6.9, 7.5)),
RowFactory.create(Vectors.dense(9.2, 9.0)),
RowFactory.create(Vectors.dense(7.8, 7.2)),
RowFactory.create(Vectors.dense(8.6, 8.2)),
RowFactory.create(Vectors.dense(8.2, 8.1))
);
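With this correction the program prints the following (the label is the French string in my code, meaning "Correlation between the two judges' scores for the athletes"):
Correlation entre les notes des deux juges pour les sportifs :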
1.0 0.9151515151515153
0.9151515151515153 1.0
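For what it's worth, the strange first matrix can be explained too. With only two rows, each of the ten vector positions is treated as a variable with just two observations, so every pairwise rank correlation can only be +1, -1, or undefined (when the two values are tied). The plain-Java sketch below (no Spark; the class and method names are made up for illustration) mimics that reasoning on the original two rows. It is a simplification of what a rank correlation does with two points, not Spark's actual implementation.

```java
public class TwoRowSpearman {

    // With two observations per variable, ranks are (1,2), (2,1) or tied.
    // The correlation of variables i and j is then the product of the signs
    // of their changes from row 1 to row 2, or NaN if either one is tied.
    static double pairwise(double[] row1, double[] row2, int i, int j) {
        if (i == j) {
            return 1.0; // the diagonal is shown as 1.0 in the output above
        }
        double signI = Math.signum(row2[i] - row1[i]);
        double signJ = Math.signum(row2[j] - row1[j]);
        if (signI == 0.0 || signJ == 0.0) {
            return Double.NaN; // a tied variable has zero variance
        }
        return signI * signJ;
    }

    public static void main(String[] args) {
        // The two rows exactly as they were (wrongly) passed to Spark
        double[] judge1 = {8.3, 7.6, 9.1, 9.5, 8.4, 6.9, 9.2, 7.8, 8.6, 8.2};
        double[] judge2 = {7.9, 7.4, 9.1, 9.3, 8.4, 7.5, 9.0, 7.2, 8.2, 8.1};
        // Print the first three columns of each row of the 10 x 10 matrix
        for (int i = 0; i < judge1.length; i++) {
            System.out.println(pairwise(judge1, judge2, i, 0) + " "
                    + pairwise(judge1, judge2, i, 1) + " "
                    + pairwise(judge1, judge2, i, 2));
        }
    }
}
```

This reproduces the pattern of the first output: positions 3 and 5 (scores 9.1/9.1 and 8.4/8.4) are ties and give NaN, position 6 (6.9 then 7.5) is the only one that increases and so gets -1 against the others, and everything else is 1 (Spark shows 0.9999999999999998 because of floating-point rounding).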