Group by to concatenate strings without collect_list/collect_set - Spark
I have the following DataFrame:
+------------------------------------+------------------------------+
|MeteVarID                           |Conc                          |
+------------------------------------+------------------------------+
|9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d|Friday 0 0.9604490986400536   |
|9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d|Friday 1 0.8109076852795446   |
|9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d|Friday 2 0.7282039568471731   |
|9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d|Friday 3 0.5335418350493728   |
+------------------------------------+------------------------------+
I want to group by MeteVarID and concatenate the strings. The final DataFrame should look like this:
9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d | Friday 0 0.9604490986400536, Friday 1 0.8109076852795446, etc.
You can use the plain ol' RDD API and switch back to a DataFrame:
import spark.implicits._  // needed for .toDF on an RDD of tuples

df.rdd
  .map(row => (row.getAs[String]("MeteVarID"), row.getAs[String]("Conc")))  // key each row by MeteVarID
  .reduceByKey(_ + ", " + _)  // concatenate the Conc strings per key
  .toDF("MeteVarID", "Conc")
  .show(false)
+------------------------------------+------------------------------------------------------------------------------------------------------------------+
|MeteVarID                           |Conc                                                                                                              |
+------------------------------------+------------------------------------------------------------------------------------------------------------------+
|9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d|Friday 0 0.9604490986400536, Friday 1 0.8109076852795446, Friday 2 0.7282039568471731, Friday 3 0.5335418350493728|
+------------------------------------+------------------------------------------------------------------------------------------------------------------+
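If you'd rather stay in the typed Dataset API, here is a minimal alternative sketch (assuming a SparkSession in scope named spark, and the same df as above); groupByKey plus mapGroups likewise avoids collect_list/collect_set:

import spark.implicits._  // encoders for the tuple Dataset

df.select("MeteVarID", "Conc")
  .as[(String, String)]
  .groupByKey(_._1)  // group rows by MeteVarID
  .mapGroups { (key, rows) =>
    (key, rows.map(_._2).mkString(", "))  // join the Conc values with ", "
  }
  .toDF("MeteVarID", "Conc")
  .show(false)

Note that neither version guarantees the order in which the strings are concatenated; if you need a specific order, sort the values within each group before joining.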