Spark groupby aggregations

I am trying to do a groupBy aggregation, using Spark 1.5.2.

Can you tell me why this does not work?

in is a DataFrame:

scala> in
res28: org.apache.spark.sql.DataFrame = [id: int, city: string]

scala> in.show
+---+--------+
| id|    city|
+---+--------+
| 10|Bathinda|
| 20|Amritsar|
| 30|Bathinda|
+---+--------+

scala> in.groupBy("city").agg(Map{
     |   "id" -> "sum"
     | }).show(true)
+----+-------+
|city|sum(id)|
+----+-------+
+----+-------+

Thanks,

I expect the output to contain each city and the sum of its ids.

Edit: I don't know why, but it works the next time I start a new spark-shell.

Consider the following DataFrame:

// In spark-shell, toDF is available out of the box; in a standalone
// application you would also need: import sqlContext.implicits._
val in = sc.parallelize(Seq(
  (10, "Bathinda"), (20, "Amritsar"), (30, "Bathinda"))).toDF("id", "city")

You can see that all of the following lines of code give the same output:

scala> in.groupBy("city").agg(Map("id" -> "sum")).show
+--------+-------+
|    city|sum(id)|
+--------+-------+
|Bathinda|     40|
|Amritsar|     20|
+--------+-------+

scala> in.groupBy("city").agg(Map{ "id" -> "sum"}).show
+--------+-------+
|    city|sum(id)|
+--------+-------+
|Bathinda|     40|
|Amritsar|     20|
+--------+-------+

scala> in.groupBy("city").agg(Map{ "id" -> "sum"}).show(true)
+--------+-------+
|    city|sum(id)|
+--------+-------+
|Bathinda|     40|
|Amritsar|     20|
+--------+-------+

scala> in.groupBy("city").agg(sum($"id")).show(true)
+--------+-------+
|    city|sum(id)|
+--------+-------+
|Bathinda|     40|
|Amritsar|     20|
+--------+-------+

scala> in.groupBy("city").agg(sum(in("id"))).show(true)
+--------+-------+
|    city|sum(id)|
+--------+-------+
|Bathinda|     40|
|Amritsar|     20|
+--------+-------+
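
For completeness, the same aggregation can also be expressed through the SQL interface. This is only a sketch: the temp-table name cities is made up here, and registerTempTable/sqlContext.sql are the Spark 1.5-era entry points:

// Register the DataFrame under a (hypothetical) name so it can be queried with SQL.
in.registerTempTable("cities")

// The same GROUP BY aggregation written as a SQL query.
sqlContext.sql("SELECT city, SUM(id) FROM cities GROUP BY city").show()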

Note: the boolean argument to show is truncate, and it defaults to true; it only controls whether the whole field value is displayed. (Sometimes a field is too long and you only need a preview.)
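
As a quick sketch of the truncate behaviour (the long city name below is made up):

val longDf = sc.parallelize(Seq(
  (40, "a-very-long-city-name-well-over-twenty-chars"))).toDF("id", "city")

// Default behaviour (truncate = true): long values are cut off after 20 characters.
longDf.show(true)

// Pass false to print the full field values.
longDf.show(false)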