Dataframe Aggregation

Question

I have a DataFrame DF with the following structure:

ID, DateTime, Latitude, Longitude, otherArgs

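I want to group my data by ID and time window while keeping information about the location (for example, the mean of the grouped latitudes and the mean of the grouped longitudes).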

I successfully got a new DataFrame with the data grouped by ID and time window using:

DF.groupBy($"ID",window($"DateTime","2 minutes")).agg(max($"ID"))

But doing so loses my location data.

What I am looking for is something like this, for example:

DF.groupBy($"ID",window($"DateTime","2 minutes"),mean("latitude"),mean("longitude")).agg(max($"ID"))

This should return only one row per ID and time window.

Edit:

Sample input: DF: ID, DateTime, Latitude, Longitude, otherArgs

0 , 2018-01-07T04:04:00 , 25.000, 55.000, OtherThings
0 , 2018-01-07T04:05:00 , 26.000, 56.000, OtherThings
1 , 2018-01-07T04:04:00 , 26.000, 50.000, OtherThings
1 , 2018-01-07T04:05:00 , 27.000, 51.000, OtherThings

Sample output: DF: ID, window(DateTime), Latitude, Longitude

0 , (2018-01-07T04:04:00 : 2018-01-07T04:06:00) , 25.5, 55.5
1 , (2018-01-07T04:04:00 : 2018-01-07T04:06:00) , 26.5, 50.5

Answer 1

You should do the aggregation in the .agg() method, not in groupBy.

Perhaps this is what you mean?

// Assumes import org.apache.spark.sql.functions.{mean, window} and spark.implicits._
DF
  .groupBy(
    'ID,
    window('DateTime, "2 minutes")
  )
  .agg(
    mean("latitude").as("latitudeMean"),
    mean("longitude").as("longitudeMean")
  )
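
To see why the two sample rows of each ID end up in a single group: when window is given only a window duration, it produces fixed, non-overlapping windows aligned to 1970-01-01 00:00:00 UTC, with an inclusive start and an exclusive end. The sketch below mimics that bucketing in plain Scala, without Spark. `windowStart` is a hypothetical helper written for this illustration, not a Spark API, and the timestamps are assumed to be in UTC.

```scala
import java.time.{Duration, Instant}

object WindowSketch {
  // Hypothetical helper (not part of Spark): start of the fixed-size window
  // containing `ts`, for windows aligned to the Unix epoch.
  def windowStart(ts: Instant, size: Duration): Instant = {
    val sizeSec = size.getSeconds
    val epochSec = ts.getEpochSecond
    // Round the timestamp down to the nearest multiple of the window size.
    Instant.ofEpochSecond(epochSec - Math.floorMod(epochSec, sizeSec))
  }

  def main(args: Array[String]): Unit = {
    val twoMinutes = Duration.ofMinutes(2)
    // The two sample timestamps from the question, assumed to be UTC.
    val rows = Seq("2018-01-07T04:04:00Z", "2018-01-07T04:05:00Z").map(Instant.parse)
    rows.foreach { ts =>
      val start = windowStart(ts, twoMinutes)
      println(s"$ts -> [$start, ${start.plus(twoMinutes)})")
    }
  }
}
```

Both sample timestamps map to the window starting at 04:04:00, which is why grouping by ID and window yields one row per ID for the sample data.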

Answer 2

Here is what you can do: use mean inside the aggregation.

val df = Seq(
  (0, "2018-01-07T04:04:00", 25.000, 55.000, "OtherThings"),
  (0, "2018-01-07T04:05:00", 26.000, 56.000, "OtherThings"),
  (1, "2018-01-07T04:04:00", 26.000, 50.000, "OtherThings"),
  (1, "2018-01-07T04:05:00", 27.000, 51.000, "OtherThings")
).toDF("ID", "DateTime", "Latitude", "Longitude", "otherArgs")
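// Convert the String column to DateType. Note: DateType drops the time of day,
// so the windows below start around midnight; cast to TimestampType to keep it.
.withColumn("DateTime", $"DateTime".cast(DateType))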

df.groupBy($"id", window($"DateTime", "2 minutes"))
  .agg(
    mean("Latitude").as("lat"),
    mean("Longitude").as("long")
  )
.show(false)

Output:

+---+---------------------------------------------+----+----+
|id |window                                       |lat |long|
+---+---------------------------------------------+----+----+
|1  |[2018-01-06 23:59:00.0,2018-01-07 00:01:00.0]|26.5|50.5|
|0  |[2018-01-06 23:59:00.0,2018-01-07 00:01:00.0]|25.5|55.5|
+---+---------------------------------------------+----+----+
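
The windows in the output above sit around midnight rather than at 04:04 because the DateType cast discards the time of day. Below is a minimal sketch of the same aggregation with a TimestampType cast instead. The local SparkSession, the app name, the `aggregate` method name, and the UTC session time zone are assumptions made for this illustration; with a different session time zone the window boundaries can differ.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{mean, window}
import org.apache.spark.sql.types.TimestampType

object WindowedMean {
  // Builds the sample data and returns one row per (ID, 2-minute window).
  def aggregate(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val df = Seq(
      (0, "2018-01-07T04:04:00", 25.000, 55.000, "OtherThings"),
      (0, "2018-01-07T04:05:00", 26.000, 56.000, "OtherThings"),
      (1, "2018-01-07T04:04:00", 26.000, 50.000, "OtherThings"),
      (1, "2018-01-07T04:05:00", 27.000, 51.000, "OtherThings")
    ).toDF("ID", "DateTime", "Latitude", "Longitude", "otherArgs")
      // TimestampType keeps the time of day, unlike DateType.
      .withColumn("DateTime", $"DateTime".cast(TimestampType))

    df.groupBy($"ID", window($"DateTime", "2 minutes"))
      .agg(
        mean("Latitude").as("lat"),
        mean("Longitude").as("long")
      )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")                          // assumption: local run
      .appName("windowed-mean-sketch")             // assumption: arbitrary name
      .config("spark.sql.session.timeZone", "UTC") // assumption: UTC timestamps
      .getOrCreate()

    aggregate(spark).orderBy("ID").show(false)
    spark.stop()
  }
}
```

Under these assumptions each ID should produce a single row whose window covers both sample timestamps, with the same means as in the output above.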

scala

aggregation

apache-spark

apache-spark-sql