Spark streaming: select record with max timestamp for each id in dataframe (pyspark)

Question

I have a dataframe with the following schema:

 |-- record_id: integer (nullable = true)
 |-- Data1: string (nullable = true)
 |-- Data2: string (nullable = true)
 |-- Data3: string (nullable = true)
 |-- Time: timestamp (nullable = true)

I want to retrieve the latest record for each record_id, that is, the row with the maximum timestamp within each record_id group.

So, if the data initially looks like this:

 +----------+---------+---------+---------+-----------------------+
 |record_id |Data1    |Data2    |Data3    |                   Time|
 +----------+---------+---------+---------+-----------------------+
 |        1 | aaa     | null    |  null   | 2018-06-04 21:51:53.0 |
 |        1 | null    | bbbb    |  cccc   | 2018-06-05 21:51:53.0 |
 |        1 | aaa     | null    |  dddd   | 2018-06-06 21:51:53.0 |
 |        1 | qqqq    | wwww    |  eeee   | 2018-06-07 21:51:53.0 |
 |        2 | aaa     | null    |  null   | 2018-06-04 21:51:53.0 |
 |        2 | aaaa    | bbbb    |  cccc   | 2018-06-05 21:51:53.0 |
 |        3 | aaa     | null    |  dddd   | 2018-06-06 21:51:53.0 |
 |        3 | aaaa    | bbbb    |  eeee   | 2018-06-08 21:51:53.0 |
 +----------+---------+---------+---------+-----------------------+

I want the output to be:

 +----------+---------+---------+---------+-----------------------+
 |record_id |Data1    |Data2    |Data3    |                   Time|
 +----------+---------+---------+---------+-----------------------+
 |        1 | qqqq    | wwww    |  eeee   | 2018-06-07 21:51:53.0 |
 |        2 | aaaa    | bbbb    |  cccc   | 2018-06-05 21:51:53.0 |
 |        3 | aaaa    | bbbb    |  eeee   | 2018-06-08 21:51:53.0 |
 +----------+---------+---------+---------+-----------------------+
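
To pin down the intended semantics, here is a minimal plain-Python sketch (no Spark, and not a streaming solution) of "keep the row with the largest Time per record_id", run on the sample rows above. The helper names `latest_per_id` and `ts` and the tuple layout are assumptions made for illustration only.

```python
from datetime import datetime


def latest_per_id(rows):
    """Return, for each record_id, the row with the largest timestamp.

    Each row is a tuple whose first element is the record_id and whose
    last element is the timestamp (an assumed layout for this sketch).
    """
    latest = {}
    for row in rows:
        record_id, time = row[0], row[-1]
        # Keep the row only if it is the first or the newest seen for this id.
        if record_id not in latest or time > latest[record_id][-1]:
            latest[record_id] = row
    return [latest[key] for key in sorted(latest)]


def ts(day):
    """Build the sample timestamp 2018-06-<day> 21:51:53."""
    return datetime(2018, 6, day, 21, 51, 53)


rows = [
    (1, "aaa", None, None, ts(4)),
    (1, None, "bbbb", "cccc", ts(5)),
    (1, "aaa", None, "dddd", ts(6)),
    (1, "qqqq", "wwww", "eeee", ts(7)),
    (2, "aaa", None, None, ts(4)),
    (2, "aaaa", "bbbb", "cccc", ts(5)),
    (3, "aaa", None, "dddd", ts(6)),
    (3, "aaaa", "bbbb", "eeee", ts(8)),
]

result = latest_per_id(rows)
for row in result:
    print(row)
```

On a stream, this reduction has to be maintained incrementally as new rows arrive, which is where the difficulty lies.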

I tried joining two queries on the same stream, similar to another answer. My code (where df1 is the original dataframe):

from pyspark.sql.functions import col

df1 = df1.withWatermark("Timetemp", "2 seconds")
df1.createOrReplaceTempView("tbl")
df1.printSchema()
query = "select t.record_id as record_id, max(t.Timetemp) as Timetemp from tbl t group by t.record_id"
df2 = spark.sql(query)
df2 = df2.withWatermark("Timetemp", "2 seconds")

qws = df1.alias('a').join(df2.alias('b'), ((col('a.record_id') == col('b.record_id')) & (col("a.Timetemp") == col("b.Timetemp"))))

query = qws.writeStream.outputMode('append').format('console').start()

query.awaitTermination()

However, I keep getting this error:

Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;

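even though there is clearly a watermark. What can be done? I can't use window functions, because non-time-based windows are not supported on streaming DataFrames.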

Answer 1

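I had the same task. I tried several options, such as adding a current_timestamp column to the dataset and grouping by a window and the record ID with a watermark, but nothing worked.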

As far as I can tell, there is no API that solves this task directly. Window partitioning and ordering do not work on streaming Datasets.

I solved this task using the mapGroupsWithState API, but without keeping any state, as follows:

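import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}
import scala.concurrent.duration._
import spark.implicits._

// sourcePath, input.Data, streamingConfig and streamName come from the surrounding project.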

val stream = spark.readStream
  .option("maxFileAge", "24h")
  .option("maxFilesPerTrigger", "1000")
  .parquet(sourcePath)
  .as[input.Data]

val mostRecentRowPerPrimaryKey =
  stream
    .groupByKey(_.id)
    .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(takeMostRecentOnly)

mostRecentRowPerPrimaryKey
  .repartition(5)
  .writeStream
  .option("checkpointLocation", s"${streamingConfig.checkpointBasePath}/$streamName")
  .option("truncate", "false")
  .format("console")
  .outputMode(OutputMode.Update())
  .trigger(Trigger.ProcessingTime(60.seconds))
  .queryName(streamName)
  .start()

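// Returns the most recent row among the values passed in for this key.
// The state argument is required by the API but is never read or updated here.
def takeMostRecentOnly(pk: Long, values: Iterator[input.Data], state: GroupState[input.Data]): input.Data = {
  values.maxBy(_.last_modified)
}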

Note: this only works in Update mode.
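
Because the function above never touches the state, each trigger appears to emit the most recent row among the records that arrived for that key in that trigger only, so a late, older record arriving in a later trigger would still be emitted. A stateful variant would compare each incoming row against the stored one. The following is a plain-Python simulation of that per-key logic, not Spark API code; `process_batch`, the dict-based state, and the `(id, payload, last_modified)` tuple layout are hypothetical names chosen for illustration.

```python
def process_batch(state, batch):
    """Fold one micro-batch into the per-key state and return the changed rows.

    state: dict mapping id -> (id, payload, last_modified); mutated in place.
    batch: iterable of (id, payload, last_modified) tuples.
    """
    changed = {}
    for row in batch:
        key, last_modified = row[0], row[2]
        current = state.get(key)
        # Accept the row only if it is newer than what the state already holds.
        if current is None or last_modified > current[2]:
            state[key] = row
            changed[key] = row
    # Like Update mode, report only the keys whose latest row changed.
    return [changed[key] for key in sorted(changed)]


state = {}
out1 = process_batch(state, [(1, "a", 10), (1, "b", 20), (2, "c", 5)])
out2 = process_batch(state, [(1, "late", 15), (2, "d", 7)])
print(out1)
print(out2)
```

In this simulation the late record for id 1 in the second batch is ignored because the stored row is newer, which is the behavior a state-keeping version of takeMostRecentOnly would aim for.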

Hope this helps!
