为什么流数据集失败并显示 "Complete output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets... "?

Why does streaming Dataset fail with "Complete output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets... "?

我使用 Spark 2.2.0,在 windows 上使用 Spark Structured Streaming 时出现以下错误:

Complete output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark.

Complete output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark

Streaming 聚合要求您告诉 Spark Structured Streaming 引擎何时输出聚合(根据所谓的 输出模式 ),因为可能是聚合一部分的数据可能迟到,一段时间后才有空。

“某个时间”部分是事件延迟,描述为从当前时间到水印之前的时间。

这就是为什么您必须指定水印以使 Spark drop/disregard 任何延迟事件并停止累积最终可能导致 OutOfMemoryError 或类似错误的状态。

话虽如此,您应该在流数据集上使用 withWatermark 运算符。

withWatermark Defines an event time watermark for this Dataset. A watermark tracks a point in time before which we assume no more late data is going to arrive.

并引用...

Spark will use this watermark for several purposes:

  • To know when a given time window aggregation can be finalized and thus can be emitted when using output modes that do not allow updates.
  • To minimize the amount of state that we need to keep for on-going aggregations, mapGroupsWithState and dropDuplicates operators.

The current watermark is computed by looking at the MAX(eventTime) seen across all of the partitions in the query minus a user specified delayThreshold. Due to the cost of coordinating this value across partitions, the actual watermark used is only guaranteed to be at least delayThreshold behind the actual event time. In some cases we may still process records that arrive more than delayThreshold late.

查看 Spark Structured Streaming 的 Handling Late Data and Watermarking