How to specify retention time for stream-stream joins?
I would like to understand the retention time for Structured Streaming in Spark.
I have different Spark Structured Streams:
- Stream A: arrives every 10 seconds, starting from time t0;
- Stream B: arrives every 10 seconds, starting from time t0;
- Stream C: arrives every 10 seconds, starting from time t1;
I need to apply a machine learning model to this data using a pandas UDF. Streams A and B run independently. Data from Stream C has to be joined with Streams A and B before it can be processed.
My question is: how do I ensure that the data processed in Stream A and Stream B is not thrown away? Is using a watermark alone sufficient to achieve this?
That's right. The state of a stream-stream join is kept forever by default, so the first part works out of the box, while the second part requires watermarks plus an "additional join condition".
Quoting Inner Joins with optional Watermarking:
Inner joins on any kind of columns along with any kind of join conditions are supported. However, as the stream runs, the size of streaming state will keep growing indefinitely as all past input must be saved as any new input can match with any input from the past. To avoid unbounded state, you have to define additional join conditions such that indefinitely old inputs cannot match with future inputs and therefore can be cleared from the state.
- Define watermark delays on both inputs such that the engine knows how delayed the input can be (similar to streaming aggregations)
- Define a constraint on event-time across the two inputs such that the engine can figure out when old rows of one input is not going to be required (i.e. will not satisfy the time constraint) for matches with the other input.
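A minimal PySpark sketch of those two steps, applied to a hypothetical join between Stream A and Stream C. The "rate" sources, the column names (timeA, idA, timeC, idC) and the 1-hour/2-hour bounds are placeholders I've assumed, not details from the question.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("stream-stream-join-retention").getOrCreate()

# Placeholder sources; swap in your real readers for Stream A and Stream C.
stream_a = (spark.readStream.format("rate").load()
            .withColumnRenamed("timestamp", "timeA")
            .withColumnRenamed("value", "idA"))
stream_c = (spark.readStream.format("rate").load()
            .withColumnRenamed("timestamp", "timeC")
            .withColumnRenamed("value", "idC"))

# 1. Watermark delays on both inputs tell the engine how late each input can be.
stream_a_wm = stream_a.withWatermark("timeA", "1 hour")
stream_c_wm = stream_c.withWatermark("timeC", "2 hours")

# 2. An event-time constraint across the two inputs (the "additional join
#    condition") lets the engine clear state for rows that can no longer match.
joined = stream_a_wm.join(
    stream_c_wm,
    expr("""
        idA = idC AND
        timeC >= timeA AND
        timeC <= timeA + interval 1 hour
    """),
    "inner",
)

query = joined.writeStream.format("console").start()
query.awaitTermination()
```

With both pieces in place, rows of Stream A older than the watermark plus the one-hour join window can be dropped from state instead of being retained forever; without them, the join state grows indefinitely as described in the quote above.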