Google Dataflow late data

I've been reading through the Dataflow SDK documentation, trying to find out what happens when data arrives past the watermark in a streaming job.

This page:

https://cloud.google.com/dataflow/model/windowing

indicates that if you use the default windowing/trigger strategies, then late data will be discarded:

Note: Dataflow's default windowing and trigger strategies discard late data. If you want to ensure that your pipeline handles instances of late data, you'll need to explicitly set .withAllowedLateness when you set your PCollection's windowing strategy and set triggers for your PCollections accordingly.
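
For reference, a minimal sketch of what that looks like, assuming the Apache Beam Java SDK (the successor to the Dataflow SDK these pages document); events is a hypothetical streaming PCollection of strings, and the window size, lateness, and trigger choices are arbitrary:

    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // "events" is a hypothetical streaming input of strings.
    PCollection<String> windowed = events.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            // Fire once at the watermark, then once per late element.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            // Without this call, allowed lateness defaults to 0.
            .withAllowedLateness(Duration.standardMinutes(10))
            // Setting a trigger explicitly also requires choosing an
            // accumulation mode for fired panes.
            .accumulatingFiredPanes());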

And yet this page:

https://cloud.google.com/dataflow/model/triggers

indicates that late-arriving data will be emitted as single elements in the PCollection:

The default trigger for a PCollection is event time-based, and emits the results of the window when the system's watermark (Dataflow's notion of when it "should" have all the data) passes the end of the window. The default trigger emits on a repeating basis, meaning that any late data will by definition arrive after the watermark and trip the trigger, causing the late elements to be emitted as they arrive.

So, will late data past the watermark be discarded completely? Or, if it arrives too late to be emitted with the rest of the data, will it be emitted on its own?

默认"windowing and trigger strategies"丢弃迟到的数据。 WindowingStrategy 是一个对象,由窗口、触发和一些其他参数(例如允许的延迟)组成。默认允许的迟到为 0,因此任何迟到的数据元素都将被丢弃。

The default trigger does handle late data, though. If you take the default WindowingStrategy and alter only the allowed lateness, then you will receive a PCollection containing one output pane for all the on-time data, followed by a new output pane for approximately every late element.
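
One way to see this, sketched under the same assumptions as before: alter only the allowed lateness, aggregate, and log each pane's timing downstream. PaneInfo reports ON_TIME for the pane fired at the watermark and LATE for each pane fired afterwards by a late element:

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    // Default WindowingStrategy except for the allowed lateness: the default
    // trigger still applies, so it fires once at the watermark and then again
    // for each late element arriving within the allowed lateness.
    events
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            .withAllowedLateness(Duration.standardMinutes(10)))
        .apply(Count.perElement())
        .apply(ParDo.of(new DoFn<KV<String, Long>, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // ON_TIME for the pane emitted at the watermark; LATE for each
            // pane fired by a late-arriving element afterwards.
            System.out.println(c.element() + " pane=" + c.pane().getTiming());
          }
        }));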