在 DirectRunner 上运行时，开窗似乎有效，但在 Cloud Dataflow 上运行时无效

Question

我正在尝试使用 GroupByKey 打破融合。这会创建一个巨大的 window，因为我的工作量很大，所以我宁愿开始发射。

直接运行ner 使用类似我发现的东西似乎有效。但是，当在 Cloud Dataflow 上运行时，它似乎将 GBK 一起批处理，并且在源节点具有 "succeeded".

之前不会发出输出

我正在做 bounded/batch 工作。我正在提取存档文件的内容，然后将它们写入 gcs。

一切正常，只是花费的时间比我预期的要长，而且 cpu 利用率很低。我怀疑这是由于融合——我的假设是提取与写入操作融合，因此存在提取模式 / 较高 CPU 后跟较少 CPU 因为我们正在做网络又来又回。

代码如下：

.apply("Window",
  Window.<MyType>into(new GlobalWindows())
    .triggering(
      Repeatedly.forever(                             
        AfterProcessingTime.pastFirstElementInPane()                                
          .plusDelayOf(Duration.standardSeconds(5))))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes()
)
.apply("Add key", MapElements...)
.apply(GroupByKey.create())

我在本地使用调试日志进行验证，以便我可以看到在 GBK 之后正在完成的工作。第一次提取完成和第一个 post-GBK op 之间的时间戳通常反映了 5s 的持续时间（或我将其更改为 (1,5,10,20,30) 的其他值）。

在 GCP 上，我通过查看管道结构进行验证，我可以看到 GBK 之后的所有内容都是 "not started" 并且 GBK 的输出集合为空（“-”），而输入集合有数百万元素数。

编辑：

这是在 beam v2.10.0 上。
提取是由 SplittableDoFn 完成的（不确定这是否相关）

Answer 1

看起来您提到的答案是针对流式管道（无界输入）的。对于批处理管道处理有界输入，GroupByKey 在处理给定键的所有数据之前不会发出。请参阅 here 了解更多详情。

在 DirectRunner 上运行时，开窗似乎有效，但在 Cloud Dataflow 上运行时无效

Windowing appears to work when running on the DirectRunner, but not when running on Cloud Dataflow

google-cloud-dataflow

apache-beam

在 DirectRunner 上 运行 时，开窗似乎有效，但在 Cloud Dataflow 上 运行 时无效

Windowing appears to work when running on the DirectRunner, but not when running on Cloud Dataflow

google-cloud-dataflow

apache-beam

在 DirectRunner 上运行时，开窗似乎有效，但在 Cloud Dataflow 上运行时无效