Why mongoio Reshuffle cannot work on Dataflow

We are trying to run a streaming pipeline on Dataflow that sinks data into MongoDB:

         | "Write User Doc to Mongo" >> beam.io.WriteToMongoDB(uri=MONGO_URI,
                                                               db="db_name",
                                                               coll="col_name"
                                                               ))
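
For context, here is a minimal sketch of the kind of streaming pipeline we run; the Pub/Sub topic, window size, and Mongo URI below are placeholders rather than our real settings. The relevant point is that the PCollection reaching WriteToMongoDB sits in non-global (interval) windows:

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms import window

    # Placeholder values; the real job supplies its own topic, URI, and options
    # (--runner=DataflowRunner, project, region, temp_location, ...).
    MONGO_URI = "mongodb://user:password@host:27017"
    INPUT_TOPIC = "projects/my-project/topics/user-docs"

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | "Read From PubSub" >> beam.io.ReadFromPubSub(topic=INPUT_TOPIC)
         | "Parse JSON" >> beam.Map(json.loads)
         | "Fixed Windows" >> beam.WindowInto(window.FixedWindows(60))  # elements land in IntervalWindows
         | "Write User Doc to Mongo" >> beam.io.WriteToMongoDB(uri=MONGO_URI,
                                                               db="db_name",
                                                               coll="col_name"))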

The job fails with "IntervalWindow cannot be cast to org.apache.beam.sdk.transforms.windowing.GlobalWindow":

java.lang.ClassCastException: org.apache.beam.sdk.transforms.windowing.IntervalWindow cannot be cast to org.apache.beam.sdk.transforms.windowing.GlobalWindow
        org.apache.beam.sdk.transforms.windowing.GlobalWindow$Coder.encode(GlobalWindow.java:59)
        org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
        org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:82)
        org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:66)
        org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:51)
        org.apache.beam.sdk.util.CoderUtils.encodeToBase64(CoderUtils.java:151)
        org.apache.beam.runners.core.StateNamespaces$WindowNamespace.appendTo(StateNamespaces.java:116)
        org.apache.beam.runners.dataflow.worker.WindmillStateInternals.encodeKey(WindmillStateInternals.java:256)
        org.apache.beam.runners.dataflow.worker.WindmillStateInternals$WindmillValue.<init>(WindmillStateInternals.java:359)
        org.apache.beam.runners.dataflow.worker.WindmillStateInternals$WindmillValue.<init>(WindmillStateInternals.java:337)
        org.apache.beam.runners.dataflow.worker.WindmillStateInternals$CachingStateTable.bindValue(WindmillStateInternals.java:174)
        org.apache.beam.runners.core.StateTags.bindValue(StateTags.java:69)
        org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:276)
        org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:266)
        org.apache.beam.runners.core.StateTags$SimpleStateTag.bind(StateTags.java:296)
        org.apache.beam.runners.core.StateTable.get(StateTable.java:60)
        org.apache.beam.runners.dataflow.worker.WindmillStateInternals.state(WindmillStateInternals.java:334)
        org.apache.beam.runners.core.ReduceFnContextFactory$StateAccessorImpl.access(ReduceFnContextFactory.java:207)
        org.apache.beam.runners.core.triggers.TriggerStateMachineRunner.isClosed(TriggerStateMachineRunner.java:99)
        org.apache.beam.runners.core.ReduceFnRunner.windowsThatAreOpen(ReduceFnRunner.java:275)
        org.apache.beam.runners.core.ReduceFnRunner.processElements(ReduceFnRunner.java:345)
        org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:94)
        org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:42)
        org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:115)
        org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
        org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:80)
        org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:134)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
        org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
        org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
        org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:125)
        org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1295)
        org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access00(StreamingDataflowWorker.java:149)
        org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.run(StreamingDataflowWorker.java:1028)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        java.lang.Thread.run(Thread.java:745)

I then removed the Reshuffle (by commenting it out in the mongodbio source) and the pipeline appears to run fine:

  def expand(self, pcoll):
    return (pcoll
            | beam.ParDo(_GenerateObjectIdFn())
            # | Reshuffle()  # removed: this is the step that fails on Dataflow
            | beam.ParDo(_WriteMongoFn(self._uri, self._db, self._coll,
                                       self._batch_size, self._spec)))
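
Rather than patching the SDK source, another workaround we considered is to push the data back into the global window (with a per-element trigger so output keeps firing) right before the write, so the Reshuffle inside WriteToMongoDB only ever sees GlobalWindow. This is only a sketch of that idea; the helper name and trigger choice are ours, not something from the Beam docs:

    import apache_beam as beam
    from apache_beam.transforms import trigger, window


    def write_to_mongo_in_global_window(pcoll, uri, db, coll):
        """Re-window into the global window before handing off to WriteToMongoDB,
        so its internal Reshuffle never has to encode an IntervalWindow (sketch)."""
        return (pcoll
                | "Back To Global Window" >> beam.WindowInto(
                    window.GlobalWindows(),
                    trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                    accumulation_mode=trigger.AccumulationMode.DISCARDING)
                | "Write User Doc to Mongo" >> beam.io.WriteToMongoDB(
                    uri=uri, db=db, coll=coll))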

Why does Reshuffle not work on Dataflow?

I think this is a bug that was fixed in 2.16: https://issues.apache.org/jira/browse/BEAM-6723. From the stack trace it looks like the streaming GroupByKey inside the Reshuffle encodes its state key with the GlobalWindow coder while the incoming elements are still in interval windows, which is exactly where the cast blows up.
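
If so, upgrading the SDK is probably a cleaner fix than editing mongodbio. A rough sanity check you could run (the pip pin in the comment is the usual way to upgrade; the exact extras and version are up to your project):

    import apache_beam as beam

    # The JIRA above reports the fix in 2.16, so make sure both the local
    # launcher and the Dataflow workers run at least that SDK, e.g.
    #   pip install "apache-beam[gcp]>=2.16.0"
    print("Beam SDK version:", beam.__version__)
    major, minor = (int(x) for x in beam.__version__.split(".")[:2])
    assert (major, minor) >= (2, 16), "SDK older than 2.16; Reshuffle fix missing"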