Why does mongoio's Reshuffle not work on Dataflow?
We are trying to run a pipeline on Dataflow that sinks data into MongoDB:
    | "Write User Doc to Mongo" >> beam.io.WriteToMongoDB(uri=MONGO_URI,
                                                          db="db_name",
                                                          coll="col_name"))
It fails with the error "IntervalWindow cannot be cast to org.apache.beam.sdk.transforms.windowing.GlobalWindow":
java.lang.ClassCastException: org.apache.beam.sdk.transforms.windowing.IntervalWindow cannot be cast to org.apache.beam.sdk.transforms.windowing.GlobalWindow
    at org.apache.beam.sdk.transforms.windowing.GlobalWindow$Coder.encode(GlobalWindow.java:59)
    at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
    at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:82)
    at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:66)
    at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:51)
    at org.apache.beam.sdk.util.CoderUtils.encodeToBase64(CoderUtils.java:151)
    at org.apache.beam.runners.core.StateNamespaces$WindowNamespace.appendTo(StateNamespaces.java:116)
    at org.apache.beam.runners.dataflow.worker.WindmillStateInternals.encodeKey(WindmillStateInternals.java:256)
    at org.apache.beam.runners.dataflow.worker.WindmillStateInternals$WindmillValue.<init>(WindmillStateInternals.java:359)
    at org.apache.beam.runners.dataflow.worker.WindmillStateInternals$WindmillValue.<init>(WindmillStateInternals.java:337)
    at org.apache.beam.runners.dataflow.worker.WindmillStateInternals$CachingStateTable.bindValue(WindmillStateInternals.java:174)
    at org.apache.beam.runners.core.StateTags.bindValue(StateTags.java:69)
    at org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:276)
    at org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:266)
    at org.apache.beam.runners.core.StateTags$SimpleStateTag.bind(StateTags.java:296)
    at org.apache.beam.runners.core.StateTable.get(StateTable.java:60)
    at org.apache.beam.runners.dataflow.worker.WindmillStateInternals.state(WindmillStateInternals.java:334)
    at org.apache.beam.runners.core.ReduceFnContextFactory$StateAccessorImpl.access(ReduceFnContextFactory.java:207)
    at org.apache.beam.runners.core.triggers.TriggerStateMachineRunner.isClosed(TriggerStateMachineRunner.java:99)
    at org.apache.beam.runners.core.ReduceFnRunner.windowsThatAreOpen(ReduceFnRunner.java:275)
    at org.apache.beam.runners.core.ReduceFnRunner.processElements(ReduceFnRunner.java:345)
    at org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:94)
    at org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:42)
    at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:115)
    at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
    at org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:80)
    at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:134)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
    at org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:125)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1295)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access00(StreamingDataflowWorker.java:149)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.run(StreamingDataflowWorker.java:1028)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
When I removed the Reshuffle step from WriteToMongoDB's expand, the pipeline appears to run fine:
    def expand(self, pcoll):
        return (pcoll
                | beam.ParDo(_GenerateObjectIdFn())
                # | Reshuffle()
                | beam.ParDo(_WriteMongoFn(self._uri, self._db, self._coll,
                                           self._batch_size, self._spec)))
Why doesn't the Reshuffle work on Dataflow?
I believe this is a bug that was fixed in 2.16: https://issues.apache.org/jira/browse/BEAM-6723