Apache beam python 流式写入每小时 avro 文件文件
Apache beam python streaming writing hourly avro files files
从 pubsub 获取消息然后将其保存到 gcs 上的每小时或其他间隔文件中不起作用。该作业仅在我关闭作业时写入文件。谁能指出我正确的方向?
topic = 'test.txt'
jobname = 'streaming-' + topic.replace('.', '-')
input_topic= 'projects/PROJECT/topics/' + topic
u = Utils()
parsed_schema = u.get_parsed_avro_from_schema_service(
schema_name=topic,
schema_repo_url='localhost'
)
p = beam.Pipeline(options=pipelineoptions)
messages = p | 'Read from topic: ' + topic >> ReadFromPubSub(topic=input_topic).with_input_types(bytes)
windowed_lines = (
messages
| 'decode' >> beam.ParDo(DecodeAvro(), parsed_schema)
| beam.WindowInto(
window.FixedWindows(60),
trigger=AfterWatermark(),
accumulation_mode=AccumulationMode.DISCARDING
)
)
output = windowed_lines | 'write result' >> WriteToAvro(
file_path_prefix='gs://BUCKET/streaming/tests/',
shard_name_template=topic.split('.')[0] + '_' + str(uuid.uuid4()) + '_SSSS-of-NNNN',
schema=parsed_schema,
file_name_suffix='.avro',
num_shards=2
)
result = p.run()
result.wait_until_finish()
经过更多研究,我发现 python sdk 尚不支持从无界源写入有界源。所以我将不得不为此更改为 Java sdk。
从 pubsub 获取消息然后将其保存到 gcs 上的每小时或其他间隔文件中不起作用。该作业仅在我关闭作业时写入文件。谁能指出我正确的方向?
topic = 'test.txt'
jobname = 'streaming-' + topic.replace('.', '-')
input_topic= 'projects/PROJECT/topics/' + topic
u = Utils()
parsed_schema = u.get_parsed_avro_from_schema_service(
schema_name=topic,
schema_repo_url='localhost'
)
p = beam.Pipeline(options=pipelineoptions)
messages = p | 'Read from topic: ' + topic >> ReadFromPubSub(topic=input_topic).with_input_types(bytes)
windowed_lines = (
messages
| 'decode' >> beam.ParDo(DecodeAvro(), parsed_schema)
| beam.WindowInto(
window.FixedWindows(60),
trigger=AfterWatermark(),
accumulation_mode=AccumulationMode.DISCARDING
)
)
output = windowed_lines | 'write result' >> WriteToAvro(
file_path_prefix='gs://BUCKET/streaming/tests/',
shard_name_template=topic.split('.')[0] + '_' + str(uuid.uuid4()) + '_SSSS-of-NNNN',
schema=parsed_schema,
file_name_suffix='.avro',
num_shards=2
)
result = p.run()
result.wait_until_finish()
经过更多研究,我发现 python sdk 尚不支持从无界源写入有界源。所以我将不得不为此更改为 Java sdk。