Understanding "|" and ">>" from the Beam Python example
I have some working knowledge of Python, but I am new to Apache Beam. I came across an Apache Beam example of a simple word count program. The snippet I am confused about is shown below:
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
with beam.Pipeline(options=pipeline_options) as p:

    # Read the text file[pattern] into a PCollection.
    lines = p | ReadFromText(known_args.input)

    # Count the occurrences of each word.
    counts = (
        lines
        | 'Split' >> (
            beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x)).
            with_output_types(unicode))
        | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))

    # Format the counts into a PCollection of strings.
    def format_result(word_count):
        (word, count) = word_count
        return '%s: %s' % (word, count)

    output = counts | 'Format' >> beam.Map(format_result)

    # Write the output using a "Write" transform that has side effects.
    # pylint: disable=expression-not-assigned
    output | WriteToText(known_args.output)
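The full version of the code is here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_minimal.py
I am confused by the "|" and ">>" operators used here. What do they mean in this context? Does Python support them natively?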
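Both | and >> are ordinary Python operators (bitwise OR and right shift), so Python does support them natively. Python also lets a class define what these operators do for its own objects, and Beam uses that to give them the pipeline-specific meanings described below.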
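| is the pipe operator: it applies the transform on its right to whatever is on its left. In your example, p (the Pipeline) is the input for lines = p | ReadFromText(known_args.input), while lines (a PCollection) is the input for: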
counts = (
    lines
    | 'Split' >> (
        beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x)).
        with_output_types(unicode))
    | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
    | 'GroupAndSum' >> beam.CombinePerKey(sum))
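For intuition on how a library can make | mean "apply this step to that input", here is a minimal pure-Python sketch. It is not Beam's implementation: the Step and Collection classes and the flat_map and map_each helpers are hypothetical stand-ins, and the only thing it relies on is Python's standard __or__ operator hook.

```python
class Step:
    """Hypothetical stand-in for a transform: wraps a function over a list."""

    def __init__(self, fn):
        self.fn = fn


class Collection:
    """Hypothetical stand-in for a PCollection: wraps a plain list."""

    def __init__(self, items):
        self.items = list(items)

    def __or__(self, step):
        # Python calls this method for `collection | step`.
        # Apply the step and wrap the result so calls can be chained.
        return Collection(step.fn(self.items))


def flat_map(fn):
    # Each input item may produce zero or more output items.
    return Step(lambda items: [out for item in items for out in fn(item)])


def map_each(fn):
    # Each input item produces exactly one output item.
    return Step(lambda items: [fn(item) for item in items])


lines = Collection(["to be or", "not to be"])
pairs = lines | flat_map(str.split) | map_each(lambda w: (w, 1))
print(pairs.items[:3])  # [('to', 1), ('be', 1), ('or', 1)]
```

Each | returns a new Collection, which is why the steps can be chained left to right in the same way as the counts expression above.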
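>> attaches a name to a transform, so that the step is easy to identify in the UI. In your example, 'GroupAndSum' >> beam.CombinePerKey(sum) gives the combine step the name GroupAndSum, and likewise for the other steps.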
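The labeling syntax can be explained with the same operator-overloading mechanism. A plain string does not know how to handle >> with a transform object, so Python falls back to the right-hand operand's __rrshift__ method. And because >> binds more tightly than | in Python, x | 'Name' >> t is read as x | ('Name' >> t). The sketch below illustrates both points with hypothetical Step and Collection classes; it is an illustration under those assumptions, not Beam's code.

```python
class Step:
    """Hypothetical stand-in for a transform with an optional label."""

    def __init__(self, fn, label=None):
        self.fn = fn
        self.label = label

    def __rrshift__(self, label):
        # Called for `'Some label' >> step`, because str does not
        # implement >> for Step objects. Returns a labeled copy.
        return Step(self.fn, label)


class Collection:
    """Hypothetical stand-in for a PCollection that records applied labels."""

    def __init__(self, items, applied=()):
        self.items = list(items)
        self.applied = list(applied)

    def __or__(self, step):
        # Apply the step and remember which label was used.
        return Collection(step.fn(self.items), self.applied + [step.label])


upper = Step(lambda items: [s.upper() for s in items])
count = Step(lambda items: [len(items)])

# Parsed as: (Collection | ('Upper' >> upper)) | ('Count' >> count)
result = Collection(["a", "b"]) | 'Upper' >> upper | 'Count' >> count
print(result.items, result.applied)  # [2] ['Upper', 'Count']
```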
Read the documentation that @Klaus D. linked in the comments for more detail.