“|”是什么意思“>>”在 Apache Beam 中是什么意思?
What do the "|" and ">>" means in Apache Beam?
我正在尝试了解 Apache Beam。我在关注 programming guide,在一个例子中,他们说谈论 The following code example joins the two PCollections with CoGroupByKey, followed by a ParDo to consume the result. Then, the code uses tags to look up and format data from each collection.
。
我很惊讶,因为我在任何时候都没有看到 ParDo
操作,所以我开始怀疑 |
是否真的是 ParDo
。代码如下所示:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
emails_list = [
('amy', 'amy@example.com'),
('carl', 'carl@example.com'),
('julia', 'julia@example.com'),
('carl', 'carl@email.com'),
]
phones_list = [
('amy', '111-222-3333'),
('james', '222-333-4444'),
('amy', '333-444-5555'),
('carl', '444-555-6666'),
]
pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
emails = p | 'CreateEmails' >> beam.Create(emails_list)
phones = p | 'CreatePhones' >> beam.Create(phones_list)
results = ({'emails': emails, 'phones': phones} | beam.CoGroupByKey())
def join_info(name_info):
(name, info) = name_info
return '%s; %s; %s' %\
(name, sorted(info['emails']), sorted(info['phones']))
contact_lines = results | beam.Map(join_info)
我确实注意到 emails
和 phones
是在管道的开头读取的,所以我猜它们是不同的 PCollections
,对吧?但是 ParDo
在哪里执行? “|”是什么意思而“>>”究竟是什么意思?我如何才能看到它的实际输出? join_info
函数、emails_list
和 phones_list
是否定义在 DAG 外部是否重要?
中的|
表示步骤之间的分隔,这是(using p
as Pbegin
):p | ReadFromText(..) | ParDo(..) | GroupByKey()
.
你也可以在|
之前引用其他PCollections
:
read = p | ReadFromText(..)
kvs = read | ParDo(..)
gbk = kvs | GroupByKey()
这相当于之前的管道:p | ReadFromText(..) | ParDo(..) | GroupByKey()
在|
和PTransform
之间使用>>
来命名步骤:p | ReadFromText(..) | "to key value" >> ParDo(..) | GroupByKey()
我正在尝试了解 Apache Beam。我在关注 programming guide,在一个例子中,他们说谈论 The following code example joins the two PCollections with CoGroupByKey, followed by a ParDo to consume the result. Then, the code uses tags to look up and format data from each collection.
。
我很惊讶,因为我在任何时候都没有看到 ParDo
操作,所以我开始怀疑 |
是否真的是 ParDo
。代码如下所示:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
emails_list = [
('amy', 'amy@example.com'),
('carl', 'carl@example.com'),
('julia', 'julia@example.com'),
('carl', 'carl@email.com'),
]
phones_list = [
('amy', '111-222-3333'),
('james', '222-333-4444'),
('amy', '333-444-5555'),
('carl', '444-555-6666'),
]
pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
emails = p | 'CreateEmails' >> beam.Create(emails_list)
phones = p | 'CreatePhones' >> beam.Create(phones_list)
results = ({'emails': emails, 'phones': phones} | beam.CoGroupByKey())
def join_info(name_info):
(name, info) = name_info
return '%s; %s; %s' %\
(name, sorted(info['emails']), sorted(info['phones']))
contact_lines = results | beam.Map(join_info)
我确实注意到 emails
和 phones
是在管道的开头读取的,所以我猜它们是不同的 PCollections
,对吧?但是 ParDo
在哪里执行? “|”是什么意思而“>>”究竟是什么意思?我如何才能看到它的实际输出? join_info
函数、emails_list
和 phones_list
是否定义在 DAG 外部是否重要?
中的|
表示步骤之间的分隔,这是(using p
as Pbegin
):p | ReadFromText(..) | ParDo(..) | GroupByKey()
.
你也可以在|
之前引用其他PCollections
:
read = p | ReadFromText(..)
kvs = read | ParDo(..)
gbk = kvs | GroupByKey()
这相当于之前的管道:p | ReadFromText(..) | ParDo(..) | GroupByKey()
在|
和PTransform
之间使用>>
来命名步骤:p | ReadFromText(..) | "to key value" >> ParDo(..) | GroupByKey()