Difference Between Partitions and Multiple Outputs in a ParDo?

Question

I'm new to Apache Beam and am using the Python SDK. Suppose I have a PCollection containing elements like the following:

{"item": "foo", "color": "green", "date": "2020-10-30"}
{"item": "bar", "color": "blue", "date": "2020-10-30"}
{"item": "bar", "color": "green", "date": "2020-10-30"}
{"item": "foo", "color": "blue", "date": "2020-10-30"}

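If I want to split this into multiple PCollections based on some attribute of the elements, it seems I can use either Partition or a ParDo with tagged outputs (calling with_outputs() on the ParDo).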

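Are there guidelines for when to use Partition rather than ParDo? It seems that Partition is meant for splitting a PCollection where the resulting PCollections all have the same schema (link), whereas a ParDo could be used to accomplish that, but is better suited to splitting a PCollection into multiple PCollections that each have a different schema (link). Is my understanding of the documentation correct?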
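For concreteness, the two options described above might be sketched as follows. This is only an illustration under assumptions: the fixed COLORS list, the helper names by_color, route and split_both_ways, the "green"/"other" tags, and the shape of each tagged payload are all made up for this example and do not come from the Beam documentation.

```python
# Assumption: the set of colors is known up front and fixed.
COLORS = ["green", "blue"]


def by_color(element, num_partitions):
    """Partition function: return the index of the target partition.

    Beam calls this with the element and the partition count; the result
    must be an int in [0, num_partitions). An unknown color raises ValueError.
    """
    return COLORS.index(element["color"])


def route(element):
    """Hypothetical helper: pick a tag and a payload whose shape depends on the tag."""
    if element["color"] == "green":
        return "green", {"item": element["item"], "date": element["date"]}
    return "other", element["item"]


def split_both_ways(pcoll):
    # Imported lazily so the pure helpers above can be tested without Beam.
    import apache_beam as beam

    # Option 1: Partition. Every output holds the same kind of element.
    green, blue = pcoll | "ByColor" >> beam.Partition(by_color, len(COLORS))

    # Option 2: tagged outputs. Each tag may carry a differently shaped value.
    def tag(element):
        name, payload = route(element)
        yield beam.pvalue.TaggedOutput(name, payload)

    tagged = pcoll | "Tag" >> beam.FlatMap(tag).with_outputs("green", "other")
    return green, blue, tagged.green, tagged.other
```

In this sketch the Partition outputs still contain the original dictionaries, while the tagged outputs carry a trimmed dictionary for "green" and a bare string for "other".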

Answer 1

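ParDo specifies generic parallel processing, and the runner manages that "fan-out". Partition is not about parallelism; it is meant to split a collection into a sequence of sub-collections, with the splitting logic determined by a function you write.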

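A typical use case for Partition is grouping students by percentile and passing each group to its corresponding downstream step. Note that different student groups can have different downstream processing, which is not what ParDo is designed for.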
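The percentile use case could look roughly like the sketch below. The "percentile" and "name" fields, the group count of 4, and the function names are assumptions chosen for illustration, not part of the original answer.

```python
# Assumption: four groups; the count is fixed when the pipeline is built.
NUM_GROUPS = 4


def by_percentile(student, num_partitions):
    """Map a percentile in [0, 100] to a partition index in [0, num_partitions)."""
    index = int(student["percentile"] * num_partitions // 100)
    # A percentile of exactly 100 would land one past the end, so clamp it.
    return min(index, num_partitions - 1)


def group_students(students):
    # Imported lazily so by_percentile can be tested without Beam installed.
    import apache_beam as beam

    low, mid_low, mid_high, top = (
        students | "ByPercentile" >> beam.Partition(by_percentile, NUM_GROUPS)
    )

    # Each group is its own PCollection and can feed a different downstream step.
    top_names = top | "TopNames" >> beam.Map(lambda s: s["name"])
    return low, mid_low, mid_high, top_names
```

Because each partition comes back as a separate PCollection, a different transform can be attached to each one, which is the point the answer makes about per-group downstream steps.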

Another difference between Partition and ParDo is that the former must have a predefined number of partitions, while the latter has no such concept.

python

apache-beam