How to apply a DoFn PTransform to a PCollectionTuple in Apache Beam

Question

I am trying to apply a PTransform to a PCollectionTuple, but I can't figure out why the compiler complains.

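I want to do this so that the multiple steps needed to join some CSV lines are abstracted into a single PTransform (each PCollection in the PCollectionTuple contains the CSV lines to be joined). My problem is not the join itself, but how to apply a PTransform to a PCollectionTuple.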

Here is my code:

static class JoinCsvLines extends DoFn<PCollectionTuple, String[]> {
    @ProcessElement
    public void processElement(ProcessContext context) {
        PCollectionTuple element = context.element();
        // TODO: Implement the output
    }
}

I apply the PTransform like this:

TupleTag<String[]> tag1 = new TupleTag<>();
TupleTag<String[]> tag2 = new TupleTag<>();
PCollectionTuple toJoin = PCollectionTuple.of(tag1, csvLines1).and(tag2, csvLines2);

// Can't compile this line
PCollection<String[]> joinedLines = toJoin.apply("JoinLines", ParDo.of(new JoinCsvLines()));

When I hover over the line that doesn't compile, IntelliJ IDEA shows the following:

Required type: PTransform<? super PCollectionTuple, OutputT>
Provided: SingleOutput<PCollectionTuple, String[]>
reason: no instance(s) of type variable(s) InputT exist so that PCollectionTuple conforms to PCollection<? extends InputT>

How can I apply a PTransform to a PCollectionTuple?

Answer 1

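DoFn<PCollectionTuple, String[]> means that you want to apply the "DoFn" to each record, so you should not use PCollectionTuple as the input type. ParDo.of(...) yields a transform whose input must be a PCollection, which is why the compiler rejects applying it to a PCollectionTuple. Instead, use the element type of "csvLines1" and "csvLines2" (String[] in your code) as the DoFn's input type.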
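One way to keep the multi-step join behind a single apply call is a composite PTransform whose input type is the PCollectionTuple itself, with the per-record work done by a DoFn over String[]. The following is only a minimal sketch, not the asker's actual join: the class name JoinCsvLinesTransform, the JoinFn helper, and the concat helper are hypothetical, and it assumes that column 0 holds the join key and that the second input is small enough to be broadcast as a side input.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.TupleTag;

/** Hypothetical composite transform: consumes the whole tuple, emits joined rows. */
public class JoinCsvLinesTransform
        extends PTransform<PCollectionTuple, PCollection<String[]>> {

    private final TupleTag<String[]> leftTag;
    private final TupleTag<String[]> rightTag;

    public JoinCsvLinesTransform(TupleTag<String[]> leftTag, TupleTag<String[]> rightTag) {
        this.leftTag = leftTag;
        this.rightTag = rightTag;
    }

    @Override
    public PCollection<String[]> expand(PCollectionTuple input) {
        // Pull the individual PCollections back out of the tuple by tag.
        PCollection<String[]> left = input.get(leftTag);
        PCollection<String[]> right = input.get(rightTag);

        // Assumption: the right side is small enough to broadcast as a side input.
        PCollectionView<List<String[]>> rightView =
                right.apply("RightAsList", View.<String[]>asList());

        return left
                .apply("JoinOnFirstColumn",
                        ParDo.of(new JoinFn(rightView)).withSideInputs(rightView))
                // Reuse the input coder instead of relying on inference for String[].
                .setCoder(left.getCoder());
    }

    /** Per-element work: the DoFn's input type is the element type, String[]. */
    static class JoinFn extends DoFn<String[], String[]> {
        private final PCollectionView<List<String[]>> rightView;

        JoinFn(PCollectionView<List<String[]>> rightView) {
            this.rightView = rightView;
        }

        @ProcessElement
        public void processElement(ProcessContext context) {
            String[] leftRow = context.element();
            for (String[] rightRow : context.sideInput(rightView)) {
                // Assumption: column 0 holds the join key in both inputs.
                if (leftRow.length > 0 && rightRow.length > 0
                        && leftRow[0].equals(rightRow[0])) {
                    context.output(concat(leftRow, rightRow));
                }
            }
        }
    }

    /** Returns a new array with the fields of a followed by the fields of b. */
    static String[] concat(String[] a, String[] b) {
        String[] joined = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, joined, a.length, b.length);
        return joined;
    }
}
```

With a class like this, the failing line would become toJoin.apply("JoinLines", new JoinCsvLinesTransform(tag1, tag2)), with no ParDo.of wrapper around the outer transform. For inputs that are both large, a keyed join such as CoGroupByKey is probably a better fit than a side input.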

If you just want to merge the two PCollections, take a look at the Flatten transform: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Flatten.java#L41
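As a rough illustration of the Flatten route: Flatten.pCollections() takes a PCollectionList rather than a PCollectionTuple, so the inputs must share one element type. It concatenates the inputs into one PCollection and does not match rows by key. The class and method names below (MergeCsvLines, merge) are hypothetical.

```java
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

/** Hypothetical helper showing Flatten applied to two same-typed PCollections. */
public class MergeCsvLines {

    /** Concatenates both inputs into one PCollection; no key matching is done. */
    public static <T> PCollection<T> merge(PCollection<T> first, PCollection<T> second) {
        return PCollectionList.of(first).and(second)
                .apply("MergeLines", Flatten.<T>pCollections());
    }
}
```

Called as MergeCsvLines.merge(csvLines1, csvLines2), this would return a single PCollection<String[]> containing the rows of both inputs.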

java

dataflow

google-cloud-dataflow

apache-beam