How to create a PCollection<Row> from a PCollection<String> to perform Beam SQL transforms

Question

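I am trying to implement a data pipeline that joins multiple unbounded sources from Kafka topics. I can connect to the topics and get the data as a PCollection<String>, which I need to convert to a PCollection<Row>. I split the comma-delimited string into an array and use a schema to convert it to a Row. But how do I implement/build the schema and bind the values to it dynamically?

Even if I create a separate class for schema building, is there a way to bind the string array directly to the schema?

Below is my current working code. It is static, has to be rewritten every time I build a pipeline, and grows with the number of fields.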

// Static schema: one entry per CSV column, in column order.
final Schema sch1 =
    Schema.builder().addStringField("name").addInt32Field("age").build();

PCollection<KafkaRecord<Long, String>> kafkaDataIn1 = pipeline
  .apply(
    KafkaIO.<Long, String>read()
      .withBootstrapServers("localhost:9092")
      .withTopic("testin1")
      .withKeyDeserializer(LongDeserializer.class)
      .withValueDeserializer(StringDeserializer.class)
      .updateConsumerProperties(
         ImmutableMap.of("group.id", (Object)"test1")));

PCollection<Row> Input1 = kafkaDataIn1.apply(
  ParDo.of(new DoFn<KafkaRecord<Long, String>, Row>() {
    @ProcessElement
    public void processElement(
        ProcessContext processContext,
        final OutputReceiver<Row> emitter) {

          KafkaRecord<Long, String> record = processContext.element();
          final String input = record.getKV().getValue();

          final String[] parts = input.split(",");

          emitter.output(
            Row.withSchema(sch1)
               .addValues(
                   parts[0],
                   Integer.parseInt(parts[1])).build());
        }}))
  .apply("window",
     Window.<Row>into(FixedWindows.of(Duration.standardSeconds(50)))
       .triggering(AfterWatermark.pastEndOfWindow())
       .withAllowedLateness(Duration.ZERO)
       .accumulatingFiredPanes());

Input1.setRowSchema(sch1);

My expectation is to achieve the same thing as the code above in a dynamic/reusable way.

Answer 1

The schema is set on a PCollection, so it is not dynamic. If you want to build it lazily, you need to use a format/coder that supports that. Java serialization or JSON are examples.
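Since the schema is fixed once the pipeline is constructed, one way to avoid hand-writing it per pipeline is to derive both the schema and the value conversion from a single field spec read at construction time. The sketch below covers only the conversion half, in plain Java with no Beam dependency so it can be run standalone. The class name `DynamicRowValues`, the spec format `name:type,...`, and the type names are all assumptions made for illustration, not part of Beam.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns a field spec plus one CSV line into typed values.
final class DynamicRowValues {

    /** Parses a spec such as "name:string,age:int32" into an ordered name -> type map. */
    static Map<String, String> parseSpec(String spec) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            String[] nameAndType = entry.trim().split(":");
            if (nameAndType.length != 2) {
                throw new IllegalArgumentException("Bad field spec: " + entry);
            }
            fields.put(nameAndType[0].trim(), nameAndType[1].trim().toLowerCase());
        }
        return fields;
    }

    /** Converts one comma-delimited line into values typed according to the spec. */
    static List<Object> toValues(Map<String, String> fields, String line) {
        String[] parts = line.split(",", -1); // -1 keeps trailing empty columns
        if (parts.length != fields.size()) {
            throw new IllegalArgumentException(
                "Expected " + fields.size() + " columns but got " + parts.length);
        }
        List<Object> values = new ArrayList<>();
        int i = 0;
        for (String type : fields.values()) {
            String raw = parts[i++].trim();
            switch (type) {
                case "string":  values.add(raw); break;
                case "int32":   values.add(Integer.parseInt(raw)); break;
                case "int64":   values.add(Long.parseLong(raw)); break;
                case "double":  values.add(Double.parseDouble(raw)); break;
                case "boolean": values.add(Boolean.parseBoolean(raw)); break;
                default: throw new IllegalArgumentException("Unsupported type: " + type);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parseSpec("name:string,age:int32");
        System.out.println(toValues(fields, "alice,30")); // prints [alice, 30]
    }
}
```

In a Beam pipeline, the same spec could presumably drive a loop that calls `Schema.builder().addField(name, fieldType)` for each entry, and the resulting list could be passed to `Row.withSchema(schema).addValues(values).build()` inside the DoFn. The mapping from the spec's type names to `Schema.FieldType` constants would have to be written alongside it.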

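That said, to benefit from the SQL features you can also use a static schema containing the fields you query plus other fields. That way the static part lets you run your SQL and you don't lose the additional data.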
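One possible reading of this suggestion is a schema with the queried columns declared explicitly and a single catch-all string column for everything else. The sketch below shows only the splitting step in plain Java. The class name `QueryFieldsPlusExtras` and the idea of storing the remainder as one raw string are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keeps the first N columns separate and the rest as one raw string.
final class QueryFieldsPlusExtras {

    /** Returns queryFieldCount leading values followed by one trailing "extras" value. */
    static List<String> split(String line, int queryFieldCount) {
        // A limit of N + 1 leaves everything after the Nth comma unsplit in the last element.
        String[] parts = line.split(",", queryFieldCount + 1);
        if (parts.length < queryFieldCount) {
            throw new IllegalArgumentException(
                "Expected at least " + queryFieldCount + " columns: " + line);
        }
        List<String> values = new ArrayList<>();
        for (int i = 0; i < queryFieldCount; i++) {
            values.add(parts[i]);
        }
        // Empty string when the line has no additional columns.
        values.add(parts.length > queryFieldCount ? parts[queryFieldCount] : "");
        return values;
    }

    public static void main(String[] args) {
        List<String> values = split("alice,30,London,premium", 2);
        System.out.println(values.get(2)); // prints London,premium
    }
}
```

With the schema from the question, this would correspond to something like `Schema.builder().addStringField("name").addInt32Field("age").addStringField("extras").build()`, where `extras` is a made-up column name. SQL queries would reference only `name` and `age`, the `age` value would still need `Integer.parseInt` as in the question's code, and the untouched remainder travels along in `extras`.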

Romain

java

join

apache-kafka

apache-beam