数据流根据数据设置 BigQuery 分区

Dataflow to set BigQuery partition based on the data

我如何配置 Dataflow 以根据传入数据动态设置 BigQuery table 分区?

我采用的方法如下:

  • 为传入记录设置 window
  • 将window转换为table名称

    p.apply(PubsubIO.Read
                .subscription(subscription)
                .withCoder(TableRowJsonCoder.of())
            )
            .apply(Window.into(new TablePartitionWindowFn()) )
            .apply(BigQueryIO.Write
                           .to(new DayPartitionFunc(dataset, table))
                           .withSchema(schema)
                           .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            );
    

根据传入数据设置window,结束Instant可以忽略,因为起始值用于设置分区:

public class TablePartitionWindowFn extends NonMergingWindowFn<Object, IntervalWindow> {

private IntervalWindow assignWindow(AssignContext context) {
    TableRow source = (TableRow) context.element();
    String dttm_str = (String) source.get("DTTM");

    DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd").withZoneUTC();

    Instant start_point = Instant.parse(dttm_str,formatter);
    Instant end_point = start_point.withDurationAdded(1000, 1);

    return new IntervalWindow(start_point, end_point);
};

动态设置 table 分区:

public class DayPartitionFunc implements SerializableFunction<BoundedWindow, String> {

String destination = "";

public DayPartitionFunc(String dataset, String table) {
    this.destination = dataset + "." + table+ "$";
}

@Override
public String apply(BoundedWindow boundedWindow) {
    // The cast below is safe because CalendarWindows.days(1) produces IntervalWindows.
    String dayString = DateTimeFormat.forPattern("yyyyMMdd")
                                     .withZone(DateTimeZone.UTC)
                                     .print(((IntervalWindow) boundedWindow).start());
    return destination + dayString;
}}

如果有更好的方法可以达到同样的效果,请告诉我。