Beam: writing per-window element count with window boundaries
For a simple proof of concept, I'm trying to window click data into two-minute windows. All I want to do from there is write each window's count, along with the window's boundaries, to BigQuery. When running my pipeline, I keep getting the following error:
org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: java.io.IOException: Insert failed: [{"errors":[{"debugInfo":"","location":"windowend","message":"This field is not a record.","reason":"invalid"}],"index":0}]
The pipeline looks like this:
// Creating the pipeline
Pipeline p = Pipeline.create(options);

// Window items
PCollection<TableRow> counts = p.apply("ReadFromPubSub", PubsubIO.readStrings().fromTopic(options.getTopic()))
    .apply("AddEventTimestamps", WithTimestamps.of(TotalCountPipeline::ExtractTimeStamp)
        .withAllowedTimestampSkew(Duration.standardDays(10000)))
    .apply("Window", Window.<String>into(
            FixedWindows.of(Duration.standardHours(options.getWindowSize())))
        .triggering(
            AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardDays(10000))
        .accumulatingFiredPanes())
    .apply("CalculateSum", Combine.globally(Count.<String>combineFn()).withoutDefaults())
    .apply("BigQueryFormat", ParDo.of(new FormatCountsFn()));

// Writing to BigQuery
counts.apply("WriteToBigQuery", BigQueryIO.writeTableRows()
    .to(options.getOutputTable())
    .withSchema(getSchema())
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

// Execute pipeline
p.run().waitUntilFinish();
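
(The getSchema() helper referenced above isn't shown here. For context, a minimal version consistent with the row fields set below might look like the following sketch; the field names and types are assumptions, not the original code.)

// Hypothetical sketch of the getSchema() helper; not from the original post.
// Uses com.google.api.services.bigquery.model.TableSchema and TableFieldSchema.
private static TableSchema getSchema() {
    List<TableFieldSchema> fields = new ArrayList<>();
    // Assumed field names, matching the TableRow built in FormatCountsFn below.
    fields.add(new TableFieldSchema().setName("windowStart").setType("TIMESTAMP"));
    fields.add(new TableFieldSchema().setName("count").setType("INTEGER"));
    return new TableSchema().setFields(fields);
}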
I suspect it has to do with the BigQuery formatting function, which is implemented as follows:
static class FormatCountsFn extends DoFn<Long, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c, BoundedWindow window) {
        TableRow row =
            new TableRow()
                .set("windowStart", window.maxTimestamp().toDateTime())
                .set("count", c.element().intValue());
        c.output(row);
    }
}
Inspired by this post. Can anyone shed some light on this? I can't seem to wrap my head around it.
As it turns out, the answer to this question has nothing to do with Beam windowing at all, only with BigQuery: writing a DateTime to a BigQuery row requires a string in the proper yyyy-MM-dd HH:mm:ss format, as opposed to the DateTime object I was providing.
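
For concreteness, a minimal sketch of that fix, rendering the window boundary as a string with Joda-Time (which Beam already uses for timestamps); the formatter setup here is my own, not the exact original code:

// Sketch of the fix: format the window boundary as a yyyy-MM-dd HH:mm:ss string
// instead of passing a Joda DateTime object to the TableRow.
// Uses org.joda.time.format.DateTimeFormat and DateTimeFormatter.
static class FormatCountsFn extends DoFn<Long, TableRow> {
    private static final DateTimeFormatter FORMATTER =
        DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss").withZoneUTC();

    @ProcessElement
    public void processElement(ProcessContext c, BoundedWindow window) {
        TableRow row =
            new TableRow()
                // window.maxTimestamp() returns a Joda Instant, which the
                // formatter prints in the format BigQuery expects for TIMESTAMP.
                .set("windowStart", FORMATTER.print(window.maxTimestamp()))
                .set("count", c.element().intValue());
        c.output(row);
    }
}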