Issues extracting data from BigQuery on the second run using Dataflow [Apache Beam]

Question

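I need to use Dataflow to extract data from a BigQuery table and write it to a GCS bucket.
The Dataflow pipeline is built with Apache Beam (Java). On its first run, it extracts from BigQuery and writes to GCS without any problems.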

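However, once the first pipeline run has succeeded and a second Dataflow job is started to extract data from the same table, it does not extract any data from BigQuery. The only error I see in the Stackdriver logs is: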

"Request failed with code 409, performed 0 retries due to IOExceptions, performed 0 retries due to unsuccessful status codes, HTTP framework says request can be retried, (caller responsible for retrying): https://www.googleapis.com/bigquery/v2/projects/dataflow-begining/jobs"

The sample code I use for the extraction is:

 pipeline.apply("Extract from BQ", BigQueryIO.readTableRows().fromQuery("SELECT * from bq_test.employee"))

Any help is appreciated.

Answer 1

I have seen this before when using templates. According to the BigQueryIO documentation, in the "Usage with templates" section:

When using read() or readTableRows() in a template, it's required to specify BigQueryIO.Read.withTemplateCompatibility(). Specifying this in a non-template pipeline is not recommended because it has somewhat lower performance.

And in the withTemplateCompatibility section:

Use new template-compatible source implementation. This implementation is compatible with repeated template invocations.

If that is your situation, you should use:

pipeline.apply("Extract from BQ", BigQueryIO
        .readTableRows()
        .withTemplateCompatibility()
        .fromQuery("SELECT * from bq_test.employee"))
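As a rough illustration of why repeated template invocations can fail with a 409, here is a toy model in plain Java. It is not Beam's or BigQuery's actual code. It assumes (this is not stated in the question or the quoted documentation) that the failing request tries to create a BigQuery job whose ID was fixed when the template was built, so a second launch reuses an ID the service has already seen. The names `FakeJobService`, `idFixedAtConstruction` and `idGeneratedAtRuntime` are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;
import java.util.function.Supplier;

public class JobIdCollisionSketch {

    /** Toy stand-in for a jobs endpoint: rejects any job ID it has already seen. */
    public static final class FakeJobService {
        private final Set<String> seenJobIds = new HashSet<>();

        /** Returns 200 for a new job ID and 409 (conflict) for a duplicate. */
        public int insertJob(String jobId) {
            return seenJobIds.add(jobId) ? 200 : 409;
        }
    }

    /** Models a template: the supplier is built once, then used on each of two launches. */
    public static int[] launchTwice(Supplier<String> jobIdSupplier) {
        FakeJobService service = new FakeJobService();
        return new int[] {
            service.insertJob(jobIdSupplier.get()),
            service.insertJob(jobIdSupplier.get())
        };
    }

    /** Assumed non-template behaviour: the job ID is chosen once, at construction time. */
    public static Supplier<String> idFixedAtConstruction() {
        String fixedId = "job_" + UUID.randomUUID();
        return () -> fixedId;
    }

    /** Assumed template-compatible behaviour: a fresh job ID is chosen on every launch. */
    public static Supplier<String> idGeneratedAtRuntime() {
        return () -> "job_" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        int[] fixed = launchTwice(idFixedAtConstruction());
        int[] fresh = launchTwice(idGeneratedAtRuntime());
        System.out.println("ID fixed at construction: " + fixed[0] + ", " + fixed[1]);
        System.out.println("ID generated at runtime:  " + fresh[0] + ", " + fresh[1]);
    }
}
```

In this model the first launch succeeds either way, while the second launch is rejected only when the ID was fixed up front. If the assumption holds, that matches the symptom in the question: the first run works and later runs of the same template fail.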


google-bigquery

google-cloud-platform

google-cloud-dataflow

apache-beam