Best approach for BigQuery data transformations

Question

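I have terabytes of data stored in BigQuery and I want to run heavy data transformations on it.

Considering cost and performance, what approach would you suggest for performing these transformations so that the data can be used in BigQuery afterwards?

I'm considering a couple of options:
1. Read the raw data with DataFlow, then load the transformed data back into BigQuery?
2. Do it directly in BigQuery?

Any ideas on how to proceed with this?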

Answer 1

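I wrote down the most important points about performance; they also cover your question about using DataFlow.

Best practices for performance:

Choosing a file format:

BigQuery supports a wide variety of file formats for data ingestion, and some are naturally faster than others. When optimizing for load speed, prefer the Avro file format: it is a binary, row-based format that can be split and then read in parallel by multiple workers.

Loading data from compressed files, specifically CSV and JSON, is slower than loading data in other formats. The reason is that Gzip compression is not splittable, so BQ has to take the whole file, load it onto a single slot, decompress it, and only then parallelize the load.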

**FASTER**
Avro (Compressed)
Avro (Uncompressed)
Parquet/ORC
CSV
JSON
CSV (Compressed)
JSON (Compressed)
**SLOWER**
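To make the ranking above easier to apply, here is a minimal sketch that picks the fastest-loading option among the formats you can actually produce. The names `LOAD_SPEED_RANK` and `fastest_option` are made up for this illustration, and the ranks are just the positions in the list above, not measured numbers.

```python
# Rank taken from the list above: lower number = faster to load.
# Keys are (file format, is_compressed). Parquet and ORC share one rank.
LOAD_SPEED_RANK = {
    ("avro", True): 0,
    ("avro", False): 1,
    ("parquet", False): 2,
    ("orc", False): 2,
    ("csv", False): 3,
    ("json", False): 4,
    ("csv", True): 5,
    ("json", True): 6,
}


def fastest_option(options):
    """Return the (format, compressed) pair that ranks fastest to load."""
    known = [opt for opt in options if opt in LOAD_SPEED_RANK]
    if not known:
        raise ValueError("none of the options appear in the ranking")
    return min(known, key=LOAD_SPEED_RANK.__getitem__)


# Example: an export job that can write gzipped CSV, plain JSON, or Parquet.
choice = fastest_option([("csv", True), ("json", False), ("parquet", False)])
print(choice)
```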

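ELT/ETL:

After loading data into BQ, you can think about transformations (ELT or ETL). In general, prefer ELT over ETL where possible. BQ is very scalable and can handle large transformations over huge amounts of data. ELT is also quite a bit simpler, because you just write some SQL queries, transform the data, and move it between tables, without having to manage a separate ETL application.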
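As a rough illustration of the ELT approach, the sketch below builds a single `CREATE OR REPLACE TABLE ... AS SELECT` statement that transforms data inside BigQuery. The table names and the columns `user_id` and `event_timestamp` are assumptions invented for the example, and the function only produces the SQL text; it does not talk to BigQuery.

```python
def build_elt_statement(source_table: str, target_table: str) -> str:
    """Build a CREATE TABLE AS SELECT that aggregates raw events per user and day.

    The columns user_id and event_timestamp are hypothetical.
    """
    return (
        f"CREATE OR REPLACE TABLE `{target_table}` AS\n"
        "SELECT\n"
        "  user_id,\n"
        "  DATE(event_timestamp) AS event_date,\n"
        "  COUNT(*) AS event_count\n"
        f"FROM `{source_table}`\n"
        "GROUP BY user_id, event_date"
    )


# Hypothetical project, dataset and table names.
sql = build_elt_statement("my_project.raw.events", "my_project.reporting.daily_events")
print(sql)
```

If you use the `google-cloud-bigquery` Python client, a statement like this can be submitted with `Client().query(sql).result()`; the whole transformation then runs on BigQuery slots and no data leaves the warehouse.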

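Raw and staging tables:

Once you have started loading data into BQ, you will generally want to use raw and staging tables in your warehouse before publishing to reporting tables. The raw table essentially contains the full daily extract, or a full load of the data being ingested. The staging table is basically your change data capture table: you can use queries or DML to merge the raw data into it and keep a full history of all the data that has been inserted. Finally, your reporting tables are the ones you publish to your users.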
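One possible shape for the raw-to-staging step is a `MERGE` statement that appends every row version not yet present in staging, which keeps the history described above. This is only a sketch under assumptions: the table names, the key column, and the version column (for example an `updated_at` timestamp) are all hypothetical, and the function just builds the SQL text.

```python
def build_history_merge(raw_table, staging_table, key, version_column, columns):
    """Build a MERGE that inserts row versions missing from the staging table.

    Rows are matched on (key, version_column), so a changed row arrives as a
    new version and earlier versions stay in staging.
    """
    all_columns = [key, version_column, *columns]
    insert_cols = ", ".join(all_columns)
    insert_vals = ", ".join(f"R.{col}" for col in all_columns)
    return (
        f"MERGE `{staging_table}` AS S\n"
        f"USING `{raw_table}` AS R\n"
        f"ON S.{key} = R.{key} AND S.{version_column} = R.{version_column}\n"
        "WHEN NOT MATCHED THEN\n"
        f"  INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


# Hypothetical tables and columns.
print(build_history_merge(
    raw_table="my_project.raw.customers",
    staging_table="my_project.staging.customers",
    key="customer_id",
    version_column="updated_at",
    columns=["name", "country"],
))
```

A plain upsert (adding a `WHEN MATCHED THEN UPDATE SET ...` clause and matching on the key alone) is the alternative when you only need the latest state rather than the full history.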

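Speeding up pipelines using DataFlow:

When you get into streaming loads or really complex batch loads (ones that don't fit cleanly into SQL), you can use DataFlow or DataFusion to speed up those pipelines and do more complex processing on the data. If you are starting with streaming, I recommend the DataFlow templates: Google provides them for loading data from many different sources and moving data around. You can find them in the DataFlow UI under the Create Job from Template button. If a template mostly fits your use case but needs a slight modification, all of the templates are open source, so you can go to the repo and modify the code to fit your needs.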

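Partitioning:

Partitioning in BQ physically splits your data on disk, based on ingestion time or on a column in your data. This lets you efficiently query only the parts of the table you need, which brings huge cost and performance benefits, especially on large fact tables. Whenever you have a fact table or a temporal table, use a partition column on your date dimension.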
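As a small sketch of what a date-partitioned table could look like, the helper below builds a `CREATE TABLE ... PARTITION BY` statement. The table name and columns are assumptions for illustration, and the helper only handles the simple case of partitioning directly on a `DATE` column.

```python
def build_partitioned_table_ddl(table, columns, partition_column):
    """Build DDL for a table partitioned on a DATE column.

    columns maps column name -> BigQuery type, in the desired column order.
    """
    if columns.get(partition_column) != "DATE":
        raise ValueError("this sketch only supports partitioning on a DATE column")
    column_defs = ",\n".join(f"  {name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE `{table}` (\n"
        f"{column_defs}\n"
        ")\n"
        f"PARTITION BY {partition_column}"
    )


# Hypothetical fact table partitioned on its date dimension.
print(build_partitioned_table_ddl(
    "my_project.warehouse.sales",
    {"sale_date": "DATE", "customer_id": "STRING", "amount": "NUMERIC"},
    "sale_date",
))
```

Queries that filter on the partition column, for example `WHERE sale_date >= '2024-01-01'`, let BigQuery skip the partitions outside that range, which is where the cost savings come from.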

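Cluster frequently accessed fields:

Clustering lets you physically order the data within a partition, and you can cluster by one or more keys. When used properly, this brings massive performance benefits.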
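Clustering can be sketched in the same way: the helper below builds a `CREATE TABLE ... PARTITION BY ... CLUSTER BY ... AS SELECT` statement that copies an existing table into a partitioned and clustered one. Table and column names are hypothetical; the check on the number of clustering columns reflects BigQuery's limit of four per table.

```python
MAX_CLUSTERING_COLUMNS = 4  # BigQuery allows at most four clustering columns


def build_clustered_copy(source_table, target_table, partition_column, cluster_columns):
    """Build a CTAS statement that rewrites a table as partitioned and clustered."""
    if not cluster_columns:
        raise ValueError("at least one clustering column is required")
    if len(cluster_columns) > MAX_CLUSTERING_COLUMNS:
        raise ValueError(f"at most {MAX_CLUSTERING_COLUMNS} clustering columns are allowed")
    return (
        f"CREATE TABLE `{target_table}`\n"
        f"PARTITION BY {partition_column}\n"
        f"CLUSTER BY {', '.join(cluster_columns)} AS\n"
        f"SELECT * FROM `{source_table}`"
    )


# Hypothetical example: cluster on the fields most often used in filters.
print(build_clustered_copy(
    "my_project.warehouse.sales",
    "my_project.warehouse.sales_clustered",
    partition_column="sale_date",
    cluster_columns=["customer_id", "country"],
))
```

The order of the clustering columns matters, so it usually makes sense to list the most frequently filtered field first.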

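BQ reservations:

Reservations let you reserve slots and assign projects to those reservations, so you can allocate more or fewer resources to certain types of queries.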

You can find best practices for saving costs in the official documentation.

I hope this helps.

Answer 2

According to this Google Cloud Documentation, the following questions should be considered when choosing between DataFlow and BigQuery for ELT.

Although the data is small and can quickly be uploaded by using the BigQuery UI, for the purpose of this tutorial you can also use Dataflow for ETL. Use Dataflow for ETL into BigQuery instead of the BigQuery UI when you are performing massive joins, that is, from around 500-5000 columns of more than 10 TB of data, with the following goals:

You want to clean or transform your data as it's loaded into BigQuery, instead of storing it and joining afterwards. As a result, this approach also has lower storage requirements because data is only stored in BigQuery in its joined and transformed state.

You plan to do custom data cleansing (which cannot be simply achieved with SQL).

You plan to combine the data with data outside of the OLTP, such as logs or remotely accessed data, during the loading process.

You plan to automate testing and deployment of data-loading logic using continuous integration or continuous deployment (CI/CD).

You anticipate gradual iteration, enhancement, and improvement of the ETL process over time.

You plan to add data incrementally, as opposed to performing a one-time ETL.
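The criteria quoted above can be turned into a simple checklist. The sketch below is a deliberate simplification, not an official rule: the class `LoadGoals`, its field names, and the `suggest_tool` function are invented for this illustration, and the decision is reduced to "massive joins plus at least one of the listed goals".

```python
from dataclasses import dataclass, fields


@dataclass
class LoadGoals:
    """One flag per goal listed in the quoted documentation."""
    transform_while_loading: bool = False
    custom_cleansing_beyond_sql: bool = False
    combine_with_external_data: bool = False
    ci_cd_for_loading_logic: bool = False
    iterate_on_etl_over_time: bool = False
    incremental_loads: bool = False


def matching_goals(goals):
    """Return the names of the goals that are set to True."""
    return [f.name for f in fields(goals) if getattr(goals, f.name)]


def suggest_tool(goals, massive_joins):
    """Suggest Dataflow when massive joins come with at least one listed goal."""
    if massive_joins and matching_goals(goals):
        return "Dataflow"
    return "BigQuery"


goals = LoadGoals(custom_cleansing_beyond_sql=True, incremental_loads=True)
print(suggest_tool(goals, massive_joins=True), matching_goals(goals))
```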
