Long-running job on GCP Cloud Run

Question

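I am reading 10 million records from BigQuery, applying some transformations, and creating a .csv file, using Node.js.

The same .csv stream data is uploaded to an SFTP server.

This job takes approximately 5 to 6 hours to complete when run locally.

The solution has been deployed on GCP Cloud Run, but after 2 to 3 seconds Cloud Run closes the container with a 503 error.

Please find the GCP Cloud Run configuration below.

Autoscaling: max 1 container instance
CPU allocated: default
Memory allocated: 2Gi
Concurrency: 10
Request timeout: 900 seconds

Is GCP Cloud Run a good option for a long-running background process?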

Answer 1

You can try using an Apache Beam pipeline deployed via Cloud Dataflow. With Python, you can perform the task with the following steps:

Stage 1. Read the data from the BigQuery table.

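# Note: BigQuerySource is deprecated in newer Beam releases;
# beam.io.ReadFromBigQuery accepts the same query arguments.
beam.io.Read(beam.io.BigQuerySource(query=your_query, use_standard_sql=True))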

Stage 2. Write the result of stage 1 to a CSV file in a GCS bucket.

beam.io.WriteToText(file_path_prefix="",
                    file_name_suffix='.csv',
                    header='list of csv file headers')

Stage 3. Call a ParDo function that takes the CSV file created in stage 2 and uploads it to the SFTP server. You can refer to this link.
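The stages above leave two small glue pieces unstated, and the following is only a sketch of how they might look. Rows read from BigQuery typically arrive as Python dictionaries, so they probably need to be formatted as CSV lines before stage 2 writes them, and stage 3 needs some way to stream a large file to the SFTP server without loading it into memory. The names `COLUMNS`, `row_to_csv_line` and `copy_in_chunks` are hypothetical, and the column list is a placeholder.

```python
import csv
import io

# Hypothetical column order; replace with the fields your query returns.
COLUMNS = ["id", "name", "amount"]


def row_to_csv_line(row, columns=COLUMNS):
    """Format one row dict as a single CSV line without a trailing newline."""
    buffer = io.StringIO()
    # An empty line terminator is used because the text sink adds its own newline.
    writer = csv.writer(buffer, lineterminator="")
    writer.writerow([row.get(column, "") for column in columns])
    return buffer.getvalue()


def copy_in_chunks(source, destination, chunk_size=1024 * 1024):
    """Copy bytes between two file-like objects in fixed-size chunks.

    Returns the number of bytes copied. Only one chunk is held in memory
    at a time, which matters for multi-gigabyte exports.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        destination.write(chunk)
        total += len(chunk)
    return total
```

In a pipeline, the first helper could sit in a `beam.Map(row_to_csv_line)` step before `WriteToText`. The second could be called inside the stage 3 ParDo, with a handle to the GCS file as `source` and a remote file opened by an SFTP client as `destination` (for example, paramiko's `SFTPClient.open` returns a writable file-like object).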

Answer 2

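You might consider a serverless, event-driven approach:

Configure a Google Cloud Storage trigger so that a Cloud Function runs the transformation
Extract/export the BigQuery data to the bucket that triggers the CF - this is the fastest way to get data out of BigQuery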
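As a rough illustration of the trigger side, here is a minimal sketch of a storage-triggered function in Python. It assumes the first-generation background function signature `(event, context)` and the `bucket` and `name` fields of the Cloud Storage event payload; the function name `on_export_file` is hypothetical and the transformation itself is left out.

```python
def on_export_file(event, context):
    """Entry point for a Cloud Storage 'object finalized' trigger.

    `event` is the dict describing the uploaded object; `context` carries
    event metadata and is not used in this sketch.
    """
    bucket = event["bucket"]
    name = event["name"]

    # Ignore anything that is not part of the BigQuery CSV export.
    if not name.endswith(".csv"):
        return None

    # A real function would read this object, transform it, and forward
    # the result; here only the object URI is returned.
    return f"gs://{bucket}/{name}"
```

Each exported file fires one invocation, so the work is spread across many short-lived function instances instead of one long-running process.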

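Sometimes the data exported this way is too large to be processed by a Cloud Function in that form, because of limits such as the maximum execution time (currently 9 minutes) or the 2 GB memory limit. In that case you can split the raw data file into smaller pieces and/or push them to Pub/Sub and a storage mirror.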
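Splitting an oversized export into smaller pieces could look roughly like the sketch below. It assumes a line-oriented CSV file; `split_lines` and its parameters are hypothetical names, and the part size would have to be tuned to whatever one function invocation can handle.

```python
from itertools import islice


def split_lines(lines, max_lines_per_part, header=None):
    """Yield lists of at most `max_lines_per_part` data lines.

    If `header` is given, it is repeated at the top of every part so that
    each piece is a valid CSV file on its own.
    """
    iterator = iter(lines)
    while True:
        part = list(islice(iterator, max_lines_per_part))
        if not part:
            return
        yield [header] + part if header is not None else part
```

Because the input is consumed lazily, `lines` can be an open file object, and only one part is held in memory at a time.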

That said, we have used CF to process a billion records end to end in a few minutes, from building bloom filters to publishing the data to Aerospike.

Answer 3

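Is GCP Cloud Run a good option for a long-running background process?

It is not a good option, because your container is 'brought to life' by an incoming HTTP request, and as soon as the container responds (e.g. sends something back), Google assumes the processing of the request is finished and cuts off the CPU.

This would explain:

The solution has been deployed on GCP Cloud Run, but after 2 to 3 seconds Cloud Run closes the container with a 503 error.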

Answer 4

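You can use a VM instance with your container deployed on it and run the job there. At the end, kill or stop the VM.

Personally, though, I prefer serverless solutions and approaches such as Cloud Run. Long-running jobs on Cloud Run will come one day! Until then, you have to deal with the 60-minute limit or use another service.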
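To get a feel for what dealing with the 60-minute limit would mean for this job, here is a back-of-the-envelope sketch. It uses the figures from the question (10 million records, roughly 6 hours locally) and assumes constant throughput; the function name `plan_batches` and the safety factor are hypothetical.

```python
import math


def plan_batches(total_records, total_seconds, limit_seconds, safety=0.8):
    """Estimate how to split a job so that each batch fits in a time limit.

    Returns (records_per_batch, number_of_batches). `safety` keeps each
    batch below the limit by leaving some headroom for slow runs.
    """
    rate = total_records / total_seconds  # records per second seen locally
    per_batch = int(rate * limit_seconds * safety)
    batches = math.ceil(total_records / per_batch)
    return per_batch, batches


# Figures from the question: 10 million records in about 6 hours,
# against a 60-minute request limit.
per_batch, batches = plan_batches(10_000_000, 6 * 3600, 60 * 60)
print(per_batch, batches)
```

With 20% headroom this comes out at 8 requests of about 1.3 million records each, so the job would need to be resumable by record range to fit under the limit.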

As a workaround, I suggest you use Cloud Build. Yes, Cloud Build, for running any container. I wrote an article on this. I ran a Terraform container on Cloud Build, but in fact you can run any container.

Set the timeout correctly, take care of the default service account and its assigned role, and, something not yet available on Cloud Run, choose the number of CPUs (1, 8 or 32) for the processing to speed up your job.

Want a bonus? You have 120 minutes free per day and per billing account (note: not per project!)

Answer 5

I will try using Dataflow to create the .csv file from BigQuery and upload that file to GCS.

Answer 6

Update: October 2021

Cloud Run supports background activities.

Configure CPU to be always-allocated if you use background activities
Background activity is anything that happens after your HTTP response has been delivered. To determine whether there is background activity in your service that is not readily apparent, check your logs for anything that is logged after the entry for the HTTP request.

Configure CPU to be always-allocated
If you want to support background activities in your Cloud Run service, set your Cloud Run service CPU to be always allocated so you can run background activities outside of requests and still have CPU access.
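To make "background activity" concrete, here is a minimal sketch of a handler that answers right away and keeps working afterwards. It is framework-agnostic and uses only the standard library; `handle_request` and `job` are hypothetical names. On Cloud Run, the work in the thread would only keep CPU access after the response if the service CPU is set to always allocated.

```python
import threading


def handle_request(job, *args):
    """Start `job` in a background thread and respond immediately.

    Returns the HTTP status and body a handler would send, plus the
    thread so that callers (or tests) can wait for it.
    """
    worker = threading.Thread(target=job, args=args)
    worker.start()
    # The response goes out here, while `job` is still running.
    return 202, "accepted", worker
```

Anything `job` logs after the 202 response is exactly the kind of entry the documentation quoted above suggests looking for in the logs.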

node.js

google-bigquery

google-cloud-platform

google-cloud-run
