How to run a Python Google Cloud Dataflow job with a custom Docker image?

Question

I want to run a Python Google Cloud Dataflow job with a custom Docker image.

According to the documentation, this should be possible: https://beam.apache.org/documentation/runtime/environments/#testing-customized-images

To try this out, I set up a basic wordcount example pipeline, using the command-line options from the documentation, in this public repository: https://github.com/swartchris8/beam_wordcount_with_docker

I can run the wordcount job locally with the PortableRunner using the apachebeam/python3.6_sdk image, but I cannot do the same with Dataflow.

I am following the documentation as closely as I can. For the PortableRunner, my arguments are:

python -m wordcount --input wordcount.py \
--output counts \
--runner=PortableRunner \
--job_endpoint=embed \
--environment_config=apachebeam/python3.6_sdk

For Dataflow:

python -m wordcount --input wordcount.py \
--output gs://healx-pubmed-ingestion-tmp/test/wordcount/count/count \
--runner=DataflowRunner \
--project=healx-pubmed-ingestion \
--job_name=dataflow-wordcount-docker \
--temp_location=gs://healx-pubmed-ingestion-tmp/test/wordcount/tmp \
--experiment=beam_fn_api \
--sdk_location=/Users/chris/beam/sdks/python/container/py36/build/target/apache-beam.tar.gz \
--worker_harness_container_image=apachebeam/python3.6_sdk \
--region europe-west1 \
--zone europe-west1-c

See the linked repository for full details.

What am I doing wrong here, or is this feature not supported for Python jobs in Dataflow?

Answer 1

Unfortunately, Dataflow currently uses its own (incompatible) worker containers, but a fix is actively being worked on.

Answer 2

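You should be able to use custom containers with Dataflow by passing --experiment=use_runner_v2, which will soon be enabled by default. An example command line might look like this: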

pip install apache-beam[gcp]==2.24.0
python -m apache_beam.examples.wordcount \
--output gs://healx-pubmed-ingestion-tmp/test/wordcount/ \
--runner=DataflowRunner \
--project=healx-pubmed-ingestion \
--region europe-west1 \
--temp_location=gs://healx-pubmed-ingestion-tmp/test/wordcount/tmp \
--worker_harness_container_image=apache/beam_python3.6_sdk:2.24.0 \
--experiment=use_runner_v2
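
The same flags can also be assembled in Python before they are handed to the pipeline. This is a minimal sketch and not part of the original answer: build_dataflow_args is a hypothetical helper, and the project, bucket, and image values in the usage lines are placeholders. The resulting list has the shape that apache_beam.options.pipeline_options.PipelineOptions accepts as its flags argument.

```python
from typing import List


def build_dataflow_args(
    project: str,
    region: str,
    temp_location: str,
    output: str,
    container_image: str,
) -> List[str]:
    """Return Dataflow flags for a job that uses a custom SDK container.

    Hypothetical helper: it only mirrors the flags shown in the command above.
    """
    return [
        f"--output={output}",
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        # Flag name as used in this answer (Beam 2.24.0).
        f"--worker_harness_container_image={container_image}",
        # Per this answer, custom containers rely on Runner v2.
        "--experiment=use_runner_v2",
    ]


if __name__ == "__main__":
    # Placeholder values; replace them with your own project and bucket.
    args = build_dataflow_args(
        project="my-project",
        region="europe-west1",
        temp_location="gs://my-bucket/tmp",
        output="gs://my-bucket/wordcount/",
        container_image="apache/beam_python3.6_sdk:2.24.0",
    )
    print(" \\\n".join(args))
```

Keeping the flags in one function makes it easier to switch the container image without retyping the whole command.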

To customize the container, follow the instructions at https://beam.apache.org/documentation/runtime/environments/#customizing-container-images.
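
As a pre-flight check, the condition described in this answer can be sketched in a few lines of Python. This is only an illustration under the assumption, taken from this answer, that a custom container on Dataflow needs the use_runner_v2 experiment. The function names are hypothetical, nothing here is a Beam API, and only the --flag=value form is handled.

```python
from typing import List, Sequence


def experiments_in(args: Sequence[str]) -> List[str]:
    """Collect the values of --experiment=... / --experiments=... flags."""
    values = []
    for arg in args:
        name, sep, value = arg.partition("=")
        if sep and name in ("--experiment", "--experiments"):
            values.append(value)
    return values


def custom_image_without_runner_v2(args: Sequence[str]) -> bool:
    """Return True if a custom container is requested without use_runner_v2."""
    has_image = any(
        arg.partition("=")[0] == "--worker_harness_container_image"
        for arg in args
    )
    return has_image and "use_runner_v2" not in experiments_in(args)


if __name__ == "__main__":
    # Flags modelled on the Dataflow command from the question.
    question_args = [
        "--runner=DataflowRunner",
        "--experiment=beam_fn_api",
        "--worker_harness_container_image=apachebeam/python3.6_sdk",
    ]
    if custom_image_without_runner_v2(question_args):
        print("Custom container requested without --experiment=use_runner_v2")
```

Run against the flags from the question, the check reports the missing experiment; with the flags from this answer it stays silent.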


docker

google-cloud-dataflow

apache-beam

python-3.6
