How to resolve a 403 error inhibiting number of workers Google Cloud Dataflow can spin up?

I know there are related questions, but I have been working on this for hours. I am trying to use an Apache Beam pipeline with Google Cloud Dataflow, via tensorflow datasets, to process and download the cleaned version of the Common Crawl dataset called C4, distributing the workload across hundreds of workers. Following the instructions for generating big datasets with Apache Beam, I went through the Google Cloud Dataflow Quickstart instructions to set up my project, billing, credentials, etc. through the Google Cloud Console, then created a virtual environment, installed tensorflow and the Google Cloud SDK, and set my credentials with export GOOGLE_APPLICATION_CREDENTIALS="path/to/json/from/Google-Cloud". After setting the MY_BUCKET, MY_PROJECT, and MY_REGION variables, I ran:

pip install tfds-nightly[c4]
echo 'tfds-nightly[c4]' > /tmp/beam_requirements.txt
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=c4/en \
  --data_dir=gs://$MY_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service,region=$MY_REGION"

It starts running, but every 20 seconds I get a 403 error message: "Failed: Resize Instance Group Manager".

I am capped at 2 workers, even though the console output keeps saying it is trying to scale up to 1000:

.
.
.
I1014 15:43:05.446238 140556195309312 dataflow_runner.py:248] 2020-10-14T21:43:01.141Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
I1014 15:43:05.446516 140556195309312 dataflow_runner.py:248] 2020-10-14T21:43:01.171Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
I1014 15:49:42.444391 140556195309312 dataflow_runner.py:248] 2020-10-14T21:49:40.042Z: JOB_MESSAGE_BASIC: Autoscaling: Resizing worker pool from 1 to 2.
I1014 15:49:47.653243 140556195309312 dataflow_runner.py:248] 2020-10-14T21:49:45.542Z: JOB_MESSAGE_DETAILED: Autoscaling: Raised the number of workers to 2 based on
the rate of progress in the currently running stage(s).
I1014 15:51:37.070624 140556195309312 dataflow_runner.py:248] 2020-10-14T21:51:36.931Z: JOB_MESSAGE_BASIC: Autoscaling: Resizing worker pool from 2 to 1000.
I1014 16:39:49.413619 140556195309312 transport.py:179] Refreshing due to a 401 (attempt 1/2)
I1014 16:39:49.448023 140556195309312 client.py:795] Refreshing access_token
I1014 17:39:53.158122 140556195309312 transport.py:179] Refreshing due to a 401 (attempt 1/2)
I1014 17:39:53.191963 140556195309312 client.py:795] Refreshing access_token
I1014 18:39:54.347596 140556195309312 transport.py:179] Refreshing due to a 401 (attempt 1/2)
I1014 18:39:54.377913 140556195309312 client.py:795] Refreshing access_token
I1014 19:39:59.015963 140556195309312 transport.py:179] Refreshing due to a 401 (attempt 1/2)
I1014 19:39:59.051589 140556195309312 client.py:795] Refreshing access_token
.
.
.

Based on the 403 error and related questions, I believe this is a permissions issue, but I followed the instructions to create a service account and made it an "Owner". So if I set my credentials to the JSON file I got from the GC Console, why would I be missing any permissions? How can I verify my credentials, etc., so that these 403 errors stop and hundreds of workers spin up successfully?
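One way to sanity-check which credentials and roles are actually in play (a sketch, assuming the gcloud CLI is installed and $MY_PROJECT is set as above; these commands only read project state) is:

```shell
# Show which accounts gcloud knows about and which one is active
gcloud auth list

# List the IAM role bindings on the project; the service account from the
# GOOGLE_APPLICATION_CREDENTIALS JSON file should appear with roles/owner
gcloud projects get-iam-policy "$MY_PROJECT" \
  --flatten="bindings[].members" \
  --format="table(bindings.role, bindings.members)"
```

If the service account shows up with roles/owner, plain IAM permissions are unlikely to be the cause of the 403.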

Finally, I thought it might be a quota issue, but when checking quotas through the Google Cloud Console, the job does not appear to be hitting any of them. Each one has a green check mark and is nowhere near its limit.

QUOTA_FOR_INSTANCES refers to this quota, which is not visible on the quotas page. To raise it, increase your CPU quota:

"If you need more quota for VM instances, request more CPUs, since having more CPUs increases this quota."

You can also set max_num_workers to keep the number of VMs within your quota.
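For example (a sketch: the cap of 50 is an arbitrary value, everything else matches the original command), the option can be appended to the same --beam_pipeline_options string:

```shell
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=c4/en \
  --data_dir=gs://$MY_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service,region=$MY_REGION,max_num_workers=50"
```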

The 403 likely means permission was denied because you lacked sufficient quota, not because your service account was missing some other permission.
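A back-of-the-envelope check (the numbers below are hypothetical; substitute your actual regional CPU quota and worker machine type): the worker ceiling is roughly your CPU quota divided by vCPUs per worker, so scaling to 1000 single-vCPU workers would need a quota of at least 1000 CPUs.

```shell
# Hypothetical numbers: adjust to your actual quota and machine type
cpu_quota=24          # regional "CPUs" quota in the target region
cores_per_worker=1    # e.g. a 1-vCPU worker machine type
max_workers=$((cpu_quota / cores_per_worker))
echo "worker ceiling: $max_workers"   # far below the 1000 Dataflow asked for
```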