How to Run a Bash Script As an Initialization Action When Creating a Dataproc Cluster?
I want my Dataproc cluster to download a custom library I created. It cannot be installed via pip, so users have to clone it from Cloud Source Repositories and then run sudo python setup.py install. I tried putting these steps into a bash script as an initialization action; the cluster gets created without any problems, but I don't think the bash script is actually being run, because I don't notice any changes on the cluster.
Here is the bash script I want the cluster to run at initialization:
#! /bin/bash
# download jars
gsutil -m cp gs://dataproc-featurelib/spark-lib/*.jar .
# download credential files
gsutil -m cp gs://mlflow_feature_pipeline/secrets/*.json .
# install feature_library
gcloud source repos clone feature_library --project=<project_id>
cd feature_library
sudo python3 setup.py install
cd ../
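Before the cluster-creation command below can pick it up, this script has to live at the GCS path passed to --initialization-actions. A minimal upload sketch, assuming the local file is named dataproc_featurelib_init.sh:
# Upload the init script to the path referenced by --initialization-actions
gsutil cp dataproc_featurelib_init.sh \
    gs://dataproc-featurelib/initialization-scripts/dataproc_featurelib_init.sh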
Here is how I set up the cluster:
gcloud beta dataproc clusters create featurelib-cluster \
--zone=us-east1-b \
--master-machine-type n1-highmem-16 \
--worker-machine-type n1-highmem-16 \
--num-workers 4 \
--image-version 1.4-debian9 \
--initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh,gs://dataproc-featurelib/initialization-scripts/dataproc_featurelib_init.sh \
--metadata 'PIP_PACKAGES=google-cloud-storage hvac cryptography mlflow sqlalchemy snowflake-sqlalchemy snowflake-connector-python snowflake' \
--optional-components=ANACONDA \
--enable-component-gateway \
--project <project_id> \
--autoscaling-policy=featurelib-policy \
--tags feature-lib \
--no-address \
--subnet composer-us-east1 \
--bucket dataproc-featurelib
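To confirm whether an initialization action actually ran, it helps to inspect its log on the master node rather than relying on visible changes. A hedged debugging sketch (the cluster uses --no-address, so SSH is assumed to go through IAP or a machine inside the VPC; the exact log path may vary by image version):
# SSH into the master node (for a standard cluster its name is <cluster-name>-m)
gcloud compute ssh featurelib-cluster-m --zone=us-east1-b --tunnel-through-iap

# Each initialization action writes its stdout/stderr to a numbered log on the node
ls /var/log/dataproc-initialization-script-*.log
sudo tail -n 50 /var/log/dataproc-initialization-script-*.log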
I solved this by authorizing the service account. Sample bash script below:
#! /bin/bash
# download jars
gsutil -m cp gs://dataproc-featurelib/spark-lib/*.jar .
# download credential files
gsutil -m cp gs://mlflow_feature_pipeline/secrets/*.json .
# authenticate
gcloud config set account <gserviceaccount_email_id>
gcloud auth activate-service-account <gserviceaccount_email_id> --project=dao-aa-poc-uyim --key-file=<path_to_key_file>
# install package
gcloud source repos clone feature_library --project=<project_id>
cd feature_library
python3 setup.py install
cd ../
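To make a failed install show up clearly in the init-action log, a short sanity check can be appended to the end of the script; a sketch, assuming the package is importable as the module feature_library (a non-zero exit will cause cluster creation to fail):
# Assumption: the installed package exposes a module named feature_library
python3 -c "import feature_library" \
  || { echo "feature_library failed to install" >&2; exit 1; }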