Dataproc submit a Hadoop job via Python client
I am trying to use the Dataproc API by converting my gcloud commands into API calls, but I can't find a good example in the documentation.
%pip install google-cloud-dataproc
The only good example I found is this one, which works fine:
from google.cloud import dataproc_v1
client = dataproc_v1.ClusterControllerClient()
project_id = 'test-project'
region = 'global'
for element in client.list_clusters(project_id, region):
    print('Dataproc cluster name:', element.cluster_name)
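Note: the positional-argument style above matches older releases of google-cloud-dataproc. In 2.x and later the client methods take a single request object, and regional clusters need a regional API endpoint; a rough equivalent of the same listing, assuming a cluster in us-central1 rather than the global region:
from google.cloud import dataproc_v1

region = 'us-central1'  # assumed region for illustration
client = dataproc_v1.ClusterControllerClient(
    client_options={'api_endpoint': f'{region}-dataproc.googleapis.com:443'})
for cluster in client.list_clusters(request={'project_id': 'test-project', 'region': region}):
    print('Dataproc cluster name:', cluster.cluster_name)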
I need to convert the following gcloud command into Python code:
gcloud dataproc jobs submit hadoop --cluster "${CLUSTER_NAME}" \
--class com.mycompany.product.MyClass \
--jars "${JAR_FILE}" -- \
--job_venv=venv.zip \
--job_binary_path=venv/bin/python3.5 \
--job_executes program.py
This works:
from google.cloud import dataproc_v1

project_id = 'your project'
region = 'global'
# Define Job arguments:
job_args = ['--job_venv=venv.zip',
            '--job_binary_path=venv/bin/python3.5',
            '--job_executes program.py']
job_client = dataproc_v1.JobControllerClient()
# Create Hadoop Job (JAR_FILE is the GCS URI of your jar, e.g. 'gs://your-bucket/your.jar')
hadoop_job = dataproc_v1.types.HadoopJob(jar_file_uris=[JAR_FILE],
                                         main_class='com.mycompany.product.MyClass',
                                         args=job_args)
# Define Remote cluster to send Job
job_placement = dataproc_v1.types.JobPlacement()
job_placement.cluster_name = 'your_cluster_name'
# Define Job configuration
main_job = dataproc_v1.types.Job(hadoop_job=hadoop_job, placement=job_placement)
# Send job; submit_job returns the Job resource, including its server-assigned job_id
result = job_client.submit_job(project_id, region, main_job)
# Monitor in the Dataproc UI or poll the API to track status (see the sketch below)
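The comment above mentions tracking the job through another API call. A minimal polling sketch, assuming the same older positional-argument client as above, where result is the Job returned by submit_job and its server-assigned ID lives in result.reference.job_id (the 10-second interval is arbitrary):
import time

job_id = result.reference.job_id

# A Dataproc job is finished once it reaches DONE, ERROR, or CANCELLED.
terminal_states = {
    dataproc_v1.types.JobStatus.State.DONE,
    dataproc_v1.types.JobStatus.State.ERROR,
    dataproc_v1.types.JobStatus.State.CANCELLED,
}

while True:
    job = job_client.get_job(project_id, region, job_id)
    if job.status.state in terminal_states:
        print('Job finished in state:', job.status.state)
        break
    time.sleep(10)  # arbitrary polling interval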