How to access Dataproc cluster metadata?

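After creating a cluster, I am trying to retrieve the URLs of my optional components (without using the GCP dashboard). I am using the Dataproc Python API, more specifically the get_cluster() function.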

The function returns a lot of data, but I cannot find the Jupyter gateway URL or other metadata in it.

from google.cloud import dataproc_v1

project_id, cluster_name = '', ''
region = 'europe-west4'

client = dataproc_v1.ClusterControllerClient(
    client_options={
        'api_endpoint': '{}-dataproc.googleapis.com:443'.format(region)
    }
)

response = client.get_cluster(project_id, region, cluster_name)
print(response)

Has anyone solved this problem?

If you have followed this doc to set up Jupyter access by enabling Component Gateway, then you can access the web interfaces as described here. The trick is that the URLs are included in the API response for the v1beta2 version.

Few changes are needed in the code (there are no requirements other than the google-cloud-dataproc library). Just replace dataproc_v1 with dataproc_v1beta2 and access the endpoints with response.config.endpoint_config:
from google.cloud import dataproc_v1beta2

project_id, cluster_name = '', ''
region = 'europe-west4'

client = dataproc_v1beta2.ClusterControllerClient(
    client_options={
        'api_endpoint': '{}-dataproc.googleapis.com:443'.format(region)
    }
)

response = client.get_cluster(project_id, region, cluster_name)
print(response.config.endpoint_config)

In my case, I get:

http_ports {
  key: "HDFS NameNode"
  value: "https://REDACTED-dot-europe-west4.dataproc.googleusercontent.com/hdfs/dfshealth.html"
}
http_ports {
  key: "Jupyter"
  value: "https://REDACTED-dot-europe-west4.dataproc.googleusercontent.com/jupyter/"
}
http_ports {
  key: "JupyterLab"
  value: "https://REDACTED-dot-europe-west4.dataproc.googleusercontent.com/jupyter/lab/"
}
http_ports {
  key: "MapReduce Job History"
  value: "https://REDACTED-dot-europe-west4.dataproc.googleusercontent.com/jobhistory/"
}
http_ports {
  key: "Spark History Server"
  value: "https://REDACTED-dot-europe-west4.dataproc.googleusercontent.com/sparkhistory/"
}
http_ports {
  key: "Tez"
  value: "https://REDACTED-dot-europe-west4.dataproc.googleusercontent.com/apphistory/tez-ui/"
}
http_ports {
  key: "YARN Application Timeline"
  value: "https://REDACTED-dot-europe-west4.dataproc.googleusercontent.com/apphistory/"
}
http_ports {
  key: "YARN ResourceManager"
  value: "https://REDACTED-dot-europe-west4.dataproc.googleusercontent.com/yarn/"
}
enable_http_port_access: true
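
To pull out a single URL rather than print the whole block, the http_ports field can be treated as a string-to-string map. Below is a minimal sketch, assuming the field supports .items() as protobuf map fields do in the Python client; component_url is a hypothetical helper name, and sample_ports is a plain dict standing in for response.config.endpoint_config.http_ports so the snippet runs without a cluster.

```python
from typing import Mapping, Optional


def component_url(http_ports: Mapping[str, str], name: str) -> Optional[str]:
    """Return the gateway URL for a component, or None if it is absent.

    Hypothetical helper: keys are compared case-insensitively, so
    'jupyter' matches the 'Jupyter' key shown in the output above.
    """
    wanted = name.casefold()
    for key, url in http_ports.items():
        if key.casefold() == wanted:
            return url
    return None


# Stand-in for response.config.endpoint_config.http_ports (assumption:
# the real field can be iterated the same way as this dict).
sample_ports = {
    "Jupyter": "https://REDACTED-dot-europe-west4.dataproc.googleusercontent.com/jupyter/",
    "JupyterLab": "https://REDACTED-dot-europe-west4.dataproc.googleusercontent.com/jupyter/lab/",
}

print(component_url(sample_ports, "jupyter"))
```

With a live response, the call would presumably be component_url(response.config.endpoint_config.http_ports, "Jupyter").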

You need v1beta2.

Enable the component gateway in the cluster config:

'endpoint_config': {
    'enable_http_port_access': True
},

Then the answer above works:

client.get_cluster(project_id, region, cluster_name)
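
For context, here is a rough sketch of where the endpoint_config fragment could sit in a full cluster definition. It only builds the dictionary and does not call the API; build_cluster, 'my-project' and 'my-cluster' are hypothetical names, and the assumption is that a dictionary of this shape is what gets passed as the cluster argument when the cluster is created.

```python
def build_cluster(project_id: str, cluster_name: str) -> dict:
    """Return a cluster definition with Component Gateway enabled.

    Hypothetical helper: only the endpoint_config part comes from the
    answer above; a real definition would carry more settings.
    """
    return {
        'project_id': project_id,
        'cluster_name': cluster_name,
        'config': {
            'endpoint_config': {
                'enable_http_port_access': True
            },
        },
    }


# Placeholder identifiers, not real resources.
cluster = build_cluster('my-project', 'my-cluster')
print(cluster['config']['endpoint_config'])
```

If the flag is not set when the cluster is created, the endpoint_config in the get_cluster response would be expected to carry no URLs.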